
BOAZ BARAK

INTRODUCTION TO THEORETICAL COMPUTER SCIENCE

TEXTBOOK IN PREPARATION. AVAILABLE ON HTTPS://INTROTCS.ORG
Text available on https://github.com/boazbk/tcs - please post any issues there - thank you!

This version was compiled on Wednesday 26th August, 2020 18:10

Copyright © 2020 Boaz Barak

This work is licensed under a Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” license.

To Ravit, Alma and Goren.

Contents

Preface 9

Preliminaries 17

0 Introduction 19

1 Mathematical Background 37

2 Computation and Representation 73

I Finite computation 111

3 Defining computation 113

4 Syntactic sugar, and computing every function 149

5 Code as data, data as code 175

II Uniform computation 205

6 Functions with Infinite domains, Automata, and Regular expressions 207

7 Loops and infinity 241

8 Equivalent models of computation 271

9 Universality and uncomputability 315

10 Restricted computational models 347

11 Is every theorem provable? 365


III Efficient algorithms 385

12 Efficient computation: An informal introduction 387

13 Modeling running time 407

14 Polynomial-time reductions 439

15 NP, NP completeness, and the Cook-Levin Theorem 465

16 What if P equals NP? 483

17 Space bounded computation 503

IV Randomized computation 505

18 Probability Theory 101 507

19 Probabilistic computation 527

20 Modeling randomized computation 539

V Advanced topics 561

21 Cryptography 563

22 Proofs and algorithms 591

23 Quantum computing 593

VI Appendices 625

Contents (detailed)

Preface 9
0.1 To the student 10
0.1.1 Is the effort worth it? 11
0.2 To potential instructors 12
0.3 Acknowledgements 14

Preliminaries 17

0 Introduction 19
0.1 Integer multiplication: an example of an algorithm 20
0.2 Extended Example: A faster way to multiply (optional) 22
0.3 Algorithms beyond arithmetic 27
0.4 On the importance of negative results 28
0.5 Roadmap to the rest of this book 29
0.5.1 Dependencies between chapters 30
0.6 Exercises 32
0.7 Bibliographical notes 33

1 Mathematical Background 37
1.1 This chapter: a reader's manual 37
1.2 A quick overview of mathematical prerequisites 38
1.3 Reading mathematical texts 39
1.3.1 Definitions 40
1.3.2 Assertions: Theorems, lemmas, claims 40
1.3.3 Proofs 40
1.4 Basic discrete math objects 41
1.4.1 Sets 41
1.4.2 Special sets 42
1.4.3 Functions 44
1.4.4 Graphs 46
1.4.5 Logic operators and quantifiers 49
1.4.6 Quantifiers for summations and products 50
1.4.7 Parsing formulas: bound and free variables 50
1.4.8 Asymptotics and Big-$O$ notation 52


1.4.9 Some “rules of thumb” for Big-$O$ notation 53
1.5 Proofs 54
1.5.1 Proofs and programs 55
1.5.2 Proof writing style 55
1.5.3 Patterns in proofs 56
1.6 Extended example: Topological Sorting 58
1.6.1 Mathematical induction 60
1.6.2 Proving the result by induction 61
1.6.3 Minimality and uniqueness 63
1.7 This book: notation and conventions 65
1.7.1 Variable name conventions 66
1.7.2 Some idioms 67
1.8 Exercises 69
1.9 Bibliographical notes 71

2 Computation and Representation 73
2.1 Defining representations 75
2.1.1 Representing natural numbers 76
2.1.2 Meaning of representations (discussion) 78
2.2 Representations beyond natural numbers 78
2.2.1 Representing (potentially negative) integers 79
2.2.2 Two's complement representation (optional) 79
2.2.3 Rational numbers, and representing pairs of strings 80
2.3 Representing real numbers 82
2.4 Cantor's Theorem, countable sets, and string representations of the real numbers 83
2.4.1 Corollary: Boolean functions are uncountable 89
2.4.2 Equivalent conditions for countability 89
2.5 Representing objects beyond numbers 90
2.5.1 Finite representations 91
2.5.2 Prefix-free encoding 91
2.5.3 Making representations prefix-free 94
2.5.4 “Proof by Python” (optional) 95
2.5.5 Representing letters and text 97
2.5.6 Representing vectors, matrices, images 99
2.5.7 Representing graphs 99
2.5.8 Representing lists and nested lists 99
2.5.9 Notation 100
2.6 Defining computational tasks as mathematical functions 100
2.6.1 Distinguish functions from programs! 102
2.7 Exercises 104
2.8 Bibliographical notes 108

I Finite computation 111

3 Defining computation 113
3.1 Defining computation 115
3.2 Computing using AND, OR, and NOT 116
3.2.1 Some properties of AND and OR 118
3.2.2 Extended example: Computing XOR from AND, OR, and NOT 119
3.2.3 Informally defining “basic operations” and “algorithms” 121
3.3 Boolean Circuits 123
3.3.1 Boolean circuits: a formal definition 124
3.3.2 Equivalence of circuits and straight-line programs 127
3.4 Physical implementations of computing devices (digression) 130
3.4.1 Transistors 131
3.4.2 Logical gates from transistors 132
3.4.3 Biological computing 132
3.4.4 Cellular automata and the game of life 132
3.4.5 Neural networks 132
3.4.6 A computer made from marbles and pipes 133
3.5 The NAND function 134
3.5.1 NAND Circuits 135
3.5.2 More examples of NAND circuits (optional) 136
3.5.3 The NAND-CIRC programming language 138
3.6 Equivalence of all these models 140
3.6.1 Circuits with other gate sets 141
3.6.2 Specification vs. implementation (again) 142
3.7 Exercises 143
3.8 Bibliographical notes 146

4 Syntactic sugar, and computing every function 149
4.1 Some examples of syntactic sugar 151
4.1.1 User-defined procedures 151
4.1.2 Proof by Python (optional) 153
4.1.3 Conditional statements 154
4.2 Extended example: Addition and Multiplication (optional) 157
4.3 The LOOKUP function 158
4.3.1 Constructing a NAND-CIRC program for LOOKUP 159
4.4 Computing every function 161
4.4.1 Proof of NAND's Universality 162
4.4.2 Improving by a factor of $n$ (optional) 163


4.5 Computing every function: An alternative proof 165
4.6 The class SIZE$(T)$ 167
4.7 Exercises 170
4.8 Bibliographical notes 174

5 Code as data, data as code 175
5.1 Representing programs as strings 177
5.2 Counting programs, and lower bounds on the size of NAND-CIRC programs 178
5.2.1 Size hierarchy theorem (optional) 180
5.3 The tuples representation 182
5.3.1 From tuples to strings 183
5.4 A NAND-CIRC interpreter in NAND-CIRC 184
5.4.1 Efficient universal programs 185
5.4.2 A NAND-CIRC interpreter in “pseudocode” 186
5.4.3 A NAND interpreter in Python 187
5.4.4 Constructing the NAND-CIRC interpreter in NAND-CIRC 188
5.5 A Python interpreter in NAND-CIRC (discussion) 191
5.6 The physical extended Church-Turing thesis (discussion) 193
5.6.1 Attempts at refuting the PECTT 195
5.7 Recap of Part I: Finite Computation 199
5.8 Exercises 201
5.9 Bibliographical notes 203

II Uniform computation 205

6 Functions with Infinite domains, Automata, and Regular expressions 207
6.1 Functions with inputs of unbounded length 208
6.1.1 Varying inputs and outputs 209
6.1.2 Formal Languages 211
6.1.3 Restrictions of functions 211
6.2 Deterministic finite automata (optional) 212
6.2.1 Anatomy of an automaton (finite vs. unbounded) 215
6.2.2 DFA-computable functions 216
6.3 Regular expressions 217
6.3.1 Algorithms for matching regular expressions 221
6.4 Efficient matching of regular expressions (optional) 223
6.4.1 Matching regular expressions using DFAs 227
6.4.2 Equivalence of regular expressions and automata 228
6.4.3 Closure properties of regular expressions 230

6.5 Limitations of regular expressions and the pumping lemma 231
6.6 Answering semantic questions about regular expressions 236
6.7 Exercises 239
6.8 Bibliographical notes 240

7 Loops and infinity 241
7.1 Turing Machines 242
7.1.1 Extended example: A Turing machine for palindromes 244
7.1.2 Turing machines: a formal definition 245
7.1.3 Computable functions 247
7.1.4 Infinite loops and partial functions 248
7.2 Turing machines as programming languages 249
7.2.1 The NAND-TM Programming language 251
7.2.2 Sneak peek: NAND-TM vs Turing machines 254
7.2.3 Examples 254
7.3 Equivalence of Turing machines and NAND-TM programs 257
7.3.1 Specification vs implementation (again) 260
7.4 NAND-TM syntactic sugar 261
7.4.1 “GOTO” and inner loops 261
7.5 Uniformity, and NAND vs NAND-TM (discussion) 264
7.6 Exercises 265
7.7 Bibliographical notes 268

8 Equivalent models of computation 271
8.1 RAM machines and NAND-RAM 273
8.2 The gory details (optional) 277
8.2.1 Indexed access in NAND-TM 277
8.2.2 Two dimensional arrays in NAND-TM 279
8.2.3 All the rest 279
8.3 Turing equivalence (discussion) 280
8.3.1 The “Best of both worlds” paradigm 281
8.3.2 Let's talk about abstractions 281
8.3.3 Turing completeness and equivalence, a formal definition (optional) 283
8.4 Cellular automata 284
8.4.1 One dimensional cellular automata are Turing complete 286
8.4.2 Configurations of Turing machines and the next-step function 287

8.5 Lambda calculus and functional programming languages 290
8.5.1 Applying functions to functions 290
8.5.2 Obtaining multi-argument functions via Currying 291
8.5.3 Formal description of the λ calculus 292
8.5.4 Infinite loops in the λ calculus 295
8.6 The “Enhanced” λ calculus 295
8.6.1 Computing a function in the enhanced λ calculus 297
8.6.2 Enhanced λ calculus is Turing-complete 298
8.7 From enhanced to pure λ calculus 301
8.7.1 List processing 302
8.7.2 The Y combinator, or recursion without recursion 303
8.8 The Church-Turing Thesis (discussion) 306
8.8.1 Different models of computation 307
8.9 Exercises 308
8.10 Bibliographical notes 312

9 Universality and uncomputability 315
9.1 Universality or a meta-circular evaluator 316
9.1.1 Proving the existence of a universal Turing Machine 318
9.1.2 Implications of universality (discussion) 320
9.2 Is every function computable? 321
9.3 The Halting problem 323
9.3.1 Is the Halting problem really hard? (discussion) 326
9.3.2 A direct proof of the uncomputability of HALT (optional) 327
9.4 Reductions 329
9.4.1 Example: Halting on the zero problem 330
9.5 Rice's Theorem and the impossibility of general software verification 333
9.5.1 Rice's Theorem 335
9.5.2 Halting and Rice's Theorem for other Turing-complete models 339
9.5.3 Is software verification doomed? (discussion) 340
9.6 Exercises 342
9.7 Bibliographical notes 345

10 Restricted computational models 347
10.1 Turing completeness as a bug 347
10.2 Context free grammars 349
10.2.1 Context-free grammars as a computational model 351
10.2.2 The power of context free grammars 353
10.2.3 Limitations of context-free grammars (optional) 355

10.3 Semantic properties of context free languages 357
10.3.1 Uncomputability of context-free grammar equivalence (optional) 357
10.4 Summary of semantic properties for regular expressions and context-free grammars 360
10.5 Exercises 361
10.6 Bibliographical notes 362

11 Is every theorem provable? 365
11.1 Hilbert's Program and Gödel's Incompleteness Theorem 365
11.1.1 Defining “Proof Systems” 367
11.2 Gödel's Incompleteness Theorem: Computational variant 368
11.3 Quantified integer statements 370
11.4 Diophantine equations and the MRDP Theorem 372
11.5 Hardness of quantified integer statements 374
11.5.1 Step 1: Quantified mixed statements and computation histories 375
11.5.2 Step 2: Reducing mixed statements to integer statements 378
11.6 Exercises 380
11.7 Bibliographical notes 382

III Efficient algorithms 385

12 Efficient computation: An informal introduction 387
12.1 Problems on graphs 389
12.1.1 Finding the shortest path in a graph 390
12.1.2 Finding the longest path in a graph 392
12.1.3 Finding the minimum cut in a graph 392
12.1.4 Min-Cut Max-Flow and Linear programming 393
12.1.5 Finding the maximum cut in a graph 395
12.1.6 A note on convexity 395
12.2 Beyond graphs 397
12.2.1 SAT 397
12.2.2 Solving linear equations 398
12.2.3 Solving quadratic equations 399
12.3 More advanced examples 399
12.3.1 Determinant of a matrix 399
12.3.2 Permanent of a matrix 401
12.3.3 Finding a zero-sum equilibrium 401
12.3.4 Finding a Nash equilibrium 402
12.3.5 Primality testing 402
12.3.6 Integer factoring 403

12.4 Our current knowledge 403
12.5 Exercises 404
12.6 Bibliographical notes 404
12.7 Further explorations 405

13 Modeling running time 407
13.1 Formally defining running time 409
13.1.1 Polynomial and Exponential Time 410
13.2 Modeling running time using RAM Machines / NAND-RAM 412
13.3 Extended Church-Turing Thesis (discussion) 417
13.4 Efficient universal machine: a NAND-RAM interpreter in NAND-RAM 418
13.4.1 Timed Universal Turing Machine 420
13.5 The time hierarchy theorem 421
13.6 Non uniform computation 425
13.6.1 Oblivious NAND-TM programs 427
13.6.2 “Unrolling the loop”: algorithmic transformation of Turing Machines to circuits 430
13.6.3 Can uniform algorithms simulate non uniform ones? 432
13.6.4 Uniform vs. Nonuniform computation: A recap 434
13.7 Exercises 435
13.8 Bibliographical notes 438

14 Polynomial-time reductions 439
14.1 Formal definitions of problems 440
14.2 Polynomial-time reductions 441
14.2.1 Whistling pigs and flying horses 442
14.3 Reducing 3SAT to zero one and quadratic equations 444
14.3.1 Quadratic equations 447
14.4 The independent set problem 449
14.5 Some exercises and anatomy of a reduction 452
14.5.1 Dominating set 453
14.5.2 Anatomy of a reduction 456
14.6 Reducing Independent Set to Maximum Cut 458
14.7 Reducing 3SAT to Longest Path 459
14.7.1 Summary of relations 462
14.8 Exercises 462
14.9 Bibliographical notes 463

15 NP, NP completeness, and the Cook-Levin Theorem 465
15.1 The class NP 465
15.1.1 Examples of functions in NP 468
15.1.2 Basic facts about NP 470

15.2 From NP to 3SAT: The Cook-Levin Theorem 471
15.2.1 What does this mean? 473
15.2.2 The Cook-Levin Theorem: Proof outline 473
15.3 The NANDSAT Problem, and why it is NP hard 474
15.4 The 3NAND problem 476
15.5 From 3NAND to 3SAT 479
15.6 Wrapping up 480
15.7 Exercises 481
15.8 Bibliographical notes 482

16 What if P equals NP? 483
16.1 Search-to-decision reduction 484
16.2 Optimization 487
16.2.1 Example: Supervised learning 490
16.2.2 Example: Breaking cryptosystems 491
16.3 Finding mathematical proofs 491
16.4 Quantifier elimination (advanced) 492
16.4.1 Application: self improving algorithm for 3SAT 495
16.5 Approximating counting problems and posterior sampling (advanced, optional) 495
16.6 What does all of this imply? 496
16.7 Can P ≠ NP be neither true nor false? 498
16.8 Is P = NP “in practice”? 499
16.9 What if P ≠ NP? 500
16.10 Exercises 502
16.11 Bibliographical notes 502

17 Space bounded computation 503
17.1 Exercises 503
17.2 Bibliographical notes 503

IV Randomized computation 505

18 Probability Theory 101 507
18.1 Random coins 507
18.1.1 Random variables 510
18.1.2 Distributions over strings 512
18.1.3 More general sample spaces 512
18.2 Correlations and independence 513
18.2.1 Independent random variables 514
18.2.2 Collections of independent random variables 516
18.3 Concentration and tail bounds 516
18.3.1 Chebyshev's Inequality 518
18.3.2 The Chernoff bound 519

18.3.3 Application: Supervised learning and empirical risk minimization 520
18.4 Exercises 522
18.5 Bibliographical notes 525

19 Probabilistic computation 527
19.1 Finding approximately good maximum cuts 528
19.1.1 Amplifying the success of randomized algorithms 529
19.1.2 Success amplification 530
19.1.3 Two-sided amplification 531
19.1.4 What does this mean? 531
19.1.5 Solving SAT through randomization 532
19.1.6 Bipartite matching 534
19.2 Exercises 537
19.3 Bibliographical notes 537
19.4 Acknowledgements 537

20 Modeling randomized computation 539
20.1 Modeling randomized computation 540
20.1.1 An alternative view: random coins as an “extra input” 542
20.1.2 Success amplification of two-sided error algorithms 544
20.2 BPP and NP completeness 545
20.3 The power of randomization 547
20.3.1 Solving BPP in exponential time 547
20.3.2 Simulating randomized algorithms by circuits 547
20.4 Derandomization 549
20.4.1 Pseudorandom generators 550
20.4.2 From existence to constructivity 552
20.4.3 Usefulness of pseudorandom generators 553
20.5 P = NP and BPP vs P 554
20.6 Non-constructive existence of pseudorandom generators (advanced, optional) 557
20.7 Exercises 560
20.8 Bibliographical notes 560

V Advanced topics 561

21 Cryptography 563
21.1 Classical cryptosystems 564
21.2 Defining encryption 566
21.3 Defining security of encryption 567
21.4 Perfect secrecy 569

21.4.1 Example: Perfect secrecy in the battlefield 570
21.4.2 Constructing perfectly secret encryption 571
21.5 Necessity of long keys 572
21.6 Computational secrecy 574
21.6.1 Stream ciphers or the “derandomized one-time pad” 576
21.7 Computational secrecy and NP 579
21.8 Public key cryptography 581
21.8.1 Defining public key encryption 583
21.8.2 Diffie-Hellman key exchange 584
21.9 Other security notions 586
21.10 Magic 586
21.10.1 Zero knowledge proofs 586
21.10.2 Fully homomorphic encryption 587
21.10.3 Multiparty secure computation 587
21.11 Exercises 589
21.12 Bibliographical notes 589

22 Proofs and algorithms 591
22.1 Exercises 591
22.2 Bibliographical notes 591

23 Quantum computing 593
23.1 The double slit experiment 594
23.2 Quantum amplitudes 594
23.2.1 Linear algebra quick review 597
23.3 Bell's Inequality 598
23.4 Quantum weirdness 600
23.5 Quantum computing and computation - an executive summary 600
23.6 Quantum systems 602
23.6.1 Quantum amplitudes 604
23.6.2 Quantum systems: an executive summary 605
23.7 Analysis of Bell's Inequality (optional) 605
23.8 Quantum computation 608
23.8.1 Quantum circuits 608
23.8.2 QNAND-CIRC programs (optional) 611
23.8.3 Uniform computation 612
23.9 Physically realizing quantum computation 613
23.10 Shor's Algorithm: Hearing the shape of prime factors 614
23.10.1 Period finding 614
23.10.2 Shor's Algorithm: A bird's eye view 615
23.11 Quantum Fourier Transform (advanced, optional) 617

23.11.1 Quantum Fourier Transform over the Boolean Cube: Simon's Algorithm 619
23.11.2 From Fourier to Period finding: Simon's Algorithm (advanced, optional) 620
23.11.3 From Simon to Shor (advanced, optional) 621
23.12 Exercises 623
23.13 Bibliographical notes 623

VI Appendices 625

Preface

“We make ourselves no promises, but we cherish the hope that the unobstructed pursuit of useless knowledge will prove to have consequences in the future as in the past” … “An institution which sets free successive generations of human souls is amply justified whether or not this graduate or that makes a so-called useful contribution to human knowledge. A poem, a symphony, a painting, a mathematical truth, a new scientific fact, all bear in themselves all the justification that universities, colleges, and institutes of research need or require”, Abraham Flexner, The Usefulness of Useless Knowledge, 1939.

“I suggest that you take the hardest courses that you can, because you learn the most when you challenge yourself… CS 121 I found pretty hard.”, Mark Zuckerberg, 2005.

This is a textbook for an undergraduate introductory course on Theoretical Computer Science. The educational goals of this book are to convey the following:

• That computation arises in a variety of natural and human-made systems, and not only in modern silicon-based computers.

• Similarly, beyond being an extremely important tool, computation also serves as a useful lens to describe natural, physical, mathematical and even social concepts.

• The notion of universality of many different computational models, and the related notion of the duality between code and data.

• The idea that one can precisely define a mathematical model of computation, and then use that to prove (or sometimes only conjecture) lower bounds and impossibility results.

• Some of the surprising results and discoveries in modern theoretical computer science, including the prevalence of NP-completeness, the power of interaction, the power of randomness on one hand and the possibility of derandomization on the other, the ability to use hardness “for good” in cryptography, and the fascinating possibility of quantum computing.


I hope that following this course, students would be able to recognize computation, with both its power and pitfalls, as it arises in various settings, including seemingly “static” content or “restricted” formalisms such as macros and scripts. They should be able to follow through the logic of proofs about computation, including the central concept of a reduction, as well as understanding “self-referential” proofs (such as diagonalization-based proofs that involve programs given their own code as input). Students should understand that some problems are inherently intractable, and be able to recognize the potential for intractability when they are faced with a new problem. While this book only touches on cryptography, students should understand the basic idea of how we can use computational hardness for cryptographic purposes. However, more than any specific skill, this book aims to introduce students to a new way of thinking of computation as an object in its own right and to illustrate how this new way of thinking leads to far-reaching insights and applications.

My aim in writing this text is to try to convey these concepts in the simplest possible way and try to make sure that the formal notation and model help elucidate, rather than obscure, the main ideas. I also tried to take advantage of modern students' familiarity with (or at least interest in!) programming, and hence use (highly simplified) programming languages to describe our models of computation. That said, this book does not assume fluency with any particular programming language, but rather only some familiarity with the general notion of programming. We will use programming metaphors and idioms, occasionally mentioning specific programming languages such as Python, C, or Lisp, but students should be able to follow these descriptions even if they are not familiar with these languages.

Proofs in this book, including the existence of a universal Turing Machine, the fact that every finite function can be computed by some circuit, the Cook-Levin theorem, and many others, are often constructive and algorithmic, in the sense that they ultimately involve transforming one program to another. While it is possible to follow these proofs without seeing the code, I do think that having access to the code, and the ability to play around with it and see how it acts on various programs, can make these theorems more concrete for the students. To that end, an accompanying website (which is still work in progress) allows executing programs in the various computational models we define, as well as seeing constructive proofs of some of the theorems.

0.1 TO THE STUDENT

This book can be challenging, mainly because it brings together a variety of ideas and techniques in the study of computation. There

are quite a few technical hurdles to master, whether it is following the diagonalization argument for proving the Halting Problem is undecidable, combinatorial gadgets in NP-completeness reductions, analyzing probabilistic algorithms, or arguing about the adversary to prove the security of cryptographic primitives. The best way to engage with this material is to read these notes actively, so make sure you have a pen ready. While reading, I encourage you to stop and think about the following:

• When I state a theorem, stop and take a shot at proving it on your own before reading the proof. You will be amazed by how much better you can understand a proof even after only 5 minutes of attempting it on your own.

• When reading a definition, make sure that you understand what the definition means, and what the natural examples are of objects that satisfy it and objects that do not. Try to think of the motivation behind the definition, and whether there are other natural ways to formalize the same concept.

• Actively notice which questions arise in your mind as you read the text, and whether or not they are answered in the text.

As a general rule, it is more important that you understand the definitions than the theorems, and it is more important that you understand a theorem statement than its proof. After all, before you can prove a theorem, you need to understand what it states, and to understand what a theorem is about, you need to know the definitions of the objects involved. Whenever a proof of a theorem is at least somewhat complicated, I provide a “proof idea.” Feel free to skip the actual proof in a first reading, focusing only on the proof idea. This book contains some code snippets, but this is by no means a programming text. You don’t need to know how to program to follow this material. The reason we use code is that it is a precise way to describe computation. Particular implementation details are not as important to us, and so we will emphasize code readability at the expense of considerations such as error handling, encapsulation, etc. that can be extremely important for real-world programming.

0.1.1 Is the effort worth it?

This is not an easy book, and you might reasonably wonder why you should spend the effort in learning this material. A traditional justification for a “theory of computation” course is that you might encounter these concepts later on in your career. Perhaps you will come across a hard problem and realize it is NP complete, or find a need to use what you learned about regular expressions. This might

very well be true, but the main benefit of this book is not in teaching you any practical tool or technique, but instead in giving you a different way of thinking: an ability to recognize computational phenomena even when they occur in non-obvious settings, a way to model computational tasks and questions, and to reason about them. Regardless of any use you will derive from this book, I believe learning this material is important because it contains concepts that are both beautiful and fundamental. The role that energy and matter played in the 20th century is played in the 21st by computation and information, not just as tools for our technology and economy, but also as the basic building blocks we use to understand the world. This book will give you a taste of some of the theory behind those, and hopefully spark your curiosity to study more.

0.2 TO POTENTIAL INSTRUCTORS

I wrote this book for my Harvard course, but I hope that other lecturers will find it useful as well. To some extent, it is similar in content to “Theory of Computation” or “Great Ideas” courses such as those taught at CMU or MIT.

The most significant difference between our approach and more traditional ones (such as Hopcroft and Ullman's [HU69; HU79] and Sipser's [Sip97]) is that we do not start with finite automata as our initial computational model. Instead, our initial computational model is Boolean Circuits. (An earlier book that starts with circuits as the initial model is John Savage's [Sav98].) We believe that Boolean Circuits are more fundamental to the theory of computing (and even its practice!) than automata. In particular, Boolean Circuits are a prerequisite for many concepts that one would want to teach in a modern course on Theoretical Computer Science, including cryptography, quantum computing, derandomization, attempts at proving P ≠ NP, and more. Even in cases where Boolean Circuits are not strictly required, they can often offer significant simplifications (as in the case of the proof of the Cook-Levin Theorem).

Furthermore, I believe there are pedagogical reasons to start with Boolean circuits as opposed to finite automata. Boolean circuits are a more natural model of computation, and one that corresponds more closely to computing in silicon, making the connection to practice more immediate to the students. Finite functions are arguably easier to grasp than infinite ones, as we can fully write down their truth table. The theorem that every finite function can be computed by some Boolean circuit is both simple enough and important enough to serve as an excellent starting point for this course. Moreover, many of the main conceptual points of the theory of computation, including the notions of the duality between code and data, and the idea of universality, can already be seen in this context.

After Boolean circuits, we move on to Turing machines and prove results such as the existence of a universal Turing machine, the uncomputability of the halting problem, and Rice's Theorem. Automata are discussed after we see Turing machines and undecidability, as an example for a restricted computational model where problems such as determining halting can be effectively solved.

While this is not our motivation, the order we present circuits, Turing machines, and automata roughly corresponds to the chronological order of their discovery. Boolean algebra goes back to Boole's and DeMorgan's works in the 1840s [Boo47; De 47] (though the definition of Boolean circuits and the connection to physical computation was given 90 years later by Shannon [Sha38]). Alan Turing defined what we now call “Turing Machines” in the 1930s [Tur37], while finite automata were introduced in the 1943 work of McCulloch and Pitts [MP43] but only really understood in the seminal 1959 work of Rabin and Scott [RS59].

More importantly, while models such as finite-state machines, regular expressions, and context-free grammars are incredibly important for practice, the main applications for these models (whether it is for parsing, for analyzing properties such as liveness and safety, or even for software-defined routing tables) rely crucially on the fact that these are tractable models for which we can effectively answer semantic questions. This practical motivation can be better appreciated after students see the undecidability of semantic properties of general computing models.

The fact that we start with circuits makes proving the Cook-Levin Theorem much easier. In fact, our proof of this theorem can be (and is) done using a handful of lines of Python. Combining this proof with the standard reductions (which are also implemented in Python) allows students to appreciate visually how a question about computation can be mapped into a question about (for example) the existence of an independent set in a graph.

Some other differences between this book and previous texts are the following:

1. For measuring time complexity, we use the standard RAM machine model used (implicitly) in algorithms courses, rather than Turing machines. While these two models are of course polynomially equivalent, and hence make no difference for the definitions of the classes P, NP, and EXP, our choice makes the distinction between notions such as $O(n)$ or $O(n^2)$ time more meaningful. This choice also ensures that these finer-grained time complexity classes correspond to the informal definitions of linear and quadratic time that

students encounter in their algorithms lectures (or their whiteboard coding interviews…).

2. We use the terminology of functions rather than languages. That is, rather than saying that a Turing Machine $M$ decides a language $L \subseteq \{0,1\}^*$, we say that it computes a function $F : \{0,1\}^* \to \{0,1\}$. The terminology of “languages” arises from Chomsky's work [Cho56], but it is often more confusing than illuminating. The language terminology also makes it cumbersome to discuss concepts such as algorithms that compute functions with more than one bit of output (including basic tasks such as addition, multiplication, etc…). The fact that we use functions rather than languages means we have to be extra vigilant about students distinguishing between the specification of a computational task (e.g., the function) and its implementation (e.g., the program). On the other hand, this point is so important that it is worth repeatedly emphasizing and drilling into the students, regardless of the notation used. The book does mention the language terminology and reminds students of it occasionally, to make it easier for students to consult outside resources.

Reducing the time dedicated to finite automata and context-free languages allows instructors to spend more time on topics that a modern course in the theory of computing needs to touch upon. These include randomness and computation, the interactions between proofs and programs (including Gödel's incompleteness theorem, interactive proof systems, and even a bit on the λ-calculus and the Curry-Howard correspondence), cryptography, and quantum computing.

This book contains sufficient detail to enable its use for self-study. Toward that end, every chapter starts with a list of learning objectives, ends with a recap, and is peppered with “pause boxes” which encourage students to stop and work out an argument or make sure they understand a definition before continuing further.

Section 0.5 contains a “roadmap” for this book, with descriptions of the different chapters, as well as the dependency structure between them. This can help in planning a course based on this book.

0.3 ACKNOWLEDGEMENTS

This text is continually evolving, and I am getting input from many people, for which I am deeply grateful. Salil Vadhan co-taught with me the first iteration of this course and gave me a tremendous amount of useful feedback and insights during this process. Michele Amoretti and Marika Swanberg carefully read several chapters of this text and gave extremely helpful detailed comments. Dave Evans and Richard Xu contributed many pull requests fixing errors and improving phrasing.

Thanks to Anil Ada, Venkat Guruswami, and Ryan O'Donnell for helpful tips from their experience in teaching CMU 15-251. Thanks to everyone that sent me comments, typo reports, or posted issues or pull requests on the GitHub repository https://github.com/boazbk/tcs. In particular I would like to acknowledge helpful feedback from Scott Aaronson, Michele Amoretti, Aadi Bajpai, Marguerite Basta, Anindya Basu, Sam Benkelman, Jarosław Błasiok, Emily Chan, Christy Cheng, Michelle Chiang, Daniel Chiu, Chi-Ning Chou, Michael Colavita, Rodrigo Daboin Sanchez, Robert Darley Waddilove, Anlan Du, Juan Esteller, David Evans, Michael Fine, Simon Fischer, Leor Fishman, Zaymon Foulds-Cook, William Fu, Kent Furuie, Piotr Galuszka, Carolyn Ge, Mark Goldstein, Alexander Golovnev, Sayan Goswami, Michael Haak, Rebecca Hao, Joosep Hook, Thomas HUET, Emily Jia, Chan Kang, Nina Katz-Christy, Vidak Kazic, Eddie Kohler, Estefania Lahera, Allison Lee, Benjamin Lee, Ondřej Lengál, Raymond Lin, Emma Ling, Alex Lombardi, Lisa Lu, Aditya Mahadevan, Christian May, Jacob Meyerson, Leon Mlodzian, George Moe, Glenn Moss, Hamish Nicholson, Owen Niles, Sandip Nirmel, Sebastian Oberhoff, Thomas Orton, Joshua Pan, Pablo Parrilo, Juan Perdomo, Banks Pickett, Aaron Sachs, Abdelrhman Saleh, Brian Sapozhnikov, Anthony Scemama, Peter Schäfer, Josh Seides, Alaisha Sharma, Haneul Shin, Noah Singer, Matthew Smedberg, Miguel Solano, Hikari Sorensen, David Steurer, Alec Sun, Amol Surati, Everett Sussman, Marika Swanberg, Garrett Tanzer, Eric Thomas, Sarah Turnill, Salil Vadhan, Patrick , Jonah Weissman, Ryan Williams, Licheng Xu, Richard Xu, Wanqian Yang, Elizabeth Yeoh-Wang, Josh Zelinsky, Fred Zhang, Grace Zhang, and Jessica Zhu.

I am using many open source software packages in the production of these notes for which I am grateful. In particular, I am thankful to Donald Knuth and Leslie Lamport for LaTeX and to John MacFarlane for Pandoc. David Steurer wrote the original scripts to produce this text. The current version uses Sergio Correia's panflute. The templates for the LaTeX and HTML versions are derived from Tufte LaTeX, Gitbook and Bookdown. Thanks to Amy Hendrickson for some LaTeX consulting. Juan Esteller and Gabe Montague initially implemented the NAND* programming languages in OCaml and Javascript. I used the Jupyter project to write the supplemental code snippets.

Finally, I would like to thank my family: my wife Ravit, and my children Alma and Goren. Working on this book (and the corresponding course) took so much of my time that Alma wrote an essay for her fifth-grade class saying that “universities should not pressure professors to work too much.” I'm afraid all I have to show for this effort is 600 pages of ultra-boring mathematical text.

PRELIMINARIES

Learning Objectives:
• Introduce and motivate the study of computation for its own sake, irrespective of particular implementations.
• The notion of an algorithm and some of its properties.
• Algorithms as not just tools, but also ways of thinking and understanding.
• Taste of Big-$O$ analysis and the surprising creativity in the design of efficient algorithms.

0 Introduction

“Computer Science is no more about computers than astronomy is about telescopes”, attributed to Edsger Dijkstra.1

“Hackers need to understand the theory of computation about as much as painters need to understand paint chemistry.”, Paul Graham 2003.2

“The subject of my talk is perhaps most directly indicated by simply asking two questions: first, is it harder to multiply than to add? and second, why?…I (would like to) show that there is no algorithm for multiplication computationally as simple as that for addition, and this proves something of a stumbling block.”, Alan Cobham, 1964

1 This quote is typically read as disparaging the importance of actual physical computers in Computer Science, but note that telescopes are absolutely essential to astronomy as they provide us with the means to connect theoretical predictions with actual experimental observations.

2 To be fair, in the following sentence Graham says “you need to know how to calculate time and space complexity and about Turing completeness”. This book includes these topics, as well as others such as NP-hardness, randomization, cryptography, quantum computing, and more.

The origin of much of science and medicine can be traced back to the ancient Babylonians. However, the Babylonians' most significant contribution to humanity was arguably the invention of the place-value number system. The place-value system represents any number using a collection of digits, whereby the position of the digit is used to determine its value, as opposed to a system such as Roman numerals, where every symbol has a fixed numerical value regardless of position. For example, the average distance to the moon is approximately 238,900 of our miles or 259,956 Roman miles. The latter quantity, expressed in standard Roman numerals is
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
MMMMMMMMMMMMMMMMMMMDCCCCLVI

Writing the distance to the sun in Roman numerals would require about 100,000 symbols: a 50-page book just containing this single number!
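To make the gap concrete, here is a small Python sketch (my own illustration, not code from the book) that counts the symbols each representation needs. It assumes the purely additive Roman scheme used above (DCCCC rather than CM, with values beyond 1000 written by repeating M); the 93-million-mile figure for the distance to the sun is likewise an assumed round number for illustration.

```python
def roman_length(n):
    # Length of n written in additive Roman numerals, where values above
    # 1000 are expressed by repeating M (as in the numeral printed above).
    length = 0
    for value in [1000, 500, 100, 50, 10, 5, 1]:  # M, D, C, L, X, V, I
        length += n // value
        n %= value
    return length

print(roman_length(259_956))     # 267 symbols: the distance to the moon in Roman miles
print(len(str(259_956)))         # versus just 6 symbols in the place-value system
print(roman_length(93_000_000))  # ~93,000 symbols for the distance to the sun in miles
```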


For someone who thinks of numbers in an additive system like Roman numerals, quantities like the distance to the moon or sun are not merely large—they are unspeakable: they cannot be expressed or even grasped. It's no wonder that Eratosthenes, who was the first person to calculate the earth's diameter (up to about ten percent error), and Hipparchus, who was the first to calculate the distance to the moon, did not use a Roman-numeral system but rather the Babylonian sexagesimal (i.e., base 60) place-value system.

0.1 INTEGER MULTIPLICATION: AN EXAMPLE OF AN ALGORITHM

In the language of Computer Science, the place-value system for representing numbers is known as a data structure: a set of instructions, or “recipe”, for representing objects as symbols. An algorithm is a set of instructions, or “recipe”, for performing operations on such representations. Data structures and algorithms have enabled amazing applications that have transformed human society, but their importance goes beyond their practical utility. Structures from computer science, such as bits, strings, graphs, and even the notion of a program itself, as well as concepts such as universality and replication, have not just found (many) practical uses but contributed a new language and a new way to view the world.

In addition to coming up with the place-value system, the Babylonians also invented the “standard algorithms” that we were all taught in elementary school for adding and multiplying numbers. These algorithms have been essential throughout the ages for people using abaci, papyrus, or pencil and paper, but in our computer age, do they still serve any purpose beyond torturing third graders? To see why these algorithms are still very much relevant, let us compare the Babylonian digit-by-digit multiplication algorithm (“grade-school multiplication”) with the naive algorithm that multiplies numbers through repeated addition. We start by formally describing both algorithms, see Algorithm 0.1 and Algorithm 0.2.

Algorithm 0.1 — Multiplication via repeated addition.
Input: Non-negative integers $x, y$
Output: Product $x \cdot y$
1: Let $result \leftarrow 0$.
2: for $i = 1, \ldots, y$ do
3:     $result \leftarrow result + x$
4: end for
5: return $result$

Algorithm 0.2 — Grade-school multiplication.
Input: Non-negative integers $x, y$
Output: Product $x \cdot y$
1: Write $x = x_{n-1} x_{n-2} \cdots x_0$ and $y = y_{m-1} y_{m-2} \cdots y_0$ in decimal place-value notation. # $x_0$ is the ones digit of $x$, $x_1$ is the tens digit, etc.
2: Let $result \leftarrow 0$
3: for $i = 0, \ldots, n-1$ do
4:     for $j = 0, \ldots, m-1$ do
5:         $result \leftarrow result + 10^{i+j} \cdot x_i \cdot y_j$
6:     end for
7: end for
8: return $result$
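Since this book uses (highly simplified) Python throughout to describe computation, here is a direct Python transcription of the two algorithms — a sketch of my own for illustration, not the book's official code:

```python
def mult_repeated_addition(x, y):
    """Algorithm 0.1: multiply x and y via repeated addition (y iterations)."""
    result = 0
    for _ in range(y):
        result += x
    return result

def mult_grade_school(x, y):
    """Algorithm 0.2: digit-by-digit grade-school multiplication."""
    xs = [int(d) for d in str(x)]   # digits x_{n-1} ... x_0, left to right
    ys = [int(d) for d in str(y)]
    n, m = len(xs), len(ys)
    result = 0
    for i in range(n):
        for j in range(m):
            # xs[n-1-i] is the digit multiplying 10**i in x
            result += 10**(i + j) * xs[n - 1 - i] * ys[m - 1 - j]
    return result

assert mult_repeated_addition(19, 57) == 19 * 57
assert mult_grade_school(238900, 1760) == 238900 * 1760
```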

Both Algorithm 0.1 and Algorithm 0.2 assume that we already know how to add numbers, and Algorithm 0.2 also assumes that we can multiply a number by a power of $10$ (which is, after all, a simple shift). Suppose that $x$ and $y$ are two integers of $n = 20$ digits each. (This roughly corresponds to 64 binary digits, which is a common size in many programming languages.) Computing $x \cdot y$ using Algorithm 0.1 entails adding $x$ to itself $y$ times which entails (since $y$ is a 20-digit number) at least $10^{19}$ additions. In contrast, the grade-school algorithm (i.e., Algorithm 0.2) involves $n^2$ shifts and single-digit products, and so at most $2n^2 = 800$ single-digit operations. To understand the difference, consider that a grade-schooler can perform a single-digit operation in about 2 seconds, and so would require about $1{,}600$ seconds (about half an hour) to compute $x \cdot y$ using Algorithm 0.2. In contrast, even though it is more than a billion times faster than a human, if we used Algorithm 0.1 to compute $x \cdot y$ using a modern PC, it would take us $10^{20}/10^9 = 10^{11}$ seconds (which is more than three millennia!) to compute the same result (a short calculation below reproduces these estimates).

Computers have not made algorithms obsolete. On the contrary, the vast increase in our ability to measure, store, and communicate data has led to much higher demand for developing better and more sophisticated algorithms that empower us to make better decisions based on these data. We also see that to no small extent the notion of an algorithm is independent of the actual computing device that executes it. The digit-by-digit multiplication algorithm is vastly better than iterated addition, regardless of whether the technology we use to implement it is a silicon-based chip, or a third grader with pen and paper.

Theoretical computer science is concerned with the inherent properties of algorithms and computation; namely, those properties that are independent of current technology. We ask some questions that were

already pondered by the Babylonians, such as “what is the best way to multiply two numbers?”, but also questions that rely on cutting-edge science such as “could we use the effects of quantum entanglement to factor numbers faster?”.
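The back-of-the-envelope estimates above can be reproduced in a few lines; the 2-seconds-per-operation and billion-operations-per-second figures are the assumptions stated in the text:

```python
n = 20                                 # decimal digits per input
human_seconds = 2 * (2 * n**2)         # grade-school by hand: 2 s per single-digit op
pc_seconds = 10**20 / 10**9            # repeated addition on a PC: 10**9 ops per second
print(human_seconds / 60)              # about half an hour
print(pc_seconds / (3600 * 24 * 365))  # about 3,171 years: more than three millennia
```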

Remark 0.3 — Specification, implementation and analysis. A full description of an algorithm has three components:

• Specification: What is the task that the algorithm performs (e.g., multiplication in the case of Algorithm 0.1 and Algorithm 0.2.)

• Implementation: How is the task accomplished: what is the sequence of instructions to be performed. Even though Algorithm 0.1 and Algorithm 0.2 perform the same computational task (i.e., they have the same specification), they do it in different ways (i.e., they have different implementations).

• Analysis: Why does this sequence of instructions achieve the desired task. A full description of Algorithm 0.1 and Algorithm 0.2 will include a proof for each one of these algorithms that on input $x, y$, the algorithm does indeed output $x \cdot y$.

Often as part of the analysis we show that the algorithm is not only correct but also efficient. That is, we want to show that not only will the algorithm compute the desired task, but will do so in a prescribed number of operations. For example Algorithm 0.2 computes the multiplication function on inputs of $n$ digits using $O(n^2)$ operations, while Algorithm 0.4 (described below) computes the same function using $O(n^{1.6})$ operations. (We define the $O$ notations used here in Section 1.4.8.)

0.2 EXTENDED EXAMPLE: A FASTER WAY TO MULTIPLY (OPTIONAL)

Once you think of the standard digit-by-digit multiplication algorithm, it seems like the “obviously best” way to multiply numbers. In 1960, the famous Soviet mathematician Andrey Kolmogorov organized a seminar at Moscow State University in which he conjectured that every algorithm for multiplying two $n$ digit numbers would require a number of basic operations that is proportional to $n^2$ ($\Omega(n^2)$ operations, using $O$-notation as defined in Chapter 1). In other words, Kolmogorov conjectured that in any multiplication algorithm, doubling the number of digits would quadruple the number of basic operations required. A young student named Anatoly Karatsuba was in the

audience, and within a week he disproved Kolmogorov's conjecture by discovering an algorithm that requires only about $Cn^{1.6}$ operations for some constant $C$. Such a number becomes much smaller than $n^2$ as $n$ grows and so for large $n$ Karatsuba's algorithm is superior to the grade-school one. (For example, Python's implementation switches from the grade-school algorithm to Karatsuba's algorithm for numbers that are 1000 bits or larger.) While the difference between an $O(n^{1.6})$ and an $O(n^2)$ algorithm can be sometimes crucial in practice (see Section 0.3 below), in this book we will mostly ignore such distinctions. However, we describe Karatsuba's algorithm below since it is a good example of how algorithms can often be surprising, as well as a demonstration of the analysis of algorithms, which is central to this book and to theoretical computer science at large.

Karatsuba's algorithm is based on a faster way to multiply two-digit numbers. Suppose that $x, y \in [100] = \{0, \ldots, 99\}$ are a pair of two-digit numbers. Let's write $\overline{x}$ for the “tens” digit of $x$, and $\underline{x}$ for the “ones” digit, so that $x = 10\overline{x} + \underline{x}$, and write similarly $y = 10\overline{y} + \underline{y}$ for $\overline{x}, \underline{x}, \overline{y}, \underline{y} \in [10]$. The grade-school algorithm for multiplying $x$ and $y$ is illustrated in Fig. 1.

The grade-school algorithm can be thought of as transforming the task of multiplying a pair of two-digit numbers into four single-digit multiplications via the formula

$$(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = 100\,\overline{x}\,\overline{y} + 10(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y} \qquad (1)$$

Figure 1: The grade-school multiplication algorithm illustrated for multiplying $x = 10\overline{x} + \underline{x}$ and $y = 10\overline{y} + \underline{y}$. It uses the formula $(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = 100\,\overline{x}\,\overline{y} + 10(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y}$.

Generally, in the grade-school algorithm doubling the number of digits in the input results in quadrupling the number of operations, leading to an $O(n^2)$-time algorithm. In contrast, Karatsuba's algorithm is based on the observation that we can express Eq. (1) also as

$$(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = (100 - 10)\,\overline{x}\,\overline{y} + 10\,[(\overline{x} + \underline{x})(\overline{y} + \underline{y})] - (10 - 1)\,\underline{x}\,\underline{y} \qquad (2)$$

which reduces multiplying the two-digit numbers $x$ and $y$ to computing the following three simpler products: $\overline{x}\,\overline{y}$, $\underline{x}\,\underline{y}$ and $(\overline{x} + \underline{x})(\overline{y} + \underline{y})$. By repeating the same strategy recursively, we can reduce the task of multiplying two $n$-digit numbers to the task of multiplying three pairs of $\lfloor n/2 \rfloor + 1$ digit numbers.3 Since every time we double the number of digits we triple the number of operations, we will be able to multiply numbers of $n = 2^\ell$ digits using about $3^\ell = n^{\log_2 3} \sim n^{1.585}$ operations.

3 If $x$ is a number then $\lfloor x \rfloor$ is the integer obtained by rounding it down, see Section 1.7.

The above is the intuitive idea behind Karatsuba's algorithm, but is not enough to fully specify it. A complete description of an algorithm entails a precise specification of its operations together with its analysis: proof that the algorithm does in fact do what it's supposed to do. The

operations of Karatsuba’s algorithm are detailed in Algorithm 0.4, while the analysis is given in Lemma 0.5 and Lemma 0.6.

Algorithm 0.4 — Karatsuba multiplication.
Input: Nonnegative integers $x, y$, each of at most $n$ digits
Output: $x \cdot y$
1: procedure Karatsuba($x$, $y$)
2:     if $n \le 4$ then return $x \cdot y$;
3:     Let $m = \lfloor n/2 \rfloor$
4:     Write $x = 10^m \overline{x} + \underline{x}$ and $y = 10^m \overline{y} + \underline{y}$
5:     $A \leftarrow$ Karatsuba($\overline{x}$, $\overline{y}$)
6:     $B \leftarrow$ Karatsuba($\overline{x} + \underline{x}$, $\overline{y} + \underline{y}$)
7:     $C \leftarrow$ Karatsuba($\underline{x}$, $\underline{y}$)
8:     return $(10^{2m} - 10^m) \cdot A + 10^m \cdot B + (1 - 10^m) \cdot C$
9: end procedure

Figure 2: Karatsuba's multiplication algorithm illustrated for multiplying $x = 10\overline{x} + \underline{x}$ and $y = 10\overline{y} + \underline{y}$. We compute the three orange, green and purple products $\underline{x}\,\underline{y}$, $\overline{x}\,\overline{y}$ and $(\overline{x} + \underline{x})(\overline{y} + \underline{y})$ and then add and subtract them to obtain the result.

Algorithm 0.4 is only half of the full description of Karatsuba's algorithm. The other half is the analysis, which entails proving that (1) Algorithm 0.4 indeed computes the multiplication operation and (2) it does so using $O(n^{\log_2 3})$ operations. We now turn to showing both facts:

Lemma 0.5 For every nonnegative integers $x, y$, when given input $x, y$ Algorithm 0.4 will output $x \cdot y$.

Figure 3: Running time of Karatsuba's algorithm vs. the grade-school algorithm. (Python implementation available online.) Note the existence of a “cutoff” length, where for sufficiently large inputs Karatsuba becomes more efficient than the grade-school algorithm. The precise cutoff location varies by implementation and platform details, but will always occur eventually.

Proof. Let $n$ be the maximum number of digits of $x$ and $y$. We prove the lemma by induction on $n$. The base case is $n \le 4$, where the algorithm returns $x \cdot y$ by definition. (It does not matter which algorithm we use to multiply four-digit numbers - we can even use repeated addition.) Otherwise, if $n > 4$, we define $m = \lfloor n/2 \rfloor$, and write $x = 10^m \overline{x} + \underline{x}$ and $y = 10^m \overline{y} + \underline{y}$. Plugging this into $x \cdot y$, we get

$$x \cdot y = 10^{2m}\,\overline{x}\,\overline{y} + 10^m(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y}. \qquad (3)$$

Rearranging the terms we see that

$$x \cdot y = 10^{2m}\,\overline{x}\,\overline{y} + 10^m\left[(\overline{x} + \underline{x})(\overline{y} + \underline{y}) - \overline{x}\,\overline{y} - \underline{x}\,\underline{y}\right] + \underline{x}\,\underline{y}. \qquad (4)$$

Since the numbers $\overline{x}$, $\underline{x}$, $\overline{y}$, $\underline{y}$, $\overline{x} + \underline{x}$ and $\overline{y} + \underline{y}$ all have at most $m + 1 < n$ digits, the induction hypothesis implies that the values $A, B, C$ computed by the recursive calls will satisfy $A = \overline{x}\,\overline{y}$, $B = (\overline{x} + \underline{x})(\overline{y} + \underline{y})$ and $C = \underline{x}\,\underline{y}$. Plugging this into (4) we see that $x \cdot y$ equals the value $(10^{2m} - 10^m) \cdot A + 10^m \cdot B + (1 - 10^m) \cdot C$ computed by Algorithm 0.4. ■

Lemma 0.6 If $x, y$ are integers of at most $n$ digits, Algorithm 0.4 will take $O(n^{\log_2 3})$ operations on input $x, y$.

Proof. Fig. 4 illustrates the idea behind the proof, which we only sketch here, leaving filling out the details as Exercise 0.4. The proof is again by induction. We define $T(n)$ to be the maximum number of steps that Algorithm 0.4 takes on inputs of length at most $n$. Since in the base case $n \le 2$, Algorithm 0.4 performs a constant number of computations, we know that $T(2) \le c$ for some constant $c$ and for $n > 2$, it satisfies the recursive equation

$$T(n) \le 3T(\lfloor n/2 \rfloor + 1) + c'n \qquad (5)$$

for some constant $c'$ (using the fact that addition can be done in $O(n)$ operations). The recursive equation (5) solves to $O(n^{\log_2 3})$. The intuition behind this is presented in Fig. 4, and this is also a consequence of the so called “Master Theorem” on recurrence relations. As mentioned above, we leave completing the proof to the reader as Exercise 0.4. ■

Figure 4: Karatsuba’s algorithm reduces an -bit multiplication to three -bit multiplications, which in turn are reduced to nine -bit multi-푛 plications and so on. We푛/2 can represent the compu- tational cost of all these multiplications푛/4 in a -ary of depth log , where at the root the extra cost is operations, at the first level the extra3 cost is operations,2 푛 and at each of the nodes of level푐푛 , the extra cost is . The total푖 cost is log log 푐(푛/2) 푖 by the3 formula for 푖 푐(푛/2 ) summing2 푛 a geometric푖 series.2 3 푐푛 ∑푖=0 (3/2) ≤ 10푐푛

Karatsuba’s algorithm is by no means the end of the line for multi- plication algorithms. In the 1960’s, Toom and Cook extended Karat- suba’s ideas to get an log time multiplication algorithm for every constant . In 1971, Schönhage푘(2푘−1) and Strassen got even better al- gorithms using the Fast푂(푛 Fourier Transform) ; their idea was to somehow treat integers as푘 “signals” and do the multiplication more efficiently by moving to the Fourier domain. (The Fourier transform is a central tool in and , used in a great many applica- tions; if you have not seen it yet, you are likely encounter it at some point in your studies.) In the years that followed researchers kept im- proving the algorithm, and only very recently Harvey and Van Der 36 introduction to theoretical computer science

Hoeven managed to obtain an $O(n \log n)$ time algorithm for multiplication (though it only starts beating the Schönhage-Strassen algorithm for truly astronomical numbers). Yet, despite all this progress, we still don't know whether or not there is an $O(n)$ time algorithm for multiplying two $n$ digit numbers!

Remark 0.7 — Matrix Multiplication (advanced note). (This book contains many “advanced” or “optional” notes and sections. These may assume background that not every student has, and can be safely skipped over as none of the future parts depends on them.)

Ideas similar to Karatsuba's can be used to speed up matrix multiplications as well. Matrices are a powerful way to represent linear equations and operations, widely used in a great many applications of scientific computing, graphics, machine learning, and many many more. One of the basic operations one can do with two matrices is to multiply them. For example,

if $x = \begin{pmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \end{pmatrix}$ and $y = \begin{pmatrix} y_{0,0} & y_{0,1} \\ y_{1,0} & y_{1,1} \end{pmatrix}$ then the product of $x$ and $y$ is the matrix

$$\begin{pmatrix} x_{0,0}y_{0,0} + x_{0,1}y_{1,0} & x_{0,0}y_{0,1} + x_{0,1}y_{1,1} \\ x_{1,0}y_{0,0} + x_{1,1}y_{1,0} & x_{1,0}y_{0,1} + x_{1,1}y_{1,1} \end{pmatrix}.$$

You can see that we can compute this matrix by eight products of numbers.

Now suppose that $n$ is even and $x$ and $y$ are a pair of $n \times n$ matrices which we can think of as each composed of four $(n/2) \times (n/2)$ blocks $x_{0,0}, x_{0,1}, x_{1,0}, x_{1,1}$ and $y_{0,0}, y_{0,1}, y_{1,0}, y_{1,1}$. Then the formula for the matrix product of $x$ and $y$ can be expressed in the same way as above, just replacing products $x_{a,b}y_{c,d}$ with matrix products, and addition with matrix addition. This means that we can use the formula above to give an algorithm that doubles the dimension of the matrices at the expense of increasing the number of operations by a factor of $8$, which for $n = 2^\ell$ results in $8^\ell = n^3$ operations.

In 1969 Volker Strassen noted that we can compute the product of a pair of two-by-two matrices using only seven products of numbers by observing that each entry of the matrix $xy$ can be computed by adding and subtracting the following seven terms: $t_1 = (x_{0,0} + x_{1,1})(y_{0,0} + y_{1,1})$, $t_2 = (x_{1,0} + x_{1,1})y_{0,0}$, $t_3 = x_{0,0}(y_{0,1} - y_{1,1})$, $t_4 = x_{1,1}(y_{1,0} - y_{0,0})$, $t_5 = (x_{0,0} + x_{0,1})y_{1,1}$, $t_6 = (x_{1,0} - x_{0,0})(y_{0,0} + y_{0,1})$, $t_7 = (x_{0,1} - x_{1,1})(y_{1,0} + y_{1,1})$. Indeed, one can verify that

$$xy = \begin{pmatrix} t_1 + t_4 - t_5 + t_7 & t_3 + t_5 \\ t_2 + t_4 & t_1 + t_3 - t_2 + t_6 \end{pmatrix}.$$

Using this observation, we can obtain an algorithm such that doubling the dimension of the matrices results in increasing the number of operations by a factor of $7$, which means that for $n = 2^\ell$ the cost is $7^\ell = n^{\log_2 7} \sim n^{2.807}$. A long sequence of work has since improved this algorithm, and the current record has running time about $O(n^{2.373})$. However, unlike the case of integer multiplication, at the moment we don't know of any algorithm for matrix multiplication that runs in time linear or even close to linear in the size of the input matrices (e.g., an $O(n^2 \, \mathrm{polylog}(n))$ time algorithm). People have tried to use group representations, which can be thought of as generalizations of the Fourier transform, to obtain faster algorithms, but this effort has not yet succeeded.
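As a concrete illustration of the remark above, here is a short Python sketch (my addition, not the book's code) of one level of Strassen's recursion: it multiplies two $n \times n$ matrices, with $n$ even, viewed as $2 \times 2$ block matrices, using the seven block products $t_1, \ldots, t_7$. The block products themselves are computed here with numpy's @ operator for simplicity; applying the same idea recursively is what yields the $n^{\log_2 7}$ algorithm.

import numpy as np

def strassen_step(x, y):
    # One level of Strassen's recursion on 2x2 block matrices.
    m = x.shape[0] // 2
    x00, x01, x10, x11 = x[:m, :m], x[:m, m:], x[m:, :m], x[m:, m:]
    y00, y01, y10, y11 = y[:m, :m], y[:m, m:], y[m:, :m], y[m:, m:]
    t1 = (x00 + x11) @ (y00 + y11)
    t2 = (x10 + x11) @ y00
    t3 = x00 @ (y01 - y11)
    t4 = x11 @ (y10 - y00)
    t5 = (x00 + x01) @ y11
    t6 = (x10 - x00) @ (y00 + y01)
    t7 = (x01 - x11) @ (y10 + y11)
    # Seven block products instead of eight:
    return np.block([[t1 + t4 - t5 + t7, t3 + t5],
                     [t2 + t4, t1 + t3 - t2 + t6]])

x, y = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(strassen_step(x, y), x @ y)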

0.3 ALGORITHMS BEYOND ARITHMETIC

The quest for better algorithms is by no means restricted to arithmetic tasks such as adding, multiplying or solving equations. Many graph algorithms, including algorithms for finding paths, matchings, spanning trees, cuts, and flows, have been discovered in the last several decades, and this is still an intensive area of research. (For example, the last few years saw many advances in algorithms for the maximum flow problem, borne out of unexpected connections with electrical circuits and linear equation solvers.) These algorithms are being used not just for the "natural" applications of routing network traffic or GPS-based navigation, but also for applications as varied as drug discovery through searching for structures in gene-interaction graphs to computing risks from correlations in financial investments.
Google was founded based on the PageRank algorithm, which is an efficient algorithm to approximate the "principal eigenvector" of (a dampened version of) the adjacency matrix of the web graph. The Akamai company was founded based on a new data structure, known as consistent hashing, for a hash table where buckets are stored at different servers. The backpropagation algorithm, which computes partial derivatives of a neural network in $O(n)$ instead of $O(n^2)$ time, underlies many of the recent phenomenal successes of learning deep neural networks. Algorithms for solving linear equations under sparsity constraints, a concept known as compressed sensing, have been used to drastically reduce the amount and quality of data needed to analyze MRI images. This made a critical difference for MRI imaging of cancer tumors in children, where previously doctors needed to use anesthesia to suspend breath during the MRI exam, sometimes with dire consequences.

Even for classical questions, studied through the ages, new discoveries are still being made. For example, for the question of determining whether a given integer is prime or composite, which has been studied since the days of Pythagoras, efficient probabilistic algorithms were only discovered in the 1970s, while the first deterministic polynomial-time algorithm was only found in 2002. For the related problem of actually finding the factors of a composite number, new algorithms were found in the 1980s, and (as we'll see later in this course) discoveries in the 1990s raised the tantalizing prospect of obtaining faster algorithms through the use of quantum mechanical effects.
Despite all this progress, there are still many more questions than answers in the world of algorithms. For almost all natural problems, we do not know whether the current algorithm is the "best", or whether a significantly better one is still waiting to be discovered. As alluded to in Cobham's opening quote for this chapter, even for the basic problem of multiplying numbers we have not yet answered the question of whether there is a multiplication algorithm that is as efficient as our algorithms for addition. But at least we now know the right way to ask it.

0.4 ON THE IMPORTANCE OF NEGATIVE RESULTS

Finding better algorithms for problems such as multiplication, solving equations, graph problems, or fitting neural networks to data, is undoubtedly a worthwhile endeavor. But why is it important to prove that such algorithms don't exist? One motivation is pure intellectual curiosity. Another reason to study impossibility results is that they correspond to the fundamental limits of our world. In other words, impossibility results are laws of nature.
Here are some examples of impossibility results outside computer science (see Section 0.7 for more about these). In physics, the impossibility of building a perpetual motion machine corresponds to the law of conservation of energy. The impossibility of building a heat engine beating Carnot's bound corresponds to the second law of thermodynamics, while the impossibility of faster-than-light information transmission is a cornerstone of special relativity. In mathematics, while we all learned the formula for solving quadratic equations in high school, the impossibility of generalizing this formula to equations of degree five or more gave birth to group theory. The impossibility of proving Euclid's fifth axiom from the first four gave rise to non-Euclidean geometries, which ended up crucial for the theory of general relativity.
In an analogous way, impossibility results for computation correspond to "computational laws of nature" that tell us about the fundamental limits of any computing apparatus, whether based on silicon, neurons, or quantum particles. Moreover, computer scientists found creative approaches to apply computational limitations to achieve certain useful tasks. For example, much of modern Internet traffic is encrypted using the RSA encryption scheme, whose security relies on the (conjectured) impossibility of efficiently factoring large integers. More recently, the Bitcoin system uses a digital analog of the "gold standard" where, instead of using a precious metal, new currency is obtained by "mining" solutions for computationally difficult problems.

Chapter Recap

• The history of algorithms goes back thousands of years; they have been essential to much of human progress and these days form the basis of multi-billion dollar industries, as well as life-saving technologies.
• There is often more than one algorithm to achieve the same computational task. Finding a faster algorithm can often make a much bigger difference than improving computing hardware.
• Better algorithms and data structures don't just speed up calculations, but can yield new qualitative insights.
• One question we will study is to find out what is the most efficient algorithm for a given problem.
• To show that an algorithm is the most efficient one for a given problem, we need to be able to prove that it is impossible to solve the problem using a smaller amount of computational resources.

0.5 ROADMAP TO THE REST OF THIS BOOK

Often, when we try to solve a computational problem, whether it is solving a system of linear equations, finding the top eigenvector of a matrix, or trying to rank Internet search results, it is enough to use the "I know it when I see it" standard for describing algorithms. As long as we find some way to solve the problem, we are happy and might not care much about the exact mathematical model for our algorithm. But when we want to answer a question such as "does there exist an algorithm to solve the problem $P$?" we need to be much more precise. In particular, we will need to (1) define exactly what it means to solve $P$, and (2) define exactly what an algorithm is. Even (1) can sometimes be non-trivial but (2) is particularly challenging; it is not at all clear how (and even whether) we can encompass all potential ways to design algorithms. We will consider several simple models of computation, and argue that, despite their simplicity, they do capture all "reasonable" approaches to achieve computing, including all those that are currently used in modern computing devices.
Once we have these formal models of computation, we can try to obtain impossibility results for computational tasks, showing that some problems cannot be solved (or perhaps cannot be solved within the resources of our universe). Archimedes once said that given a fulcrum and a long enough lever, he could move the world. We will see how reductions allow us to leverage one hardness result into a slew of others, illuminating the boundaries between the computable and uncomputable (or tractable and intractable) problems.
Later in this book we will go back to examining our models of computation, and see how resources such as randomness or quantum entanglement could potentially change the power of our model. In the context of probabilistic algorithms, we will see a glimpse of how randomness has become an indispensable tool for understanding computation, information, and communication. We will also see how computational difficulty can be an asset rather than a hindrance, and be used for the "derandomization" of probabilistic algorithms. The same ideas also show up in cryptography, which has undergone not just a technological but also an intellectual revolution in the last few decades, much of it building on the foundations that we explore in this course.
Theoretical Computer Science is a vast topic, branching out and touching upon many scientific and engineering disciplines. This book provides a very partial (and biased) sample of this area. More than anything, I hope I will manage to "infect" you with at least some of my love for this field, which is inspired and enriched by the connection to practice, but is also deep and beautiful regardless of applications.

0.5.1 Dependencies between chapters This book is divided into the following parts; see Fig. 5.

• Preliminaries: Introduction, mathematical background, and representing objects as strings.

• Part I: Finite computation (Boolean circuits): Equivalence of circuits and straight-line programs. Universal gate sets. Existence of a circuit for every function, representing circuits as strings, universal circuit, lower bound on circuit size using the counting argument.

• Part II: Uniform computation (Turing machines): Equivalence of Turing machines and programs with loops. Equivalence of models (including RAM machines, $\lambda$ calculus, and cellular automata), configurations of Turing machines, existence of a universal Turing machine, uncomputable functions (including the Halting problem and Rice's Theorem), Gödel's incompleteness theorem, restricted computational models (regular and context free languages).

• Part III: Efficient computation: Definition of running time, time hierarchy theorem, P and NP, $P_{/poly}$, NP completeness and the Cook-Levin Theorem, space bounded computation.

• Part IV: Randomized computation: Probability, randomized algorithms, BPP, amplification, $BPP \subseteq P_{/poly}$, pseudorandom generators and derandomization.

• Part V: Advanced topics: Cryptography, proofs and algorithms (interactive and zero knowledge proofs, Curry-Howard correspondence), quantum computing.

Figure 5: The dependency structure of the different parts. Part I introduces the model of Boolean circuits to study finite functions with an emphasis on quantitative questions (how many gates to compute a function). Part II introduces the model of Turing machines to study functions that have unbounded input lengths with an emphasis on qualitative questions (is this function computable or not). Much of Part II does not depend on Part I, as Turing machines can be used as the first computational model. Part III depends on both parts as it introduces a quantitative study of functions with unbounded input length. The more advanced parts IV (randomized computation) and V (advanced topics) rely on the material of Parts I, II and III.

The book largely proceeds in linear order, with each chapter building on the previous ones, with the following exceptions:

• The topics of $\lambda$ calculus (Section 8.5), Gödel's incompleteness theorem (Chapter 11), automata/regular expressions and context-free grammars (Chapter 10), and space-bounded computation (Chapter 17), are not used in the following chapters. Hence you can choose whether to cover or skip any subset of them.

• Part II (Uniform Computation / Turing Machines) does not have a strong dependency on Part I (Finite computation / Boolean circuits) and it should be possible to teach them in the reverse order with minor modification. Boolean circuits are used in Part III (efficient computation) for results such as $P \subseteq P_{/poly}$ and the Cook-Levin Theorem, as well as in Part IV (for $BPP \subseteq P_{/poly}$ and derandomization) and Part V (specifically in cryptography and quantum computing).

• All chapters in Part V (Advanced topics) are independent of one another and can be covered in any order.

A course based on this book can use all of Parts I, II, and III (possibly skipping over some or all of the $\lambda$ calculus, Chapter 11, Chapter 10 or Chapter 17), and then either cover all or some of Part IV (randomized computation), and add a "sprinkling" of advanced topics from Part V based on student or instructor interest.

0.6 EXERCISES

Exercise 0.1 Rank the significance of the following inventions in speeding up multiplication of large (that is 100-digit or more) numbers. That is, use "back of the envelope" estimates to order them in terms of the factor they offered over the previous state of affairs.

a. Discovery of the grade-school digit by digit algorithm (improving upon repeated addition)

b. Discovery of Karatsuba's algorithm (improving upon the digit by digit algorithm)

c. Invention of modern electronic computers (improving upon calculations with pen and paper).

Exercise 0.2 The 1977 Apple II personal computer had a processor speed of 1.023 MHz, or about $10^6$ operations per second. At the time of this writing the world's fastest supercomputer performs 93 "petaflops" (a petaflop is $10^{15}$ floating point operations per second), or about $10^{18}$ basic steps per second. For each one of the following running times (as a function of the input length $n$), compute for both computers how large an input they could handle in a week of computation, if they run an algorithm that has this running time:

a. $n$ operations.

b. $n^2$ operations.

c. $n \log n$ operations.

d. $2^n$ operations.

e. $n!$ operations. ■

Exercise 0.3 — Usefulness of algorithmic non-existence. In this chapter we mentioned several companies that were founded based on the discovery of new algorithms. Can you give an example for a company that was founded based on the non-existence of an algorithm? See footnote for hint.⁴ ■

⁴ As we will see in Chapter 21, almost any company relying on cryptography needs to assume the non-existence of certain algorithms. In particular, RSA Security was founded based on the security of the RSA cryptosystem, which presumes the non-existence of an efficient algorithm to compute the prime factorization of large integers.

Exercise 0.4 — Analysis of Karatsuba's Algorithm.

a. Suppose that $T_1, T_2, T_3, \ldots$ is a sequence of numbers such that $T_2 \leq 10$ and for every $n$, $T_n \leq 3T_{\lfloor n/2 \rfloor + 1} + Cn$ for some $C \geq 1$. Prove that $T_n \leq 20Cn^{\log_2 3}$ for every $n > 2$.⁵

⁵ Hint: Use a proof by induction - suppose that this is true for all $n$'s from $1$ to $m$ and prove that this is true also for $m + 1$.

b. Prove that the number of single-digit operations that Karatsuba's algorithm takes to multiply two $n$ digit numbers is at most $1000n^{\log_2 3}$. ■

Exercise 0.5 Implement in the programming language of your choice functions Gradeschool_multiply(x,y) and Karatsuba_multiply(x,y) that take two arrays of digits x and y and return an array representing the product of x and y (where x is identified with the number x[0]+10*x[1]+100*x[2]+... etc.) using the grade-school algorithm and the Karatsuba algorithm respectively. At what number of digits does the Karatsuba algorithm beat the grade-school one? ■

Exercise 0.6 — Matrix Multiplication (optional, advanced). In this exercise, we show that if for some $\omega > 2$, we can write the product of two $k \times k$ real-valued matrices $A, B$ using at most $k^\omega$ multiplications, then we can multiply two $n \times n$ matrices in roughly $n^\omega$ time for every large enough $n$.
To make this precise, we need to make some notation that is unfortunately somewhat cumbersome. Assume that there is some $k \in \mathbb{N}$ and $m \leq k^\omega$ such that for every $k \times k$ matrices $A, B, C$ such that $C = AB$, we can write for every $i, j \in [k]$:

$C_{i,j} = \sum_{\ell=0}^{m-1} \alpha_{i,j}^{\ell} f_\ell(A) g_\ell(B)$ (6)

for some linear functions $f_0, \ldots, f_{m-1}, g_0, \ldots, g_{m-1} : \mathbb{R}^{k^2} \to \mathbb{R}$ and coefficients $\{\alpha_{i,j}^{\ell}\}_{i,j \in [k], \ell \in [m]}$. Prove that under this assumption for every $\epsilon > 0$, if $n$ is sufficiently large, then there is an algorithm that computes the product of two $n \times n$ matrices using at most $O(n^{\omega+\epsilon})$ arithmetic operations. See footnote for hint.⁶

0.7 BIBLIOGRAPHICAL NOTES

For a brief overview of what we'll see in this book, you could do far worse than read Bernard Chazelle's wonderful essay on the Algorithm as an Idiom of modern science. The book of Moore and Mertens [MM11] gives a wonderful and comprehensive overview of the theory of computation, including much of the content discussed in this chapter and the rest of this book. Aaronson's book [Aar13] is another great read that touches upon many of the same themes. For more on the algorithms the Babylonians used, see Knuth's paper and Neugebauer's classic book.
Many of the algorithms we mention in this chapter are covered in algorithms textbooks such as those by Cormen, Leiserson, Rivest, and Stein [Cor+09], Kleinberg and Tardos [KT06], and Dasgupta, Papadimitriou and Vazirani [DPV08], as well as Jeff Erickson's textbook. Erickson's book is freely available online and contains a great exposition of recursive algorithms in general and Karatsuba's algorithm in particular.
The story of Karatsuba's discovery of his multiplication algorithm is recounted by him in [Kar95]. As mentioned above, further improvements were made by Toom and Cook [Too63; Coo66], Schönhage and Strassen [SS71], Fürer [Für07], and recently by Harvey and Van Der Hoeven [HV19]; see this article for a nice overview. The last papers crucially rely on the Fast Fourier transform algorithm. The fascinating story of the (re)discovery of this algorithm by John Tukey in the context of the cold war is recounted in [Coo87]. (We say re-discovery because it later turned out that the algorithm dates back to Gauss [HJB85].) The Fast Fourier Transform is covered in some of the books mentioned below, and there are also online available lectures such as Jeff Erickson's. See also this popular article by David Austin. Fast matrix multiplication was discovered by Strassen [Str69], and since then this has been an active area of research. [Blä13] is a recommended self-contained survey of this area.
The Backpropagation algorithm for fast differentiation of neural networks was invented by Werbos [Wer74]. The Pagerank algorithm was invented by Larry Page and Sergey Brin [Pag+99]. It is closely related to the HITS algorithm of Kleinberg [Kle99]. The Akamai company was founded based on the consistent hashing data structure described in [Kar+97]. Compressed sensing has a long history but two foundational papers are [CRT06; Don06]. [Lus+08] gives a survey of applications of compressed sensing to MRI; see also this popular article by Ellenberg [Ell10]. The deterministic polynomial-time algorithm for testing primality was given by Agrawal, Kayal, and Saxena [AKS04].

We alluded briefly to classical impossibility results in mathematics, including the impossibility of proving Euclid's fifth postulate from the other four, impossibility of trisecting an angle with a straightedge and compass and the impossibility of solving a quintic equation via radicals. A geometric proof of the impossibility of angle trisection (one of the three geometric problems of antiquity, going back to the ancient Greeks) is given in this blog post of Tao. The book of Mario Livio [Liv05] covers some of the background and ideas behind these impossibility results. Some exciting recent research is focused on trying to use computational complexity to shed light on fundamental questions in physics such as understanding black holes and reconciling general relativity with quantum mechanics.

Learning Objectives:
• Recall basic mathematical notions such as sets, functions, numbers, logical operators and quantifiers, strings, and graphs.
• Rigorously define Big-$O$ notation.
• Proofs by induction.
• Practice with reading mathematical definitions, statements, and proofs.
• Transform an intuitive argument into a rigorous proof.

1 Mathematical Background

“I found that every number, which may be expressed from one to ten, surpasses the preceding by one unit: afterwards the ten is doubled or tripled … until a hundred; then the hundred is doubled and tripled in the same manner as the units and the tens … and so forth to the utmost limit of numeration.”, Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Fredric Rosen, 1831.

In this chapter we review some of the mathematical concepts that we use in this book. These concepts are typically covered in courses or textbooks on "mathematics for computer science" or "discrete mathematics"; see the "Bibliographical Notes" section (Section 1.9) for several excellent resources on these topics that are freely available online.
A mathematician's apology. Some students might wonder why this book contains so much math. The reason is that mathematics is simply a language for modeling concepts in a precise and unambiguous way. In this book we use math to model the concept of computation. For example, we will consider questions such as "is there an efficient algorithm to find the prime factors of a given integer?". (We will see that this question is particularly interesting, touching on areas as far apart as Internet security and quantum mechanics!) To even phrase such a question, we need to give a precise definition of the notion of an algorithm, and of what it means for an algorithm to be efficient. Also, since there is no empirical experiment to prove the nonexistence of an algorithm, the only way to establish such a result is using a mathematical proof.

1.1 THIS CHAPTER: A READER’S MANUAL

Depending on your background, you can approach this chapter in two different ways:

• If you have already taken a "discrete mathematics", "mathematics for computer science" or similar course, you do not need to read the whole chapter. You can just take a quick look at Section 1.2 to see the main tools we will use, Section 1.7 for our notation and conventions, and then skip ahead to the rest of this book. Alternatively, you can sit back, relax, and read this chapter just to get familiar with our notation, as well as to enjoy (or not) my philosophical musings and attempts at humor.

• If your background is less extensive, see Section 1.9 for some resources on these topics. This chapter briefly covers the concepts that we need, but you may find it helpful to see a more in-depth treatment. As usual with math, the best way to get comfortable with this material is to work out exercises on your own.

• You might also want to start brushing up on discrete probability, which we’ll use later in this book (see Chapter 18).

1.2 A QUICK OVERVIEW OF MATHEMATICAL PREREQUISITES

The main mathematical concepts we will use are the following. We just list these notions below, deferring their definitions to the rest of this chapter. If you are familiar with all of these, then you might want to just skip to Section 1.7 to see the full list of notation we use.

• Proofs: First and foremost, this book involves a heavy dose of formal mathematical reasoning, which includes mathematical definitions, statements, and proofs.

• Sets and set operations: We will use extensively mathematical sets. We use the basic set relations of membership ($\in$) and containment ($\subseteq$), and set operations, principally union ($\cup$), intersection ($\cap$), and set difference ($\setminus$).

• Cartesian product and Kleene star operation: We also use the Cartesian product of two sets $A$ and $B$, denoted as $A \times B$ (that is, the set of pairs $(a, b)$ where $a \in A$ and $b \in B$). We denote by $A^n$ the $n$-fold Cartesian product (e.g., $A^3 = A \times A \times A$) and by $A^*$ (known as the Kleene star) the union of $A^n$ for all $n \in \{0, 1, 2, \ldots\}$.

• Functions: The domain and codomain of a function, properties such as being one-to-one (also known as injective) or onto (also known as surjective) functions, as well as partial functions (that, unlike standard or "total" functions, are not necessarily defined on all elements of their domain).

• Logical operations: The operations AND ($\wedge$), OR ($\vee$), and NOT ($\neg$) and the quantifiers "there exists" ($\exists$) and "for all" ($\forall$).

• Basic combinatorics: Notions such as $\binom{n}{k}$ (the number of $k$-sized subsets of a set of size $n$).

• Graphs: Undirected and directed graphs, connectivity, paths, and cycles.

• Big-$O$ notation: $O, o, \Omega, \omega, \Theta$ notation for analyzing asymptotic growth of functions.

• Discrete probability: We will use probability theory, and specifically probability over finite sample spaces such as tossing $n$ coins, including notions such as random variables, expectation, and concentration. We will only use probability theory in the second half of this text, and will review it beforehand in Chapter 18. However, probabilistic reasoning is a subtle (and extremely useful!) skill, and it's always good to start early in acquiring it.

In the rest of this chapter we briefly review the above notions. This is partially to remind the reader and reinforce material that might not be fresh in your mind, and partially to introduce our notation and conventions which might occasionally differ from those you’ve encountered before.

1.3 READING MATHEMATICAL TEXTS

Mathematicians use jargon for the same reason that it is used in many other professions such as engineering, law, medicine, and others. We want to make terms precise and introduce shorthand for concepts that are frequently reused. Mathematical texts tend to "pack a lot of punch" per sentence, and so the key is to read them slowly and carefully, parsing each symbol at a time.
With time and practice you will see that reading mathematical texts becomes easier and jargon is no longer an issue. Moreover, reading mathematical texts is one of the most transferable skills you could take from this book. Our world is changing rapidly, not just in the realm of technology, but also in many other human endeavors, whether it is medicine, economics, law or even culture. Whatever your future aspirations, it is likely that you will encounter texts that use new concepts that you have not seen before (see Fig. 1.1 and Fig. 1.2 for two recent examples from current "hot areas"). Being able to internalize and then apply new definitions can be hugely important. It is a skill that's much easier to acquire in the relatively safe and stable context of a mathematical course, where one at least has the guarantee that the concepts are fully specified, and you have access to your teaching staff for questions.
The basic components of a mathematical text are definitions, assertions and proofs.

Figure 1.1: A snippet from the "methods" section of the "AlphaGo Zero" paper by Silver et al, Nature, 2017.
Figure 1.2: A snippet from the "Zerocash" paper of Ben-Sasson et al, that forms the basis of the cryptocurrency startup Zcash.

1.3.1 Definitions Mathematicians often define new concepts in terms of old concepts. For example, here is a mathematical definition which you may have encountered in the past (and will see again shortly):

Definition 1.1 — One to one function. Let $S, T$ be sets. We say that a function $f : S \to T$ is one to one (also known as injective) if for every two elements $x, x' \in S$, if $x \neq x'$ then $f(x) \neq f(x')$.

Definition 1.1 captures a simple concept, but even so it uses quite a bit of notation. When reading such a definition, it is often useful to annotate it with a pen as you're going through it (see Fig. 1.3). For example, when you see an identifier such as $f$, $S$ or $x$, make sure that you realize what type of object it is: is it a set, a function, an element, a number, a gremlin? You might also find it useful to explain the definition in words to a friend (or to yourself).

1.3.2 Assertions: Theorems, lemmas, claims Theorems, lemmas, claims and the like are true statements about the concepts we defined. Deciding whether to call a particular statement a "Theorem", a "Lemma" or a "Claim" is a judgement call, and does not make a mathematical difference. All three correspond to statements which were proven to be true. The difference is that a Theorem refers to a significant result, that we would want to remember and highlight. A Lemma often refers to a technical result, that is not necessarily important in its own right, but can be often very useful in proving other theorems. A Claim is a "throw away" statement, that we need to use in order to prove some other bigger results, but do not care so much about for its own sake.

Figure 1.3: An annotated form of Definition 1.1, marking which part is being defined and how.

1.3.3 Proofs Mathematical proofs are the arguments we use to demonstrate that our theorems, lemmas, and claims are indeed true. We discuss proofs in Section 1.5 below, but the main point is that the mathematical standard of proof is very high. Unlike in some other realms, in mathematics a proof is an "airtight" argument that demonstrates that the statement is true beyond a shadow of a doubt. Some examples in this section for mathematical proofs are given in Solved Exercise 1.1 and Section 1.6. As mentioned in the preface, as a general rule, it is more important you understand the definitions than the theorems, and it is more important you understand a theorem statement than its proof.

1.4 BASIC DISCRETE MATH OBJECTS

In this section we quickly review some of the mathematical objects (the “basic data structures” of mathematics, if you will) we use in this book.

1.4.1 Sets A set is an unordered collection of objects. For example, when we write $S = \{2, 4, 7\}$, we mean that $S$ denotes the set that contains the numbers $2$, $4$, and $7$. (We use the notation "$2 \in S$" to denote that $2$ is an element of $S$.) Note that the sets $\{2, 4, 7\}$ and $\{7, 4, 2\}$ are identical, since they contain the same elements. Also, a set either contains an element or does not contain it – there is no notion of containing it "twice" – and so we could even write the same set $S$ as $\{2, 2, 4, 7\}$ (though that would be a little weird). The cardinality of a finite set $S$, denoted by $|S|$, is the number of elements it contains. (Cardinality can be defined for infinite sets as well; see the sources in Section 1.9.) So, in the example above, $|S| = 3$.
A set $S$ is a subset of a set $T$, denoted by $S \subseteq T$, if every element of $S$ is also an element of $T$. (We can also describe this by saying that $T$ is a superset of $S$.) For example, $\{2, 7\} \subseteq \{2, 4, 7\}$. The set that contains no elements is known as the empty set and it is denoted by $\emptyset$. If $A$ is a subset of $B$ that is not equal to $B$ we say that $A$ is a strict subset of $B$, and denote this by $A \subsetneq B$.
We can define sets by either listing all their elements or by writing down a rule that they satisfy such as

$\mathrm{EVEN} = \{x \mid x = 2y \text{ for some non-negative integer } y\}.$ (1.1)

Of course there is more than one way to write the same set, and often we will use intuitive notation listing a few examples that illustrate the rule. For example, we can also define EVEN as

$\mathrm{EVEN} = \{0, 2, 4, \ldots\}.$ (1.2)

Note that a set can be either finite (such as the set $\{2, 4, 7\}$) or infinite (such as the set EVEN). Also, the elements of a set don't have to be numbers. We can talk about the sets such as the set $\{a, e, i, o, u\}$ of all the vowels in the English language, or the set $\{$New York, Los Angeles, Chicago, Houston, Philadelphia, Phoenix, San Antonio, San Diego, Dallas$\}$ of all cities in the U.S. with population more than one million per the 2010 census. A set can even have other sets as elements, such as the set $\{\emptyset, \{1, 2\}, \{2, 3\}, \{1, 3\}\}$ of all even-sized subsets of $\{1, 2, 3\}$.
Operations on sets: The union of two sets $S, T$, denoted by $S \cup T$, is the set that contains all elements that are either in $S$ or in $T$. The intersection of $S$ and $T$, denoted by $S \cap T$, is the set of elements that are

both in $S$ and in $T$. The set difference of $S$ and $T$, denoted by $S \setminus T$ (and in some texts also by $S - T$), is the set of elements that are in $S$ but not in $T$.
Tuples, lists, strings, sequences: A tuple is an ordered collection of items. For example $(1, 5, 2, 1)$ is a tuple with four elements (also known as a $4$-tuple or quadruple). Since order matters, this is not the same tuple as the $4$-tuple $(1, 1, 5, 2)$ or the $3$-tuple $(1, 5, 2)$. A $2$-tuple is also known as a pair. We use the terms tuples and lists interchangeably. A tuple where every element comes from some finite set $\Sigma$ (such as $\{0, 1\}$) is also known as a string. Analogously to sets, we denote the length of a tuple $T$ by $|T|$. Just like sets, we can also think of infinite analogues of tuples, such as the ordered collection $(1, 4, 9, \ldots)$ of all perfect squares. Infinite ordered collections are known as sequences; we might sometimes use the term "infinite sequence" to emphasize this, and use "finite sequence" as a synonym for a tuple. (We can identify a sequence $(a_0, a_1, a_2, \ldots)$ of elements in some set $S$ with a function $A : \mathbb{N} \to S$ (where $a_n = A(n)$ for every $n \in \mathbb{N}$). Similarly, we can identify a $k$-tuple $(a_0, \ldots, a_{k-1})$ of elements in $S$ with a function $A : [k] \to S$.)
Cartesian product: If $S$ and $T$ are sets, then their Cartesian product, denoted by $S \times T$, is the set of all ordered pairs $(s, t)$ where $s \in S$ and $t \in T$. For example, if $S = \{1, 2, 3\}$ and $T = \{10, 12\}$, then $S \times T$ contains the $6$ elements $(1, 10), (2, 10), (3, 10), (1, 12), (2, 12), (3, 12)$. Similarly if $S, T, U$ are sets then $S \times T \times U$ is the set of all ordered triples $(s, t, u)$ where $s \in S$, $t \in T$, and $u \in U$. More generally, for every positive integer $n$ and sets $S_0, \ldots, S_{n-1}$, we denote by $S_0 \times S_1 \times \cdots \times S_{n-1}$ the set of ordered $n$-tuples $(s_0, \ldots, s_{n-1})$ where $s_i \in S_i$ for every $i \in \{0, \ldots, n-1\}$. For every set $S$, we denote the set $S \times S$ by $S^2$, $S \times S \times S$ by $S^3$, $S \times S \times S \times S$ by $S^4$, and so on and so forth.
1.4.2 Special sets There are several sets that we will use in this book time and again. The set

$\mathbb{N} = \{0, 1, 2, \ldots\}$ (1.3)

contains all natural numbers, i.e., non-negative integers. For any natural number $n \in \mathbb{N}$, we define the set $[n]$ as $\{0, \ldots, n-1\} = \{k \in \mathbb{N} : k < n\}$. (We start our indexing of both $\mathbb{N}$ and $[n]$ from $0$, while many other texts index those sets from $1$. Starting from zero or one is simply a convention that doesn't make much difference, as long as one is consistent about it.)

We will also occasionally use the set $\mathbb{Z} = \{\ldots, -2, -1, 0, +1, +2, \ldots\}$ of (negative and non-negative) integers,¹ as well as the set $\mathbb{R}$ of real numbers. (This is the set that includes not just the integers, but also fractional and irrational numbers; e.g., $\mathbb{R}$ contains numbers such as $+0.5$, $-\pi$, etc.) We denote by $\mathbb{R}_+$ the set $\{x \in \mathbb{R} : x > 0\}$ of positive real numbers. This set is sometimes also denoted as $(0, \infty)$.

¹ The letter Z stands for the German word "Zahlen", which means numbers.

Strings: Another set we will use time and again is

$\{0, 1\}^n = \{(x_0, \ldots, x_{n-1}) : x_0, \ldots, x_{n-1} \in \{0, 1\}\},$ (1.4)

which is the set of all $n$-length binary strings for some natural number $n$. That is, $\{0, 1\}^n$ is the set of all $n$-tuples of zeroes and ones. This is consistent with our notation above: $\{0, 1\}^2$ is the Cartesian product $\{0, 1\} \times \{0, 1\}$, $\{0, 1\}^3$ is the product $\{0, 1\} \times \{0, 1\} \times \{0, 1\}$ and so on. We will write the string $(x_0, x_1, \ldots, x_{n-1})$ as simply $x_0 x_1 \cdots x_{n-1}$. For example,

$\{0, 1\}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}.$ (1.5)

For every string $x \in \{0, 1\}^n$ and $i \in [n]$, we write $x_i$ for the $i^{th}$ element of $x$.
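In code, the correspondence between $\{0, 1\}^n$ and the $n$-fold Cartesian product is immediate; for instance, the following Python snippet (my illustration, not the book's code) enumerates $\{0, 1\}^3$ and reproduces the eight strings in (1.5):

from itertools import product

# {0,1}^3 as the Cartesian product {0,1} x {0,1} x {0,1}:
print(["".join(map(str, t)) for t in product([0, 1], repeat=3)])
# ['000', '001', '010', '011', '100', '101', '110', '111']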

We will also often talk about the set of binary strings of all lengths, which is

$\{0, 1\}^* = \{(x_0, \ldots, x_{n-1}) : n \in \mathbb{N},\ x_0, \ldots, x_{n-1} \in \{0, 1\}\}.$ (1.6)

Another way to write this set is as

$\{0, 1\}^* = \{0, 1\}^0 \cup \{0, 1\}^1 \cup \{0, 1\}^2 \cup \cdots$ (1.7)

or more concisely as

$\{0, 1\}^* = \cup_{n \in \mathbb{N}} \{0, 1\}^n.$ (1.8)

The set $\{0, 1\}^*$ includes the "string of length $0$" or "the empty string", which we will denote by "". (In using this notation we follow the convention of many programming languages. Other texts sometimes use $\epsilon$ or $\lambda$ to denote the empty string.)

Generalizing the star operation: For every set $\Sigma$, we define

$\Sigma^* = \cup_{n \in \mathbb{N}} \Sigma^n.$ (1.9)

For example, if $\Sigma = \{a, b, c, d, \ldots, z\}$ then $\Sigma^*$ denotes the set of all finite length strings over the alphabet a-z.
Concatenation: The concatenation of two strings $x \in \Sigma^n$ and $y \in \Sigma^m$ is the $(n + m)$-length string $xy$ obtained by writing $y$ after $x$. That is, if $x \in \{0, 1\}^n$ and $y \in \{0, 1\}^m$, then $xy$ is equal to the string $z \in \{0, 1\}^{n+m}$ such that $z_i = x_i$ for $i \in [n]$, and $z_i = y_{i-n}$ for $i \in \{n, \ldots, n + m - 1\}$.

1.4.3 Functions If $S$ and $T$ are nonempty sets, a function $F$ mapping $S$ to $T$, denoted by $F : S \to T$, associates with every element $x \in S$ an element $F(x) \in T$. The set $S$ is known as the domain of $F$ and the set $T$ is known as the codomain of $F$. The image of a function $F$ is the set $\{F(x) \mid x \in S\}$ which is the subset of $F$'s codomain consisting of all output elements that are mapped from some input. (Some texts use range to denote the image of a function, while other texts use range to denote the codomain of a function. Hence we will avoid using the term "range" altogether.) As in the case of sets, we can write a function either by listing the table of all the values it gives for elements in $S$ or by using a rule. For example if $S = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ and $T = \{0, 1\}$, then the table below defines a function $F : S \to T$. Note that this function is the same as the function defined by the rule $F(x) = (x \mod 2)$.²

² For two natural numbers $x$ and $a$, $x \mod a$ (shorthand for "modulo") denotes the remainder of $x$ when it is divided by $a$. That is, it is the number $r$ in $\{0, \ldots, a - 1\}$ such that $x = ak + r$ for some integer $k$. We sometimes also use the notation $x = y \ (\mod a)$ to denote the assertion that $x \mod a$ is the same as $y \mod a$.

Table 1.1: An example of a function.

Input  Output
0      0
1      1
2      0
3      1
4      0
5      1
6      0
7      1
8      0
9      1

If $f : S \to T$ satisfies that $f(x) \neq f(y)$ for all $x \neq y$ then we say that $f$ is one-to-one (Definition 1.1, also known as an injective function or simply an injection). If $F$ satisfies that for every $y \in T$ there is some $x \in S$ such that $F(x) = y$ then we say that $F$ is onto (also known as a surjective function or simply a surjection). A function that is both one-to-one and onto is known as a bijective function or simply a bijection. A bijection from a set $S$ to itself is also known as a permutation of $S$. If $F : S \to T$ is a bijection then for every $y \in T$ there is a unique $x \in S$ such that $F(x) = y$. We denote this value by $F^{-1}(y)$. Note that $F^{-1}$ is itself a bijection from $T$ to $S$ (can you see why?).
Giving a bijection between two sets is often a good way to show they have the same size. In fact, the standard mathematical definition of the notion that "$S$ and $T$ have the same cardinality" is that there exists a bijection $f : S \to T$. Further, the cardinality of a set $S$ is defined to be $n$ if there is a bijection from $S$ to the set $\{0, \ldots, n - 1\}$. As we will see later in this book, this is a definition that generalizes to defining the cardinality of infinite sets.

Partial functions: We will sometimes be interested in partial functions from $S$ to $T$. A partial function is allowed to be undefined on some subset of $S$. That is, if $F$ is a partial function from $S$ to $T$, then for every $s \in S$, either there is (as in the case of standard functions) an element $F(s)$ in $T$, or $F(s)$ is undefined. For example, the partial function $F(x) = \sqrt{x}$ is only defined on non-negative real numbers. When we want to distinguish between partial functions and standard (i.e., non-partial) functions, we will call the latter total functions. When we say "function" without any qualifier then we mean a total function.
The notion of partial functions is a strict generalization of functions, and so every function is a partial function, but not every partial function is a function. (That is, for every nonempty $S$ and $T$, the set of partial functions from $S$ to $T$ is a proper superset of the set of total functions from $S$ to $T$.) When we want to emphasize that a function $f$ from $A$ to $B$ might not be total, we will write $f : A \to_p B$. We can think of a partial function from $S$ to $T$ also as a total function from $S$ to $T \cup \{\bot\}$ where $\bot$ is a special "failure symbol". So, instead of saying that $F$ is undefined at $x$, we can say that $F(x) = \bot$.
Basic facts about functions: Verifying that you can prove the following results is an excellent way to brush up on functions:

• If $F : S \to T$ and $G : T \to U$ are one-to-one functions, then their composition $H : S \to U$ defined as $H(s) = G(F(s))$ is also one to one.

• If $F : S \to T$ is one to one, then there exists an onto function $G : T \to S$ such that $G(F(s)) = s$ for every $s \in S$.

• If $G : T \to S$ is onto then there exists a one-to-one function $F : S \to T$ such that $G(F(s)) = s$ for every $s \in S$.

• If $S$ and $T$ are finite sets then the following conditions are equivalent to one another: (a) $|S| \leq |T|$, (b) there is a one-to-one function $F : S \to T$, and (c) there is an onto function $G : T \to S$. (This is actually true even for infinite $S$ and $T$: in that case (b) (or equivalently (c)) is the commonly accepted definition for $|S| \leq |T|$.)

Figure 1.4: We can represent finite functions as a directed graph where we put an edge from $x$ to $f(x)$. The onto condition corresponds to requiring that every vertex in the codomain of the function has in-degree at least one. The one-to-one condition corresponds to requiring that every vertex in the codomain of the function has in-degree at most one. In the examples above $F$ is an onto function, $G$ is one to one, and $H$ is neither onto nor one to one.

P You can find the proofs of these results in many discrete math texts, including for example, Section 4.5 in the Lehman-Leighton-Meyer notes. However, I strongly suggest you try to prove them on your own, or at least convince yourself that they are true by proving special cases of those for small sizes (e.g., $|S| = 3, |T| = 4, |U| = 5$).

Let us prove one of these facts as an example:

Lemma 1.2 If $S, T$ are non-empty sets and $F : S \to T$ is one to one, then there exists an onto function $G : T \to S$ such that $G(F(s)) = s$ for every $s \in S$.

Proof. Choose some $s_0 \in S$. We will define the function $G : T \to S$ as follows: for every $t \in T$, if there is some $s \in S$ such that $F(s) = t$ then set $G(t) = s$ (the choice of $s$ is well defined since by the one-to-one property of $F$, there cannot be two distinct $s, s'$ that both map to $t$). Otherwise, set $G(t) = s_0$. Now for every $s \in S$, by the definition of $G$, if $t = F(s)$ then $G(t) = G(F(s)) = s$. Moreover, this also shows that $G$ is onto, since it means that for every $s \in S$ there is some $t$, namely $t = F(s)$, such that $G(t) = s$. ■
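The proof of Lemma 1.2 is constructive, and for finite sets we can mirror it directly in code. The following Python sketch (my illustration, not part of the book's text) builds the onto function $G$ from a one-to-one $F$ represented as a dictionary:

def invert_one_to_one(F, T, s0):
    # Given a one-to-one F (a dict from S to T), the codomain T, and a
    # default element s0 of S, return G: T -> S with G(F(s)) = s.
    G = {t: s0 for t in T}   # G(t) = s0 for t outside the image of F
    for s, t in F.items():
        G[t] = s             # well defined since F is one-to-one
    return G

F = {1: "a", 2: "c"}         # one-to-one from S = {1, 2} to T = {"a", "b", "c"}
G = invert_one_to_one(F, {"a", "b", "c"}, s0=1)
assert all(G[F[s]] == s for s in F)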

1.4.4 Graphs Graphs are ubiquitous in Computer Science, and many other fields as well. They are used to model a variety of data types including social networks, scheduling constraints, road networks, deep neural nets, gene interactions, correlations between observations, and a great many more. Formal definitions of several kinds of graphs are given next, but if you have not seen graphs before in a course, I urge you to read up on them in one of the sources mentioned in Section 1.9. Graphs come in two basic flavors: undirected and directed.³

³ It is possible, and sometimes useful, to think of an undirected graph as the special case of a directed graph that has the special property that for every pair $u, v$ either both the edges $(u, v)$ and $(v, u)$ are present or neither of them is. However, in many settings there is a significant difference between undirected and directed graphs, and so it's typically best to think of them as separate categories.

Definition 1.3 — Undirected graphs. An undirected graph $G = (V, E)$ consists of a set $V$ of vertices and a set $E$ of edges. Every edge is a size two subset of $V$. We say that two vertices $u, v \in V$ are neighbors, if the edge $\{u, v\}$ is in $E$.

Given this definition, we can define several other properties of graphs and their vertices. We define the degree of $u$ to be the number of neighbors $u$ has. A path in the graph is a tuple $(u_0, \ldots, u_k) \in V^{k+1}$, for some $k > 0$ such that $u_{i+1}$ is a neighbor of $u_i$ for every $i \in [k]$. A simple path is a path $(u_0, \ldots, u_{k-1})$ where all the $u_i$'s are distinct. A cycle is a path $(u_0, \ldots, u_k)$ where $u_0 = u_k$. We say that two vertices $u, v \in V$ are connected if either $u = v$ or there is a path $(u_0, \ldots, u_k)$ where $u_0 = u$ and $u_k = v$. We say that the graph $G$ is connected if every pair of vertices in it is connected.

Figure 1.5: An example of an undirected and a directed graph. The undirected graph has vertex set $\{1, 2, 3, 4\}$ and edge set $\{\{1, 2\}, \{2, 3\}, \{2, 4\}\}$. The directed graph has vertex set $\{a, b, c\}$ and the edge set $\{(a, b), (b, c), (c, a), (a, c)\}$.

Here are some basic facts about undirected graphs. We give some informal arguments below, but leave the full proofs as exercises (the proofs can be found in many of the resources listed in Section 1.9).

Lemma 1.4 In any undirected graph $G = (V, E)$, the sum of the degrees of all vertices is equal to twice the number of edges.

Lemma 1.4 can be shown by seeing that every edge $\{u, v\}$ contributes twice to the sum of the degrees (once for $u$ and the second time for $v$).

Lemma 1.5 The connectivity relation is transitive, in the sense that if $u$ is connected to $v$, and $v$ is connected to $w$, then $u$ is connected to $w$.

Lemma 1.5 can be shown by simply attaching a path of the form $(u, u_1, u_2, \ldots, u_{k-1}, v)$ to a path of the form $(v, u'_1, \ldots, u'_{k'-1}, w)$ to obtain the path $(u, u_1, \ldots, u_{k-1}, v, u'_1, \ldots, u'_{k'-1}, w)$ that connects $u$ to $w$.

Lemma 1.6 For every undirected graph $G = (V, E)$ and connected pair $u, v$, the shortest path from $u$ to $v$ is simple. In particular, for every connected pair there exists a simple path that connects them.

Lemma 1.6 can be shown by "shortcutting" any non simple path from $u$ to $v$ where the same vertex $w$ appears twice to remove it (see Fig. 1.6). It is a good exercise to transform this intuitive reasoning into a formal proof:

Figure 1.6: If there is a path from $u$ to $v$ in a graph that passes twice through a vertex $w$ then we can "shortcut" it by removing the loop from $w$ to itself to find a path from $u$ to $v$ that only passes once through $w$.

Solved Exercise 1.1 — Connected vertices have simple paths. Prove Lemma 1.6. ■

Solution: The proof follows the idea illustrated in Fig. 1.6. One complication is that there can be more than one vertex that is visited twice by a path, and so "shortcutting" might not necessarily result in a simple path; we deal with this by looking at a shortest path between $u$ and $v$. Details follow.
Let $G = (V, E)$ be a graph and $u$ and $v$ in $V$ be two connected vertices in $G$. We will prove that there is a simple path between $u$ and $v$. Let $k$ be the shortest length of a path between $u$ and $v$ and let $P = (u_0, u_1, u_2, \ldots, u_{k-1}, u_k)$ be a $k$-length path from $u$ to $v$ (there can be more than one such path: if so we just choose one of them). (That is, $u_0 = u$, $u_k = v$, and $(u_\ell, u_{\ell+1}) \in E$ for all $\ell \in [k]$.) We claim that $P$ is simple. Indeed, suppose otherwise that there is some vertex $w$ that occurs twice in the path: $w = u_i$ and $w = u_j$ for some $i < j$. Then we can "shortcut" the path $P$ by considering the path $P' = (u_0, u_1, \ldots, u_{i-1}, w, u_{j+1}, \ldots, u_k)$ obtained by taking the first $i$ vertices of $P$ (from $u_0 = u$ to the first occurrence of $w$) and the last $k - j$ ones (from the vertex $u_{j+1}$ following the second occurrence of $w$ to $u_k = v$). The path $P'$ is a valid path between $u$ and $v$ since every consecutive pair of vertices in it is connected by an edge (in particular, since $w = u_i = u_j$, both $(u_{i-1}, w)$ and $(w, u_{j+1})$ are edges in $E$), but since the length of $P'$ is $k - (j - i) < k$, this contradicts the minimality of $P$. ■
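The "shortcut" operation at the heart of this proof is easy to express in code; here is an illustrative Python sketch (my addition, not the book's) that removes the loop between the first two occurrences of a repeated vertex:

def shortcut(path):
    # Remove the loop between the first and second occurrence of a
    # repeated vertex; return the path unchanged if it is simple.
    seen = {}
    for idx, w in enumerate(path):
        if w in seen:                   # w = u_i = u_j with i < j
            i, j = seen[w], idx
            return path[:i] + path[j:]  # (u_0,...,u_{i-1}, w, u_{j+1},...,u_k)
        seen[w] = idx
    return path

assert shortcut([0, 1, 2, 1, 3]) == [0, 1, 3]

Applying shortcut repeatedly until the path stops changing yields a simple path, just as taking a shortest path does in the proof.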

R Remark 1.7 — Finding proofs. Solved Exercise 1.1 is a good example of the process of finding a proof. You start by ensuring you understand what the statement means, and then come up with an informal argument why it should be true. You then transform the informal argument into a rigorous proof. This proof need not be very long or overly formal, but should clearly establish why the conclusion of the statement follows from its assumptions.

The concepts of degrees and connectivity extend naturally to directed graphs, defined as follows.

Definition 1.8 — Directed graphs. A directed graph $G = (V, E)$ consists of a set $V$ and a set $E \subseteq V \times V$ of ordered pairs of $V$. We sometimes denote the edge $(u, v)$ also as $u \to v$. If the edge $u \to v$ is present in the graph then we say that $v$ is an out-neighbor of $u$ and $u$ is an in-neighbor of $v$.

A directed graph might contain both $u \to v$ and $v \to u$, in which case $u$ will be both an in-neighbor and an out-neighbor of $v$ and vice versa. The in-degree of $u$ is the number of in-neighbors it has, and the out-degree of $u$ is the number of out-neighbors it has. A path in the

graph is a tuple $(u_0, \ldots, u_k) \in V^{k+1}$, for some $k > 0$ such that $u_{i+1}$ is an out-neighbor of $u_i$ for every $i \in [k]$. As in the undirected case, a simple path is a path $(u_0, \ldots, u_{k-1})$ where all the $u_i$'s are distinct and a cycle is a path $(u_0, \ldots, u_k)$ where $u_0 = u_k$. One type of directed graphs we often care about is directed acyclic graphs or DAGs, which, as their name implies, are directed graphs without any cycles:

Definition 1.9 — Directed Acyclic Graphs. We say that $G = (V, E)$ is a directed acyclic graph (DAG) if it is a directed graph and there does not exist a list of vertices $u_0, u_1, \ldots, u_k \in V$ such that $u_0 = u_k$ and for every $i \in [k]$, the edge $u_i \to u_{i+1}$ is in $E$.

The lemmas we mentioned above have analogs for directed graphs. We again leave the proofs (which are essentially identical to their undirected analogs) as exercises.

Lemma 1.10 In any directed graph $G = (V, E)$, the sum of the in-degrees is equal to the sum of the out-degrees, which is equal to the number of edges.

Lemma 1.11 In any directed graph $G$, if there is a path from $u$ to $v$ and a path from $v$ to $w$, then there is a path from $u$ to $w$.

Lemma 1.12 For every directed graph $G = (V, E)$ and a pair $u, v$ such that there is a path from $u$ to $v$, the shortest path from $u$ to $v$ is simple.

R Remark 1.13 — Labeled graphs. For some applications we will consider labeled graphs, where the vertices or edges have associated labels (which can be numbers, strings, or members of some other set). We can think of such a graph as having an associated (possibly partial) labelling function $L : V \cup E \to \mathcal{L}$, where $\mathcal{L}$ is the set of potential labels. However we will typically not refer explicitly to this labeling function and simply say things such as "vertex $v$ has the label $\alpha$".
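As a quick sanity check of Lemma 1.10, the following Python snippet (my illustration, not the book's code) represents the directed graph of Fig. 1.5 by its edge set and compares the degree sums to the number of edges:

E = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")}  # the directed graph of Fig. 1.5
V = {u for e in E for u in e}

out_deg = {v: sum(1 for (a, b) in E if a == v) for v in V}
in_deg = {v: sum(1 for (a, b) in E if b == v) for v in V}

# Lemma 1.10: both sums equal the number of edges.
assert sum(out_deg.values()) == sum(in_deg.values()) == len(E)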

1.4.5 Logic operators and quantifiers If $P$ and $Q$ are some statements that can be true or false, then $P$ AND $Q$ (denoted as $P \wedge Q$) is a statement that is true if and only if both $P$ and $Q$ are true, and $P$ OR $Q$ (denoted as $P \vee Q$) is a statement that is true if and only if either $P$ or $Q$ is true. The negation of $P$, denoted as $\neg P$ or $\overline{P}$, is true if and only if $P$ is false.
Suppose that $P(x)$ is a statement that depends on some parameter $x$ (also sometimes known as an unbound variable) in the sense that for every instantiation of $x$ with a value from some set $S$, $P(x)$ is either true or false. For example, $x > 7$ is a statement that is not a priori

true or false, but becomes true or false whenever we instantiate $x$ with some real number. We denote by $\forall_{x \in S} P(x)$ the statement that is true if and only if $P(x)$ is true for every $x \in S$.⁴ We denote by $\exists_{x \in S} P(x)$ the statement that is true if and only if there exists some $x \in S$ such that $P(x)$ is true.
For example, the following is a formalization of the true statement that there exists a natural number $n$ larger than $100$ that is not divisible by $3$:

$\exists_{n \in \mathbb{N}} (n > 100) \wedge (\forall_{k \in \mathbb{N}}\ k + k + k \neq n).$ (1.10)

⁴ In this book, we place the variable bound by a quantifier in a subscript and so write $\forall_{x \in S} P(x)$. Many other texts do not use this subscript notation and so will write the same statement as $\forall x \in S,\ P(x)$.
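In the spirit of the book's occasional "proof by Python", the following snippet (my addition, not the book's code) finds a witness for the existential statement (1.10) by brute-force search:

# Find some n > 100 that is not equal to k + k + k for any k,
# i.e., some n > 100 that is not divisible by 3.
witness = next(n for n in range(101, 200) if n % 3 != 0)
print(witness)  # prints 101, so statement (1.10) is true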

"For sufficiently large $n$." One expression that we will see come up time and again in this book is the claim that some statement $P(n)$ is true "for sufficiently large $n$". What this means is that there exists an integer $N_0$ such that $P(n)$ is true for every $n > N_0$. We can formalize this as $\exists_{N_0 \in \mathbb{N}} \forall_{n > N_0} P(n)$.

1.4.6 Quantifiers for summations and products The following shorthands for summing up or taking products of several numbers are often convenient. If $S = \{s_0, \ldots, s_{n-1}\}$ is a finite set and $f : S \to \mathbb{R}$ is a function, then we write $\sum_{x \in S} f(x)$ as shorthand for

$f(s_0) + f(s_1) + f(s_2) + \ldots + f(s_{n-1}),$ (1.11)

and $\prod_{x \in S} f(x)$ as shorthand for

$f(s_0) \cdot f(s_1) \cdot f(s_2) \cdot \ldots \cdot f(s_{n-1}).$ (1.12)

For example, the sum of the squares of all numbers from $1$ to $100$ can be written as

$\sum_{i \in \{1, \ldots, 100\}} i^2.$ (1.13)

Since summing up over intervals of integers is so common, there is a special notation for it. For every two integers $a \leq b$, $\sum_{i=a}^{b} f(i)$ denotes $\sum_{i \in S} f(i)$ where $S = \{x \in \mathbb{Z} : a \leq x \leq b\}$. Hence, we can write the sum (1.13) as

$\sum_{i=1}^{100} i^2.$ (1.14)

1.4.7 Parsing formulas: bound and free variables In mathematics, as in coding, we often have symbolic "variables" or "parameters". It is important to be able to understand, given some formula, whether a given variable is bound or free in this formula. For example, in the following statement $n$ is free but $a$ and $b$ are bound by the $\exists$ quantifier:

$\exists_{a,b \in \mathbb{N}} (a \neq 1) \wedge (a \neq n) \wedge (n = a \times b)$ (1.15)

Since $n$ is free, it can be set to any value, and the truth of the statement (1.15) depends on the value of $n$. For example, if $n = 8$ then (1.15) is true, but for $n = 11$ it is false. (Can you see why?)
The same issue appears when parsing code. For example, in the following snippet from the C programming language

for (int i=0 ; i<n ; i++) {
    ...
}

the variable i is bound within the for block but the variable n is free. The main property of bound variables is that we can rename them (as long as the new name doesn’t conflict with another used variable) without changing the meaning of the statement. Thus for example the statement

$\exists_{x,y \in \mathbb{N}} (x \neq 1) \wedge (x \neq n) \wedge (n = x \times y)$ (1.16)

is equivalent to (1.15) in the sense that it is true for exactly the same set of $n$'s. Similarly, the code

for (int j=0 ; j<n ; j++) {
    ...
}

produces the same result as the code above that used i instead of j.

R Remark 1.14 — Aside: mathematical vs programming notation. Mathematical notation has a lot of similarities with programming languages, and for the same reasons. Both are formalisms meant to convey complex concepts in a precise way. However, there are some cultural differences. In programming languages, we often try to use meaningful variable names such as NumberOfVertices while in math we often use short identifiers such as $n$. Part of it might have to do with the tradition of mathematical proofs as being handwritten and verbally presented, as opposed to typed up and compiled. Another reason is that if the wrong variable name is used in a proof, at worst it causes confusion to readers; when the wrong variable name is used in a program, planes might crash, patients might die, and rockets could explode. One consequence of that is that in mathematics we often end up reusing identifiers, and also "run out" of letters and hence use Greek letters too, as well as distinguish between small and capital letters and different font faces. Similarly, mathematical notation tends to use quite a lot of "overloading", using operators such as $+$ for a great variety of objects (e.g., real numbers, matrices, finite field elements, etc.), and assuming that the meaning can be inferred from the context.
Both fields have a notion of "types", and in math we often try to reserve certain letters for variables of a particular type. For example, variables such as $i, j, k, \ell, m, n$ will often denote integers, and $\epsilon$ will often denote a small positive real number (see Section 1.7 for more on these conventions). When reading or writing mathematical texts, we usually don't have the advantage of a "compiler" that will check type safety for us. Hence it is important to keep track of the type of each variable, and see that the operations that are performed on it "make sense".
Kun's book [Kun18] contains an extensive discussion on the similarities and differences between the cultures of mathematics and programming.

1.4.8 Asymptotics and Big-$O$ notation
"$\log \log \log n$ has been proved to go to infinity, but has never been observed to do so.", Anonymous, quoted by Carl Pomerance (2000)

It is often very cumbersome to describe precisely quantities such as running time, and it is also not needed, since we are typically mostly interested in the "higher order terms". That is, we want to understand the scaling behavior of the quantity as the input variable grows. For example, as far as running time goes, the difference between an $n^5$-time algorithm and an $n^2$-time one is much more significant than the difference between a $100n^2 + 10n$ time algorithm and a $10n^2$ time algorithm. For this purpose, $O$-notation is extremely useful as a way to "declutter" our text and focus our attention on what really matters. For example, using $O$-notation, we can say that both $100n^2 + 10n$ and $10n^2$ are simply $\Theta(n^2)$ (which informally means "the same up to constant factors"), while $n^2 = o(n^5)$ (which informally means that $n^2$ is "much smaller than" $n^5$).

Generally (though still informally), if $F, G$ are two functions mapping natural numbers to non-negative reals, then "$F = O(G)$" means that $F(n) \le G(n)$ if we don't care about constant factors, while "$F = o(G)$" means that $F$ is much smaller than $G$, in the sense that no matter by what constant factor we multiply $F$, if we take $n$ to be large

enough then $G$ will be bigger (for this reason, $F = o(G)$ is sometimes written as $F \ll G$). We will write $F = \Theta(G)$ if $F = O(G)$ and $G = O(F)$, which one can think of as saying that $F$ is the same as $G$ if we don't care about constant factors. More formally, we define Big-$O$ notation as follows:

Definition 1.15 — Big-$O$ notation. Let $\mathbb{R}_+ = \{x \in \mathbb{R} \mid x > 0\}$ be the set of positive real numbers. For two functions $F, G : \mathbb{N} \to \mathbb{R}_+$, we say that $F = O(G)$ if there exist numbers $a, N_0 \in \mathbb{N}$ such that $F(n) \le a \cdot G(n)$ for every $n > N_0$. We say that $F = \Theta(G)$ if $F = O(G)$ and $G = O(F)$. We say that $F = \Omega(G)$ if $G = O(F)$.

We say that $F = o(G)$ if for every $\epsilon > 0$ there is some $N_0$ such that $F(n) < \epsilon G(n)$ for every $n > N_0$. We say that $F = \omega(G)$ if $G = o(F)$.

It's often convenient to use "anonymous functions" in the context of $O$-notation. For example, when we write a statement such as $F(n) = O(n^3)$, we mean that $F = O(G)$ where $G$ is the function defined by $G(n) = n^3$. Chapter 7 in Jim Aspnes' notes on discrete math provides a good summary of $O$ notation; see also this tutorial for a gentler and more programmer-oriented introduction.

$O$ is not equality. Using the equality sign for $O$-notation is extremely common, but is somewhat of a misnomer, since a statement such as $F = O(G)$ really means that $F$ is in the set $\{G' : \exists_{N,c} \text{ s.t. } \forall_{n > N},\ G'(n) \le cG(n)\}$. If anything, it makes more sense to use inequalities and write $F \le O(G)$ and $F \ge \Omega(G)$, reserving equality for $F = \Theta(G)$, and so we will sometimes use this notation too, but since the equality notation is quite firmly entrenched we often stick to it as well. (Some texts write $F \in O(G)$ instead of $F = O(G)$, but we will not use this notation.) Despite the misleading equality sign, you should remember that a statement such as $F = O(G)$ means that $F$ is "at most" $G$ in some rough sense when we ignore constants, and a statement such as $F = \Omega(G)$ means that $F$ is "at least" $G$ in the same rough sense.

Figure 1.7: If $F(n) = o(G(n))$ then for sufficiently large $n$, $F(n)$ will be smaller than $G(n)$. For example, if Algorithm $A$ runs in time $1000 \cdot n + 10^6$ and Algorithm $B$ runs in time $0.01 \cdot n^2$, then even though $B$ might be more efficient for smaller inputs, when the inputs get sufficiently large, $A$ will run much faster than $B$.
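As a quick worked example of Definition 1.15 (the specific constants are our own choice, not from the text): to verify that $100n^2 + 10n = O(n^2)$, we can take $a = 110$ and $N_0 = 1$, since for every $n > N_0$ we have $100n^2 + 10n \le 100n^2 + 10n^2 = 110 \cdot n^2$.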

1.4.9 Some "rules of thumb" for Big-$O$ notation
There are some simple heuristics that can help when trying to compare two functions $F$ and $G$:

• Multiplicative constants don't matter in $O$-notation, and so if $F(n) = O(G(n))$ then $100F(n) = O(G(n))$.

• When adding two functions, we only care about the larger one. For example, for the purpose of $O$-notation, $n^3 + 100n^2$ is the same as $n^3$, and in general in any polynomial, we only care about the larger exponent.

• For every two constants $a, b > 0$, $n^a = O(n^b)$ if and only if $a \le b$, and $n^a = o(n^b)$ if and only if $a < b$. For example, combining the two observations above, $100n^2 + 10n + 100 = o(n^3)$.

• Polynomial is always smaller than exponential: $n^a = o(2^{n^\epsilon})$ for every two constants $a > 0$ and $\epsilon > 0$, even if $\epsilon$ is much smaller than $a$. For example, $100n^{100} = o(2^{\sqrt{n}})$.

• Similarly, logarithmic is always smaller than polynomial: $(\log n)^a$ (which we write as $\log^a n$) is $o(n^\epsilon)$ for every two constants $a, \epsilon > 0$. For example, combining the observations above, $100n^2 \log^{100} n = o(n^3)$.
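To make these rules concrete, here is a small numerical experiment in Python (our own illustration, not from the text; the function names and sample points are arbitrary). It is a sanity check rather than a proof: it prints the ratios $F(n)/G(n)$ and $G(n)/H(n)$ for growing $n$.

import math

def F(n): return 100 * n**2 + 10 * n + 100   # same as n^2 up to constant factors
def G(n): return n**3
def H(n): return 2 ** math.sqrt(n)           # eventually dominates any polynomial

for n in [10, 100, 10_000, 1_000_000]:
    print(n, F(n) / G(n), G(n) / H(n))

Both ratios eventually shrink toward zero, illustrating that $100n^2 + 10n + 100 = o(n^3)$ and $n^3 = o(2^{\sqrt{n}})$.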

Remark 1.16 — Big $O$ for other applications (optional). While Big-$O$ notation is often used to analyze running time of algorithms, this is by no means the only application. We can use $O$ notation to bound asymptotic relations between any functions mapping integers to positive numbers. It can be used regardless of whether these functions are a measure of running time, memory usage, or any other quantity that may have nothing to do with computation. Here is one example which is unrelated to this book (and hence one that you can feel free to skip): one way to state the Riemann Hypothesis (one of the most famous open questions in mathematics) is that it corresponds to the conjecture that the number of primes between $0$ and $n$ is equal to $\int_2^n \frac{1}{\ln x}\, dx$ up to an additive error of magnitude at most $O(\sqrt{n} \log n)$.

1.5 PROOFS

Many people think of mathematical proofs as a sequence of logical deductions that starts from some axioms and ultimately arrives at a conclusion. In fact, some dictionaries define proofs that way. This is not entirely wrong, but at its essence a mathematical proof of a statement X is simply an argument that convinces the reader that X is true beyond a shadow of a doubt. To produce such a proof you need to:

1. Understand precisely what X means.

2. Convince yourself that X is true.

3. Write your reasoning down in plain, precise and concise English (using formulas or notation only when they help clarity).

In many cases, the first part is the most important one. Understanding what a statement means is oftentimes more than halfway towards understanding why it is true. In the third part, to convince the reader beyond a shadow of a doubt, we will often want to break down the reasoning to "basic steps", where each basic step is simple enough to be "self evident". The combination of all steps yields the desired statement.

1.5.1 Proofs and programs
There is a great deal of similarity between the process of writing proofs and that of writing programs, and both require a similar set of skills. Writing a program involves:

1. Understanding what is the task we want the program to achieve.

2. Convincing yourself that the task can be achieved by a computer, perhaps by planning on a whiteboard or notepad how you will break it up into simpler tasks.

3. Converting this plan into code that a compiler or interpreter can understand, by breaking up each task into a sequence of the basic operations of some programming language.

In programs as in proofs, step 1 is often the most important one. A key difference is that the reader for proofs is a human being and the reader for programs is a computer. (This difference is eroding with time as more proofs are being written in a machine verifiable form; moreover, to ensure correctness and maintainability of programs, it is important that they can be read and understood by humans.) Thus our emphasis is on readability and having a clear logical flow for our proof (which is not a bad idea for programs as well). When writing a proof, you should think of your audience as an intelligent but highly skeptical and somewhat petty reader, that will "call foul" at every step that is not well justified.

1.5.2 Proof writing style
A mathematical proof is a piece of writing, but it is a specific genre of writing with certain conventions and preferred styles. As in any writing, practice makes perfect, and it is also important to revise your drafts for clarity.

In a proof for the statement $X$, all the text between the words "Proof:" and "QED" should be focused on establishing that $X$ is true. Digressions, examples, or ruminations should be kept outside these two words, so they do not confuse the reader. The proof should have a clear logical flow in the sense that every sentence or equation in it should have some purpose and it should be crystal-clear to the reader

what this purpose is. When you write a proof, for every equation or sentence you include, ask yourself:

1. Is this sentence or equation stating that some statement is true?

2. If so, does this statement follow from the previous steps, or are we going to establish it in the next step?

3. What is the role of this sentence or equation? Is it one step towards proving the original statement, or is it a step towards proving some intermediate claim that you have stated before?

4. Finally, would the answers to questions 1-3 be clear to the reader? If not, then you should reorder, rephrase or add explanations.

Some helpful resources on mathematical writing include this handout by Lee, this handout by Hutchings, as well as several of the excellent handouts in Stanford's CS 103 class.

1.5.3 Patterns in proofs
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic.", Lewis Carroll, Through the looking-glass.

Just like in programming, there are several common patterns of proofs that occur time and again. Here are some examples:

Proofs by contradiction: One way to prove that $X$ is true is to show that if $X$ was false it would result in a contradiction. Such proofs often start with a sentence such as "Suppose, towards a contradiction, that $X$ is false" and end with deriving some contradiction (such as a violation of one of the assumptions in the theorem statement). Here is an example:

Lemma 1.17 There are no natural numbers $a, b$ such that $\sqrt{2} = \frac{a}{b}$.

Proof. Suppose, towards a contradiction, that this is false, and so let $a \in \mathbb{N}$ be the smallest number such that there exists some $b \in \mathbb{N}$ satisfying $\sqrt{2} = a/b$. Squaring this equation we get that $2 = a^2/b^2$ or $a^2 = 2b^2$ (*). But this means that $a^2$ is even, and since the product of two odd numbers is odd, it means that $a$ is even as well, or in other words, $a = 2a'$ for some $a' \in \mathbb{N}$. Yet plugging this into (*) shows that $4a'^2 = 2b^2$ which means $b^2 = 2a'^2$ is an even number as well. By the same considerations as above we get that $b$ is even and hence $a/2$ and $b/2$ are two natural numbers satisfying $\sqrt{2} = \frac{a/2}{b/2}$, contradicting the minimality of $a$. ■

Proofs of a universal statement: Often we want to prove a statement of the form "Every object of type $O$ has property $P$." Such proofs often start with a sentence such as "Let $o$ be an object of type $O$" and end by showing that $o$ has the property $P$. Here is a simple example:

Lemma 1.18 For every natural number $n \in \mathbb{N}$, either $n$ or $n + 1$ is even.

Proof. Let $n \in \mathbb{N}$ be some number. If $n/2$ is a whole number then we are done, since then $n = 2(n/2)$ and hence it is even. Otherwise, $n/2 + 1/2$ is a whole number, and hence $2(n/2 + 1/2) = n + 1$ is even. ■

Proofs of an implication: Another common case is that the statement $X$ has the form "$A$ implies $B$". Such proofs often start with a sentence such as "Assume that $A$ is true" and end with a derivation of $B$ from $A$. Here is a simple example:

Lemma 1.19 If $b^2 \ge 4ac$ then there is a solution to the quadratic equation $ax^2 + bx + c = 0$.

Proof. Suppose that $b^2 \ge 4ac$. Then $d = b^2 - 4ac$ is a non-negative number and hence it has a square root $s$. Thus $x = (-b + s)/(2a)$ satisfies

$$ax^2 + bx + c = a(-b + s)^2/(4a^2) + b(-b + s)/(2a) + c = (b^2 - 2bs + s^2)/(4a) + (-b^2 + bs)/(2a) + c . \qquad (1.17)$$

Rearranging the terms of (1.17) we get

$$s^2/(4a) + c - b^2/(4a) = (b^2 - 4ac)/(4a) + c - b^2/(4a) = 0 . \qquad (1.18)$$

■

Proofs of equivalence: If a statement has the form "$A$ if and only if $B$" (often shortened as "$A$ iff $B$") then we need to prove both that $A$ implies $B$ and that $B$ implies $A$. We call the implication that $A$ implies $B$ the "only if" direction, and the implication that $B$ implies $A$ the "if" direction.

Proofs by combining intermediate claims: When a proof is more complex, it is often helpful to break it apart into several steps. That is, to prove the statement $X$, we might first prove statements $X_1$, $X_2$, and $X_3$ and then prove that $X_1 \wedge X_2 \wedge X_3$ implies $X$. (Recall that $\wedge$ denotes the logical AND operator.)

Proofs by case distinction: This is a special case of the above, where to prove a statement $X$ we split into several cases $C_1, \ldots, C_k$, and prove that (a) the cases are exhaustive, in the sense that one of the cases $C_i$ must happen, and (b) go one by one and prove that each one of the cases $C_i$ implies the result $X$ that we are after.

Proofs by induction: We discuss induction and give an example in Section 1.6.1 below. We can think of such proofs as a variant of the above, where we have an unbounded number of intermediate claims $X_0, X_1, \ldots, X_k$, and we prove that $X_0$ is true, as well as that $X_0$ implies $X_1$, and that $X_0 \wedge X_1$ implies $X_2$, and so on and so forth. The website for CMU course 15-251 contains a useful handout on potential pitfalls when making proofs by induction.

"Without loss of generality (w.l.o.g)": This term can be initially quite confusing. It is essentially a way to simplify proofs by case distinctions. The idea is that if Case 1 is equal to Case 2 up to a change of variables or a similar transformation, then the proof of Case 1 will also imply the proof of Case 2. It is always a statement that should be viewed with suspicion. Whenever you see it in a proof, ask yourself if you understand why the assumption made is truly without loss of generality, and when you use it, try to see if the use is indeed justified. When writing a proof, sometimes it might be easiest to simply repeat the proof of the second case (adding a remark that the proof is very similar to the first one).

Remark 1.20 — Hierarchical Proofs (optional). Mathematical proofs are ultimately written in English prose. The well-known computer scientist Leslie Lamport argues that this is a problem, and proofs should be written in a more formal and rigorous way. In his manuscript he proposes an approach for structured hierarchical proofs, that have the following form:

• A proof for a statement of the form "If $A$ then $B$" is a sequence of numbered claims, starting with the assumption that $A$ is true, and ending with the claim that $B$ is true.
• Every claim is followed by a proof showing how it is derived from the previous assumptions or claims.
• The proof for each claim is itself a sequence of subclaims.

The advantage of Lamport's approach is that the role that every sentence in the proof plays is very clear. It is also much easier to transform such proofs into machine-checkable forms. The disadvantage is that such proofs can be tedious to read and write, with less differentiation between the important parts of the arguments versus the more routine ones.

1.6 EXTENDED EXAMPLE: TOPOLOGICAL SORTING

In this section we will prove the following: every directed acyclic graph (DAG, see Definition 1.9) can be arranged in layers so that for all directed edges $u \to v$, the layer of $v$ is larger than the layer of $u$. This result is known as topological sorting and is used in many applications, including task scheduling, build systems, software package management, spreadsheet calculations, and many others (see Fig. 1.8). In fact, we will also use it ourselves later on in this book.

Figure 1.8: An example of topological sorting. We consider a directed graph corresponding to a prerequisite graph of the courses in some Computer Science program. The edge $u \to v$ means that the course $u$ is a prerequisite for the course $v$. A layering or "topological sorting" of this graph is the same as mapping the courses to semesters so that if we decide to take the course $v$ in semester $f(v)$, then we have already taken all the prerequisites for $v$ (i.e., its in-neighbors) in prior semesters.

We start with the following definition. A layering of a directed graph is a way to assign to every vertex $v$ a natural number (corresponding to its layer), such that $v$'s in-neighbors are in lower-numbered layers than $v$, and $v$'s out-neighbors are in higher-numbered layers. The formal definition is as follows:

Definition 1.21 — Layering of a DAG. Let $G = (V, E)$ be a directed graph. A layering of $G$ is a function $f : V \to \mathbb{N}$ such that for every edge $u \to v$ of $G$, $f(u) < f(v)$.

In this section we prove that a directed graph is acyclic if and only if it has a valid layering.
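Definition 1.21 translates directly into a short Python check (a sketch of our own, not from the book; it represents a graph by its list of directed edges):

def is_layering(f, edges):
    # f maps each vertex to its layer; edges lists the pairs (u, v),
    # one per directed edge u -> v. Definition 1.21 requires f[u] < f[v].
    return all(f[u] < f[v] for (u, v) in edges)

edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(is_layering({"a": 0, "b": 1, "c": 2}, edges))  # True
print(is_layering({"a": 2, "b": 1, "c": 0}, edges))  # False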

Theorem 1.22 — Topological Sort. Let $G$ be a directed graph. Then $G$ is acyclic if and only if there exists a layering $f$ of $G$.

To prove such a theorem, we need to first understand what it means. Since it is an "if and only if" statement, Theorem 1.22 corresponds to two statements:

Lemma 1.23 For every directed graph $G$, if $G$ is acyclic then it has a layering.

Lemma 1.24 For every directed graph $G$, if $G$ has a layering, then it is acyclic.

To prove Theorem 1.22 we need to prove both Lemma 1.23 and Lemma 1.24. Lemma 1.24 is actually not that hard to prove. Intuitively, if $G$ contains a cycle, then it cannot be the case that all the edges on the cycle increase in layer number, since if we travel along the cycle then at some point we must come back to the place we started from. The formal proof is as follows:

Proof. Let $G = (V, E)$ be a directed graph and let $f : V \to \mathbb{N}$ be a layering of $G$ as per Definition 1.21. Suppose, towards a contradiction, that $G$ is not acyclic, and hence there exists some cycle $u_0, u_1, \ldots, u_k$ such that $u_0 = u_k$ and for every $i \in [k]$ the edge $u_i \to u_{i+1}$ is present in $G$. Since $f$ is a layering, for every $i \in [k]$, $f(u_i) < f(u_{i+1})$, which means that

$$f(u_0) < f(u_1) < \cdots < f(u_k) \qquad (1.19)$$

but this is a contradiction since $u_0 = u_k$ and hence $f(u_0) = f(u_k)$. ■

Lemma 1.23 corresponds to the more difficult (and useful) direction. To prove it, we need to show how given an arbitrary DAG $G$, we can come up with a layering of the vertices of $G$ so that all edges "go up".

If you have not seen the proof of this theorem before (or don't remember it), this would be an excellent point to pause and try to prove it yourself. One way to do it would be to describe an algorithm that given as input a directed acyclic graph $G$ on $n$ vertices and $n - 2$ or fewer edges, constructs an array $F$ of length $n$ such that for every edge $u \to v$ in the graph, $F[u] < F[v]$.

1.6.1 Mathematical induction
There are several ways to prove Lemma 1.23. One approach to do so is to start by proving it for small graphs, such as graphs with 1, 2 or 3 vertices (see Fig. 1.9), for which we can check all the cases, and then try to extend the proof for larger graphs. The technical term for this proof approach is proof by induction.

Induction is simply an application of the self-evident Modus Ponens rule that says that if

(a) $P$ is true, and

(b) $P$ implies $Q$,

then $Q$ is true.

Figure 1.9: Some examples of DAGs of one, two and three vertices, and valid ways to assign layers to the vertices.

In the setting of proofs by induction we typically have a statement $Q(k)$ that is parameterized by some integer $k$, and we prove that (a) $Q(0)$ is true, and (b) for every $k > 0$, if $Q(0), \ldots, Q(k - 1)$ are all true then $Q(k)$ is true. (Usually proving (b) is the hard part, though there are examples where the "base case" (a) is quite subtle.) By applying Modus Ponens, we can deduce from (a) and (b) that $Q(1)$ is true. Once we did so, since we now know that both $Q(0)$ and $Q(1)$ are true, then we can use this and (b) to deduce (again using Modus Ponens) that $Q(2)$ is true. We can repeat the same reasoning again and again to obtain that $Q(k)$ is true for every $k$. The statement (a) is called the "base case", while (b) is called the "inductive step". The assumption in (b) that $Q(i)$ holds for $i < k$ is called the "inductive hypothesis". (The form of induction described here is sometimes called "strong induction", as opposed to "weak induction" where we replace (b) by the statement (b') that if $Q(k - 1)$ is true then $Q(k)$ is true; weak induction can be thought of as the special case of strong induction where we don't use the assumption that $Q(0), \ldots, Q(k - 2)$ are true.)
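As a tiny illustration (our own example, not from the text), take $Q(n)$ to be the statement "$\sum_{i=0}^{n-1} i = n(n-1)/2$". The base case $Q(0)$ holds since both sides equal $0$. For the inductive step, assuming $Q(k-1)$ we get $\sum_{i=0}^{k-1} i = \frac{(k-1)(k-2)}{2} + (k-1) = \frac{k(k-1)}{2}$, which is exactly $Q(k)$.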

Remark 1.25 — Induction and recursion. Proofs by induction are closely related to algorithms by recursion. In both cases we reduce solving a larger problem to solving a smaller instance of itself. In a recursive algorithm to solve some problem P on an input of length $k$ we ask ourselves "what if someone handed me a way to solve P on instances smaller than $k$?". In an inductive proof to prove a statement Q parameterized by a number $k$, we ask ourselves "what if I already knew that $Q(k')$ is true for $k' < k$?". Both induction and recursion are crucial concepts for this course and Computer Science at large (and even other areas of inquiry, including not just mathematics but other sciences as well). Both can be confusing at first, but with time and practice they become clearer. For more on proofs by induction and recursion, you might find the following Stanford CS 103 handout, this MIT 6.00 lecture or this excerpt of the Lehman-Leighton book useful.

1.6.2 Proving the result by induction
There are several ways to use induction to prove Lemma 1.23. We will use induction on the number $n$ of vertices, and so we will define the statement $Q(n)$ as follows:

$Q(n)$ is "For every DAG $G = (V, E)$ with $n$ vertices, there is a layering of $G$."

The statement $Q(0)$ for $n = 0$ (where the graph contains no vertices) is trivial. Thus it will suffice to prove the following: for every $n > 0$, if $Q(n - 1)$ is true then $Q(n)$ is true. To do so, we need to somehow find a way, given a graph $G$ of $n$ vertices, to reduce the task of finding a layering for $G$ into the task of finding a layering for some other graph $G'$ of $n - 1$ vertices. The idea is that we will find a source of $G$: a vertex $v$ that has no in-neighbors. We can then assign to $v$ the layer $0$, and layer the remaining vertices using the inductive hypothesis in layers $1, 2, \ldots$.

The above is the intuition behind the proof of Lemma 1.23, but when writing the formal proof below, we use the benefit of hindsight, and try to streamline what was a messy journey into a linear and easy-to-follow flow of logic that starts with the word "Proof:" and ends with "QED" or the symbol ■. (QED stands for "quod erat demonstrandum", which is Latin for "what was to be demonstrated" or "the very thing it was required to have shown".) Discussions, examples and digressions can be very insightful, but we keep them outside the space delimited between these two words, where (as described by this excellent handout) "every sentence must be load bearing". Just like we do in programming, we can break the proof into little "subroutines" or "functions" (known as lemmas or claims in math language), which will be smaller statements that help us prove the main result. However, the proof should be structured in a way that ensures that it is always crystal-clear to the reader in what stage we are of the proof. The reader should be able to tell what is the role of every sentence in the proof and which part it belongs to. We now present the formal proof of Lemma 1.23.

Proof of Lemma 1.23. Let $G = (V, E)$ be a DAG and $n = |V|$ be the number of its vertices. We prove the lemma by induction on $n$. The base case is $n = 0$ where there are no vertices, and so the statement is trivially true. (Using $n = 0$ as the base case is logically valid, but can be confusing. If you find the trivial case $n = 0$ to be confusing, you can always directly verify the statement for $n = 1$ and then use both $n = 0$ and $n = 1$ as the base cases.) For the case $n > 0$, we make the inductive hypothesis that every DAG $G'$ of at most $n - 1$ vertices has a layering.

We make the following claim:

Claim: $G$ must contain a vertex $v$ of in-degree zero.

Proof of Claim: Suppose otherwise that every vertex $v \in V$ has an in-neighbor. Let $v_0$ be some vertex of $G$, let $v_1$ be an in-neighbor of $v_0$, $v_2$ be an in-neighbor of $v_1$, and continue in this way for $n$ steps until we construct a list $v_0, v_1, \ldots, v_n$ such that for every $i \in [n]$, $v_{i+1}$ is an in-neighbor of $v_i$, or in other words the edge $v_{i+1} \to v_i$ is present in the graph. Since there are only $n$ vertices in this graph, one of the $n + 1$ vertices in this sequence must repeat itself, and so there exists $i < j$ such that $v_i = v_j$. But then the sequence $v_j \to v_{j-1} \to \cdots \to v_i$ is a cycle in $G$, contradicting our assumption that it is acyclic. (QED Claim)

Given the claim, we can let $v_0$ be some vertex of in-degree zero in $G$, and let $G'$ be the graph obtained by removing $v_0$ from $G$. $G'$ has $n - 1$ vertices and hence per the inductive hypothesis has a layering $f' : (V \setminus \{v_0\}) \to \mathbb{N}$. We define $f : V \to \mathbb{N}$ as follows:

$$f(v) = \begin{cases} f'(v) + 1 & v \neq v_0 \\ 0 & v = v_0 \end{cases} \qquad (1.20)$$

We claim that $f$ is a valid layering, namely that for every edge $u \to v$, $f(u) < f(v)$. To prove this, we split into cases:

• Case 1: $u \neq v_0$, $v \neq v_0$. In this case the edge $u \to v$ exists in the graph $G'$ and hence by the inductive hypothesis $f'(u) < f'(v)$, which implies that $f'(u) + 1 < f'(v) + 1$.

• Case 2: $u = v_0$, $v \neq v_0$. In this case $f(u) = 0$ and $f(v) = f'(v) + 1 > 0$.

• Case 3: $u \neq v_0$, $v = v_0$. This case can't happen since $v_0$ does not have in-neighbors.

• Case 4: $u = v_0$, $v = v_0$. This case again can't happen since it means that $v_0$ is its own in-neighbor, i.e., it is involved in a self loop, which is a form of cycle that is disallowed in an acyclic graph.

Thus, $f$ is a valid layering for $G$ which completes the proof. ■
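The proof of Lemma 1.23 is constructive, and we can render it as a short recursive Python program (a sketch of our own, not from the text; the names and the edge-list representation of the graph are arbitrary choices). It repeatedly finds a source, puts it in layer $0$, and shifts the inductively obtained layering of the rest of the graph up by one, exactly as in Equation (1.20).

def layer(vertices, edges):
    """Return a layering f (vertex -> natural number) of a DAG,
    following the proof of Lemma 1.23."""
    if not vertices:
        return {}
    # The Claim guarantees a source v0 (a vertex of in-degree zero) exists.
    targets = {v for (u, v) in edges}
    v0 = next(v for v in vertices if v not in targets)
    # Inductively layer G', the graph with v0 removed.
    rest = [v for v in vertices if v != v0]
    f = layer(rest, [(u, v) for (u, v) in edges if u != v0])
    # As in Equation (1.20): shift all layers up by one, put v0 at layer 0.
    f = {v: l + 1 for (v, l) in f.items()}
    f[v0] = 0
    return f

print(layer(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]))
# {'c': 2, 'b': 1, 'a': 0}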

Reading a proof is no less of an important skill than producing one. In fact, just like understanding code, it is a highly non-trivial skill in itself. Therefore I strongly suggest that you re-read the above proof, asking yourself at every sentence whether the assumption it makes is justified, and whether this sentence truly demonstrates what it purports to achieve. Another good habit is to ask yourself when reading a proof, for every variable you encounter (such as $u$, $i$, $G'$, $f'$, etc. in the above proof), the following questions: (1) What type of variable is it? Is it a number? A graph? A vertex? A function? (2) What do we know about it? Is it an arbitrary member of the set? Have we shown some facts about it? And (3) What are we trying to show about it?

1.6.3 Minimality and uniqueness
Theorem 1.22 guarantees that for every DAG $G = (V, E)$ there exists some layering $f : V \to \mathbb{N}$ but this layering is not necessarily unique. For example, if $f : V \to \mathbb{N}$ is a valid layering of the graph then so is the function $f'$ defined as $f'(v) = 2 \cdot f(v)$. However, it turns out that

the minimal layering is unique. A minimal layering is one where every vertex is given the smallest layer number possible. We now formally define minimality and state the uniqueness theorem:

Theorem 1.26 — Minimal layering is unique. Let $G = (V, E)$ be a DAG. We say that a layering $f : V \to \mathbb{N}$ is minimal if for every vertex $v \in V$, if $v$ has no in-neighbors then $f(v) = 0$, and if $v$ has in-neighbors then there exists an in-neighbor $u$ of $v$ such that $f(u) = f(v) - 1$.

For every pair of layerings $f, g : V \to \mathbb{N}$ of $G$, if both $f$ and $g$ are minimal then $f = g$.

The definition of minimality in Theorem 1.26 implies that for every vertex $v \in V$, we cannot move it to a lower layer without making the layering invalid. If $v$ is a source (i.e., has in-degree zero) then a minimal layering $f$ must put it in layer $0$, and for every other $v$, if $f(v) = i$, then we cannot modify this to set $f(v) \le i - 1$ since there is an in-neighbor $u$ of $v$ satisfying $f(u) = i - 1$. What Theorem 1.26 says is that a minimal layering $f$ is unique in the sense that every other minimal layering is equal to $f$.

Proof Idea: The idea is to prove the theorem by induction on the layers. If $f$ and $g$ are minimal then they must agree on the source vertices, since both $f$ and $g$ should assign these vertices to layer $0$. We can then show that if $f$ and $g$ agree up to layer $i - 1$, then the minimality property implies that they need to agree in layer $i$ as well. In the actual proof we use a small trick to save on writing. Rather than proving the statement that $f = g$ (or in other words that $f(v) = g(v)$ for every $v \in V$), we prove the weaker statement that $f(v) \le g(v)$ for every $v \in V$. (This is a weaker statement since the condition that $f(v)$ is lesser or equal to $g(v)$ is implied by the condition that $f(v)$ is equal to $g(v)$.) However, since $f$ and $g$ are just labels we give to two minimal layerings, by simply exchanging the names "$f$" and "$g$" the same proof also shows that $g(v) \le f(v)$ for every $v \in V$, and hence that $f = g$.

Proof of Theorem 1.26. Let $G = (V, E)$ be a DAG and $f, g : V \to \mathbb{N}$ be two minimal valid layerings of $G$. We will prove that for every $v \in V$, $f(v) \le g(v)$. Since we didn't assume anything about $f, g$ except their minimality, the same proof will imply that for every $v \in V$, $g(v) \le f(v)$, and hence that $f(v) = g(v)$ for every $v \in V$, which is what we needed to show.

We will prove that $f(v) \le g(v)$ for every $v \in V$ by induction on $i = f(v)$. The case $i = 0$ is immediate: since $f(v) = 0$ in this case, $g(v)$ must be at least $f(v)$. For the case $i > 0$, by the minimality of $f$,

if $f(v) = i$ then there must exist some in-neighbor $u$ of $v$ such that $f(u) = i - 1$. By the induction hypothesis we get that $g(u) \ge i - 1$, and since $g$ is a valid layering it must hold that $g(v) > g(u)$, which means that $g(v) \ge i = f(v)$. ■
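The minimal layering also has a direct recursive description: a source gets layer $0$, and every other vertex gets one more than the largest layer among its in-neighbors, which guarantees an in-neighbor $u$ with $f(u) = f(v) - 1$. Here is a Python sketch of our own (not from the text; it is memoized so each vertex is computed once, and it assumes the input is a DAG):

from functools import lru_cache

def minimal_layering(vertices, edges):
    in_neighbors = {v: [u for (u, w) in edges if w == v] for v in vertices}

    @lru_cache(maxsize=None)
    def f(v):
        if not in_neighbors[v]:   # a source goes in layer 0
            return 0
        # One more than the largest in-neighbor layer, so some
        # in-neighbor u satisfies f(u) = f(v) - 1.
        return 1 + max(f(u) for u in in_neighbors[v])

    return {v: f(v) for v in vertices}

print(minimal_layering(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]))
# {'a': 0, 'b': 1, 'c': 2}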

The proof of Theorem 1.26 is fully rigorous, but is written in a somewhat terse manner. Make sure that you read through it and understand why this is indeed an airtight proof of the Theorem's statement.

1.7 THIS BOOK: NOTATION AND CONVENTIONS

Most of the notation we use in this book is standard and is used in most mathematical texts. The main points where we diverge are:

• We index the natural numbers $\mathbb{N}$ starting with $0$ (though many other texts, especially in computer science, do the same).

• We also index the set $[n]$ starting with $0$, and hence define it as $[n] = \{0, \ldots, n - 1\}$. In other texts it is often defined as $\{1, \ldots, n\}$. Similarly, we index our strings starting with $0$, and hence a string $x \in \{0,1\}^n$ is written as $x_0 x_1 \cdots x_{n-1}$.

• If $n$ is a natural number then $1^n$ does not equal the number $1$ but rather is the length-$n$ string $11 \cdots 1$ (that is, a string of $n$ ones). Similarly, $0^n$ refers to the length-$n$ string $00 \cdots 0$.

• Partial functions are functions that are not necessarily defined on all inputs. When we write $f : A \to B$ this means that $f$ is a total function unless we say otherwise. When we want to emphasize that $f$ can be a partial function, we will sometimes write $f : A \to_p B$.

• As we will see later on in the course, we will mostly describe our computational problems in terms of computing a Boolean function $f : \{0,1\}^* \to \{0,1\}$. In contrast, many other textbooks refer to the same task as deciding a language $L \subseteq \{0,1\}^*$. These two viewpoints are equivalent, since for every set $L \subseteq \{0,1\}^*$ there is a corresponding function $F$ such that $F(x) = 1$ if and only if $x \in L$. Computing partial functions corresponds to the task known in the literature as solving a promise problem. Because the language notation is so prevalent in other textbooks, we will occasionally remind the reader of this correspondence.

• We use $\lceil x \rceil$ and $\lfloor x \rfloor$ for the "ceiling" and "floor" operators that correspond to "rounding up" or "rounding down" a number to the nearest integer. We use $(x \mod y)$ to denote the "remainder" of $x$ when divided by $y$. That is, $(x \mod y) = x - y\lfloor x/y \rfloor$; for example, $(17 \mod 5) = 17 - 5\lfloor 17/5 \rfloor = 17 - 5 \cdot 3 = 2$. In contexts when an integer is expected we'll typically "silently round" the quantities to an integer. For example, if we say that $x$ is a string of length $\sqrt{n}$ then this means that $x$ is of length $\lceil \sqrt{n} \rceil$. (We round up for the sake of convention, but in most such cases, it will not make a difference whether we round up or down.)

• Like most Computer Science texts, we default to the logarithm in base two. Thus, $\log n$ is the same as $\log_2 n$.

• We will also use the notation $f(n) = poly(n)$ as a shorthand for $f(n) = n^{O(1)}$ (i.e., as shorthand for saying that there are some constants $a, b$ such that $f(n) \le a \cdot n^b$ for every sufficiently large $n$). Similarly, we will use $f(n) = polylog(n)$ as shorthand for $f(n) = poly(\log n)$ (i.e., as shorthand for saying that there are some constants $a, b$ such that $f(n) \le a \cdot (\log n)^b$ for every sufficiently large $n$).

• As is often the case in mathematical literature, we use the apostrophe character to enrich our set of identifiers. Typically if $x$ denotes some object, then $x'$, $x''$, etc. will denote other objects of the same type.

• To save on "cognitive load" we will often use round constants such as $10$, $100$, $1000$ in the statements of both theorems and problem set questions. When you see such a "round" constant, you can typically assume that it has no special significance and was just chosen arbitrarily. For example, if you see a theorem of the form "Algorithm $A$ takes at most $1000 \cdot n^2$ steps to compute function $F$ on inputs of length $n$" then probably the number $1000$ is an arbitrary sufficiently large constant, and one could prove the same theorem with a bound of the form $c \cdot n^2$ for a constant $c$ that is smaller than $1000$. Similarly, if a problem asks you to prove that some quantity is at least $n/100$, it is quite possible that in truth the quantity is at least $n/d$ for some constant $d$ that is smaller than $100$.

1.7.1 Variable name conventions
Like programming, mathematics is full of variables. Whenever you see a variable, it is always important to keep track of what is its type (e.g., whether the variable is a number, a string, a function, a graph, etc.). To make this easier, we try to stick to certain conventions and consistently use certain identifiers for variables of the same type. Some of these conventions are listed in Table 1.2 below. These conventions are not immutable laws and we might occasionally deviate from them.

Also, such conventions do not replace the need to explicitly declare for each new variable the type of object that it denotes.

Table 1.2: Conventions for identifiers in this book

Identifier | Often denotes object of type
$i, j, k, \ell, m, n$ | Natural numbers (i.e., in $\mathbb{N} = \{0, 1, 2, \ldots\}$)
$\epsilon, \delta$ | Small positive real numbers (very close to $0$)
$x, y, z, w$ | Typically strings in $\{0,1\}^*$, though sometimes numbers or other objects. We often identify an object with its representation as a string.
$G$ | A graph. The set of $G$'s vertices is typically denoted by $V$. Often $V = [n]$. The set of $G$'s edges is typically denoted by $E$.
$S$ | Set
$f, g, h$ | Functions. We often (though not always) use lowercase identifiers for finite functions, which map $\{0,1\}^n$ to $\{0,1\}^m$ (often $m = 1$).
$F, G, H$ | Infinite (unbounded input) functions mapping $\{0,1\}^*$ to $\{0,1\}$ or $\{0,1\}^*$ to $\{0,1\}^m$ for some $m$. Based on context, the identifiers $G, H$ are sometimes used to denote functions and sometimes graphs.
$A, B, C$ | Boolean circuits
$M, N$ | Turing machines
$P, Q$ | Programs
$T$ | A function mapping $\mathbb{N}$ to $\mathbb{N}$ that corresponds to a time bound.
$c$ | A positive number (often an unspecified constant; e.g., $T(n) = O(n)$ corresponds to the existence of $c$ s.t. $T(n) \le c \cdot n$ for every $n > 0$). We sometimes use $a, b$ in a similar way.
$\Sigma$ | Finite set (often used as the alphabet for a set of strings)

1.7.2 Some idioms
Mathematical texts often employ certain conventions or "idioms". Some examples of such idioms that we use in this text include the following:

• "Let $X$ be $\ldots$", "let $X$ denote $\ldots$", or "let $X = \ldots$": These are all different ways for us to say that we are defining the symbol $X$ to stand for whatever expression is in the $\ldots$. When $X$ is a property of some objects we might define $X$ by writing something along the lines of "We say that $\ldots$ has the property $X$ if $\ldots$". While we often try to define terms before they are used, sometimes a mathematical sentence reads easier if we use a term before defining it, in which case we add "where $X$ is $\ldots$" to explain how $X$ is defined in the preceding expression.

• Quantifiers: Mathematical texts involve many quantifiers such as "for all" and "exists". We sometimes spell these in words as in "for all $i \in \mathbb{N}$" or "there is $x \in \{0,1\}^*$", and sometimes use the formal symbols $\forall$ and $\exists$. It is important to keep track of which variable is quantified in what way, and of the dependencies between the variables. For example, a sentence fragment such as "for every $k > 0$ there exists $n$" means that $n$ can be chosen in a way that depends on $k$. The order of quantifiers is important. For example, the following is a true statement: "for every natural number $k > 1$ there exists a prime number $n$ such that $n$ divides $k$." In contrast, the following statement is false: "there exists a prime number $n$ such that for every natural number $k > 1$, $n$ divides $k$."

• Numbered equations, theorems, definitions: To keep track of all the terms we define and statements we prove, we often assign them a (typically numeric) label, and then refer back to them in other parts of the text.

• (i.e.,), (e.g.,): Mathematical texts tend to contain quite a few of these expressions. We use "$X$ (i.e., $Y$)" in cases where $Y$ is equivalent to $X$, and "$X$ (e.g., $Y$)" in cases where $Y$ is an example of $X$ (e.g., one can use phrases such as "a natural number (i.e., a non-negative integer)" or "a natural number (e.g., $7$)").

• "Thus", "Therefore", "We get that": This means that the following sentence is implied by the preceding one, as in "The $n$-vertex graph $G$ is connected. Therefore it contains at least $n - 1$ edges." We sometimes use "indeed" to indicate that the following text justifies the claim that was made in the preceding sentence, as in "The $n$-vertex graph $G$ has at least $n - 1$ edges. Indeed, this follows since $G$ is connected."

• Constants: In Computer Science, we typically care about how our algorithms' resource consumption (such as running time) scales with certain quantities (such as the length of the input). We refer to quantities that do not depend on the length of the input as constants and so often use statements such as "there exists a constant $c > 0$ such that for every $n \in \mathbb{N}$, Algorithm $A$ runs in at most $c \cdot n^2$ steps on inputs of length $n$." The qualifier "constant" for $c$ is not strictly needed but is added to emphasize that $c$ here is a fixed number independent of $n$. In fact sometimes, to reduce cognitive load, we will simply replace $c$ by a sufficiently large round number such as $10$, $100$, or $1000$, or use $O$-notation and write "Algorithm $A$ runs in $O(n^2)$ time."

Chapter Recap

• The basic "mathematical data structures" we'll need are numbers, sets, tuples, strings, graphs and functions.
• We can use basic objects to define more complex notions. For example, graphs can be defined as a list of pairs.
• Given precise definitions of objects, we can state unambiguous and precise statements. We can then use mathematical proofs to determine whether these statements are true or false.
• A mathematical proof is not a formal ritual but rather a clear, precise and "bulletproof" argument certifying the truth of a certain statement.
• Big-$O$ notation is an extremely useful formalism to suppress less significant details and allows us to focus on the high level behavior of quantities of interest.
• The only way to get comfortable with mathematical notions is to apply them in the contexts of solving problems. You should expect to need to go back time and again to the definitions and notation in this chapter as you work through problems in this course.

1.8 EXERCISES

Exercise 1.1 — Logical expressions. a. Write a logical expression $\varphi(x)$ involving the variables $x_0, x_1, x_2$ and the operators $\wedge$ (AND), $\vee$ (OR), and $\neg$ (NOT), such that $\varphi(x)$ is true if the majority of the inputs are True.

b. Write a logical expression $\varphi(x)$ involving the variables $x_0, x_1, x_2$ and the operators $\wedge$ (AND), $\vee$ (OR), and $\neg$ (NOT), such that $\varphi(x)$ is true if the sum $\sum_{i=0}^{2} x_i$ (identifying "true" with $1$ and "false" with $0$) is odd. ■

Exercise 1.2 — Quantifiers. Use the logical quantifiers $\forall$ (for all), $\exists$ (there exists), as well as $\wedge, \vee, \neg$ and the arithmetic operations $+, \times, =, >, <$ to write the following:

a. An expression $\varphi(n, k)$ such that for every natural numbers $n, k$, $\varphi(n, k)$ is true if and only if $k$ divides $n$.

b. An expression $\varphi(n)$ such that for every natural number $n$, $\varphi(n)$ is true if and only if $n$ is a power of three. ■

Exercise 1.3 Describe the following statement in English words: $\forall_{n \in \mathbb{N}} \exists_{p > n} \forall_{a, b \in \mathbb{N}} (a \times b \neq p) \vee (a = 1)$. ■

Exercise 1.4 — Set construction notation. Describe in words the following sets:

a. $S = \{x \in \{0,1\}^{100} : \forall_{i \in \{0,\ldots,99\}}\ x_i = x_{99-i}\}$

b. $T = \{x \in \{0,1\}^* : \forall_{i,j \in \{2,\ldots,|x|-1\}}\ i \cdot j \neq |x|\}$ ■

Exercise 1.5 — Existence of one to one mappings. For each one of the following pairs of sets $(S, T)$, prove or disprove the following statement: there is a one to one function $f$ mapping $S$ to $T$.

a. Let $n > 10$. $S = \{0,1\}^n$ and $T = [n] \times [n] \times [n]$.

b. Let $n > 10$. $S$ is the set of all functions mapping $\{0,1\}^n$ to $\{0,1\}$, and $T = \{0,1\}^{n^3}$.

c. Let $n > 100$. $S = \{k \in [n] \mid k \text{ is prime}\}$, $T = \{0,1\}^{\lceil \log n - 1 \rceil}$. ■

Exercise 1.6 — Inclusion Exclusion.

a. Let $A, B$ be finite sets. Prove that $|A \cup B| = |A| + |B| - |A \cap B|$.

b. Let $A_0, \ldots, A_{k-1}$ be finite sets. Prove that $|A_0 \cup \cdots \cup A_{k-1}| \ge \sum_{i=0}^{k-1} |A_i| - \sum_{0 \le i < j < k} |A_i \cap A_j|$.

c. Let $A_0, \ldots, A_{k-1}$ be finite subsets of $\{1, \ldots, n\}$, such that $|A_i| = m$ for every $i \in [k]$. Prove that if $k > 100n$, then there exist two distinct sets $A_i, A_j$ s.t. $|A_i \cap A_j| \ge m^2/(10n)$. ■

Exercise 1.7 Prove that if $S, T$ are finite and $F : S \to T$ is one to one then $|S| \le |T|$. ■

Exercise 1.8 Prove that if $S, T$ are finite and $F : S \to T$ is onto then $|S| \ge |T|$. ■

Exercise 1.9 Prove that for every finite $S, T$, there are $(|T| + 1)^{|S|}$ partial functions from $S$ to $T$. ■

Exercise 1.10 Suppose that $\{S_n\}_{n \in \mathbb{N}}$ is a sequence such that $S_0 \le 10$ and for $n > 1$, $S_n \le 5 S_{\lfloor n/5 \rfloor} + 2n$. Prove by induction that $S_n \le 100 n \log n$ for every $n$. ■

Exercise 1.11 Prove that for every undirected graph $G$ of $100$ vertices, if every vertex has degree at most $4$, then there exists a subset $S$ of at least $20$ vertices such that no two vertices in $S$ are neighbors of one another. ■

Exercise 1.12 — $O$-notation. For every pair of functions $F, G$ below, determine which of the following relations holds: $F = O(G)$, $F = \Omega(G)$, $F = o(G)$, or $F = \omega(G)$.

a. $F(n) = n$, $G(n) = 100n$.
b. $F(n) = n$, $G(n) = \sqrt{n}$.
c. $F(n) = n \log n$, $G(n) = 2^{(\log(n))^2}$.
d. $F(n) = \sqrt{n}$, $G(n) = 2^{\sqrt{\log n}}$.
e. $F(n) = \binom{n}{\lceil 0.2n \rceil}$, $G(n) = 2^{0.1n}$ (where $\binom{n}{k}$ is the number of $k$-sized subsets of a set of size $n$). See footnote for hint. (Hint: one way to do this is to use Stirling's approximation for the factorial function.) ■

Exercise 1.13 Give an example of a pair of functions $F, G : \mathbb{N} \to \mathbb{N}$ such that neither $F = O(G)$ nor $G = O(F)$ holds. ■

Exercise 1.14 Prove that for every undirected graph $G$ on $n$ vertices, if $G$ has at least $n$ edges then $G$ contains a cycle. ■

Exercise 1.15 Prove that for every undirected graph $G$ of $1000$ vertices, if every vertex has degree at most $4$, then there exists a subset $S$ of at least $200$ vertices such that no two vertices in $S$ are neighbors of one another. ■

1.9 BIBLIOGRAPHICAL NOTES

The heading "A Mathematician's Apology" refers to Hardy's classic book [Har41]. Even when Hardy is wrong, he is very much worth reading.

There are many online sources for the mathematical background needed for this book. In particular, the lecture notes for MIT 6.042 "Mathematics for Computer Science" [LLM18] are extremely comprehensive, and videos and assignments for this course are available online. Similarly, Berkeley CS 70: "Discrete Mathematics and Probability Theory" has extensive lecture notes online.

Other sources for discrete mathematics are Rosen [Ros19] and Jim Aspnes' online book [Asp18]. Lewis and Zax [LZ19], as well as the online book of Fleck [Fle18], give a more gentle overview of much of the same material. Solow [Sol14] is a good introduction to proof reading and writing. Kun [Kun18] gives an introduction to mathematics aimed at readers with programming background. Stanford's CS 103 course has a wonderful collection of handouts on mathematical proof techniques and discrete mathematics.

The word graph in the sense of Definition 1.3 was coined by the mathematician Sylvester in 1878 in analogy with the chemical graphs used to visualize molecules. There is an unfortunate confusion between this term and the more common usage of the word "graph" as a way to plot data, and in particular a plot of some function $f(x)$ as a function of $x$. One way to relate these two notions is to identify every function $f : A \to B$ with the directed graph $G_f$ over the vertex set $V = A \cup B$ such that $G_f$ contains the edge $x \to f(x)$ for every $x \in A$. In a graph constructed in this way, every vertex in $A$ has out-degree equal to one. If the function $f$ is one to one then every vertex in $B$ has in-degree at most one. If the function $f$ is onto then every vertex in $B$ has in-degree at least one. If $f$ is a bijection then every vertex in $B$ has in-degree exactly equal to one.

Carl Pomerance's quote is taken from the home page of Doron Zeilberger.

Learning Objectives:
• Distinguish between specification and implementation, or equivalently between mathematical functions and algorithms/programs.
• Representing an object as a string (often of zeroes and ones).
• Examples of representations for common objects such as numbers, vectors, lists, graphs.
• Prefix-free representations.
• Cantor's Theorem: The real numbers cannot be represented exactly as finite strings.

2
Computation and Representation

“The alphabet (sic) was a great invention, which enabled men (sic) to store and to learn with little effort what others had learned the hard way – that is, to learn from books rather than from direct, possibly painful, contact with the real world.”, B.F. Skinner

“The name of the song is called ‘HADDOCK’S EYES.”’ [said the Knight] “Oh, that’s the name of the song, is it?” Alice said, trying to feel interested. “No, you don’t understand,” the Knight said, looking a little vexed. “That’s what the name is CALLED. The name really is ‘THE AGED AGED MAN.”’ “Then I ought to have said ‘That’s what the SONG is called’?” Alice cor- rected herself. “No, you oughtn’t: that’s quite another thing! The SONG is called ‘WAYS AND MEANS’: but that’s only what it’s CALLED, you know!” “Well, what IS the song, then?” said Alice, who was by this time com- pletely bewildered. “I was coming to that,” the Knight said. “The song really IS ‘A-SITTING ON A GATE’: and the tune’s my own invention.” Lewis Carroll, Through the Looking-Glass

To a first approximation, computation is a process that maps an input to an output.

Figure 2.1: Our basic notion of computation is some process that maps an input to an output.

When discussing computation, it is essential to separate the question of what is the task we need to perform (i.e., the specification) from the question of how we achieve this task (i.e., the implementation). For example, as we've seen, there is more than one way to achieve the computational task of computing the product of two integers.

In this chapter we focus on the what part, namely defining computational tasks. For starters, we need to define the inputs and outputs. Capturing all the potential inputs and outputs that we might ever want to compute on seems challenging, since computation today is applied to a wide variety of objects. We do not compute merely on numbers, but also on texts, images, videos, connection graphs of social


networks, MRI scans, gene data, and even other programs. We will represent all these objects as strings of zeroes and ones, that is objects such as $0011101$ or $1011$ or any other finite list of $1$'s and $0$'s. (This choice is for convenience: there is nothing "holy" about zeroes and ones, and we could have used any other finite collection of symbols.)

Today, we are so used to the notion of digital representation that we are not surprised by the existence of such an encoding. But it is actually a deep insight with significant implications. Many animals can convey a particular fear or desire, but what is unique about humans is language: we use a finite collection of basic symbols to describe a potentially unlimited range of experiences. Language allows transmission of information over both time and space and enables societies that span a great many people and accumulate a body of shared knowledge over time.

Figure 2.2: We represent numbers, texts, images, networks and many other objects using strings of zeroes and ones. Writing the zeroes and ones themselves in green font over a black background is optional.

Over the last several decades, we have seen a revolution in what we can represent and convey in digital form. We can capture experiences with almost perfect fidelity, and disseminate it essentially instantaneously to an unlimited audience. Moreover, once information is in digital form, we can compute over it, and gain insights from data that were not accessible in prior times.

At the heart of this revolution is the simple but profound observation that we can represent an unbounded variety of objects using a finite set of symbols (and in fact using only the two symbols 0 and 1).

In later chapters, we will typically take such representations for granted, and hence use expressions such as "program $P$ takes $x$ as input" when $x$ might be a number, a vector, a graph, or any other object, when we really mean that $P$ takes as input the representation of $x$ as a binary string. However, in this chapter we will dwell a bit more on how we can construct such representations.

This chapter: A non-mathy overview
The main takeaways from this chapter are:

• We can represent all kinds of objects we want to use as inputs and outputs using binary strings. For example, we can use the binary basis to represent integers and rational numbers as binary strings (see Section 2.1.1 and Section 2.2).

• We can compose the representations of simple objects to represent more complex objects. In this way, we can represent lists of integers or rational numbers, and use that to represent objects such as matrices, images, and graphs. Prefix-free encoding is one way to achieve such a composition (see Section 2.5.2).

• A computational task specifies a map from an input to an output, i.e., a function. It is crucially important to distinguish between the "what" and the "how", or the specification and implementation (see Section 2.6.1). A function simply defines which output corresponds to which input. It does not specify how to compute the output from the input, and as we've seen in the context of multiplication, there can be more than one way to compute the same function.

• While the set of all possible binary strings is infinite, it still cannot represent everything. In particular, there is no representation of the real numbers (with absolute accuracy) as binary strings. This result is also known as "Cantor's Theorem" (see Section 2.4) and is typically referred to as the result that the "reals are uncountable". It also implies that there are different levels of infinity, though we will not get into this topic in this book (see Remark 2.10).

The two "big ideas" we discuss are Big Idea 1 - we can compose representations for simple objects to represent more complex objects - and Big Idea 2 - it is crucial to distinguish between functions ("what") and programs ("how"). The latter will be a theme we will come back to time and again in this book.

2.1 DEFINING REPRESENTATIONS

Every time we store numbers, images, sounds, databases, or other objects on a computer, what we actually store in the computer's memory is the representation of these objects. Moreover, the idea of representation is not restricted to digital computers. When we write down text or make a drawing we are representing ideas or experiences as sequences of symbols (which might as well be strings of zeroes and ones). Even our brain does not store the actual sensory inputs we experience, but rather only a representation of them.

To use objects such as numbers, images, graphs, or others as inputs for computation, we need to define precisely how to represent these objects as binary strings. A representation scheme is a way to map an object $x$ to a binary string $E(x) \in \{0,1\}^*$. For example, a representation scheme for natural numbers is a function $E : \mathbb{N} \to \{0,1\}^*$. Of course, we cannot merely represent all numbers as the string "$0011$" (for example). A minimal requirement is that if two numbers $x$ and $x'$ are different then they would be represented by different strings. Another way to say this is that we require the encoding function $E$ to be one to one.

2.1.1 Representing natural numbers
We now show how we can represent natural numbers as binary strings. Over the years people have represented numbers in a variety of ways, including Roman numerals, tally marks, our own Hindu-Arabic decimal system, and many others. We can use any one of those as well as many others to represent a number as a string (see Fig. 2.3). However, for the sake of concreteness, we use the binary basis as our default representation of natural numbers as strings. For example, we represent the number six as the string $110$ since $1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 6$, and similarly we represent the number thirty-five as the string $y = 100011$ which satisfies $\sum_{i=0}^{5} y_i \cdot 2^{|y|-i-1} = 35$. Some more examples are given in the table below.

Figure 2.3: Representing each one of the digits $0, 1, 2, \ldots, 9$ as a $12 \times 8$ bitmap image, which can be thought of as a string in $\{0,1\}^{96}$. Using this scheme we can represent a natural number $x$ of $n$ decimal digits as a string in $\{0,1\}^{96n}$. Image taken from blog post of A. C. Andersen.

Table 2.1: Representing numbers in the binary basis. The lefthand column contains representations of natural numbers in the decimal basis, while the righthand column contains representations of the same numbers in the binary basis.

Number (decimal representation) | Number (binary representation)
0 | 0
1 | 1
2 | 10
5 | 101
16 | 10000
40 | 101000
53 | 110101
389 | 110000101
3750 | 111010100110

If $n$ is even, then the least significant digit of $n$'s binary representation is $0$, while if $n$ is odd then this digit equals $1$. Just like the number $\lfloor n/10 \rfloor$ corresponds to "chopping off" the least significant decimal digit (e.g., $\lfloor 457/10 \rfloor = \lfloor 45.7 \rfloor = 45$), the number $\lfloor n/2 \rfloor$ corresponds to "chopping off" the least significant binary digit. Hence the binary representation can be formally defined as the following function $NtS : \mathbb{N} \to \{0,1\}^*$ ($NtS$ stands for "natural numbers to strings"):

$$NtS(n) = \begin{cases} 0 & n = 0 \\ 1 & n = 1 \\ NtS(\lfloor n/2 \rfloor)\, parity(n) & n > 1 \end{cases} \qquad (2.1)$$

where $parity : \mathbb{N} \to \{0,1\}$ is the function defined as $parity(n) = 0$ if $n$ is even and $parity(n) = 1$ if $n$ is odd, and as usual, for strings $x, y \in \{0,1\}^*$, $xy$ denotes the concatenation of $x$ and $y$. The function $NtS$

is defined recursively: for every $n > 0$, we define $NtS(n)$ in terms of the representation of the smaller number $\lfloor n/2 \rfloor$. It is also possible to define $NtS$ non-recursively, see Exercise 2.2.

Throughout most of this book, the particular choices of representation of numbers as binary strings would not matter much: we just need to know that such a representation exists. In fact, for many of our purposes we can even use the simpler representation of mapping a natural number $n$ to the length-$n$ all-zero string $0^n$.

Remark 2.1 — Binary representation in Python (optional). We can implement the binary representation in Python as follows:

def NtS(n):  # natural numbers to strings
    if n > 1:
        return NtS(n // 2) + str(n % 2)
    else:
        return str(n % 2)

print(NtS(236)) # 11101100

print(NtS(19)) # 10011

We can also use Python to implement the inverse transformation, mapping a string back to the natural number it represents.

def StN(x):  # string to number
    k = len(x) - 1
    return sum(int(x[i]) * (2**(k - i)) for i in range(k + 1))

print(StN(NtS(236)))
# 236

Remark 2.2 — Programming examples. In this book, we sometimes use code examples as in Remark 2.1. The point is always to emphasize that certain computations can be achieved concretely, rather than illustrating the features of Python or any other programming language. Indeed, one of the messages of this book is that all programming languages are in a certain precise sense equivalent to one another, and hence we could have just as well used JavaScript, C, COBOL, Visual Basic or even BrainF*ck. This book is not about programming, and it is absolutely OK if you are not familiar with Python or do not follow code examples such as those in Remark 2.1.

2.1.2 Meaning of representations (discussion)
It is natural for us to think of $236$ as the "actual" number, and of $11101100$ as "merely" its representation. However, for most Europeans in the middle ages CCXXXVI would be the "actual" number and $236$ (if they have even heard about it) would be the weird Hindu-Arabic positional representation. (While the Babylonians already invented a positional system much earlier, the decimal positional system we use today was invented by Indian mathematicians around the third century. Arab mathematicians took it up in the 8th century. It first received significant attention in Europe with the publication of the 1202 book "Liber Abaci" by Leonardo of Pisa, also known as Fibonacci, but it did not displace Roman numerals in common usage until the 15th century.) When our AI robot overlords materialize, they will probably think of $11101100$ as the "actual" number and of $236$ as "merely" a representation that they need to use when they give commands to humans.

So what is the "actual" number? This is a question that philosophers of mathematics have pondered over throughout history. Plato argued that mathematical objects exist in some ideal sphere of existence (that to a certain extent is more "real" than the world we perceive via our senses, as this latter world is merely the shadow of this ideal sphere). In Plato's vision, the symbols $236$ are merely notation for some ideal object, that, in homage to the late musician, we can refer to as "the number commonly represented by $236$". The Austrian philosopher Ludwig Wittgenstein, on the other hand, argued that mathematical objects do not exist at all, and the only things that exist are the actual marks on paper that make up $236$, $11101100$ or CCXXXVI. In Wittgenstein's view, mathematics is merely about formal manipulation of symbols that do not have any inherent meaning. You can think of the "actual" number as (somewhat recursively) "that thing which is common to $236$, $11101100$ and CCXXXVI and all other past and future representations that are meant to capture the same object".

While reading this book, you are free to choose your own philosophy of mathematics, as long as you maintain the distinction between the mathematical objects themselves and the various particular choices of representing them, whether as splotches of ink, pixels on a screen, zeroes and ones, or any other form.

2.2 REPRESENTATIONS BEYOND NATURAL NUMBERS

We have seen that natural numbers can be represented as binary strings. We now show that the same is true for other types of objects, including (potentially negative) integers, rational numbers, vectors, lists, graphs and many others. In many instances, choosing the "right" string representation for a piece of data is highly nontrivial, and finding the "best" one (e.g., most compact, best fidelity, most efficiently manipulable, robust to errors, most informative features, etc.) is the

object of intense research. But for now, we focus on presenting some simple representations for various objects that we would like to use as inputs and outputs for computation.

2.2.1 Representing (potentially negative) integers
Since we can represent natural numbers as strings, we can represent the full set of integers (i.e., members of the set $\mathbb{Z} = \{\ldots, -3, -2, -1, 0, +1, +2, +3, \ldots\}$) by adding one more bit that represents the sign. To represent a (potentially negative) number $m$, we prepend to the representation of the natural number $|m|$ a bit $\sigma$ that equals $0$ if $m \ge 0$ and equals $1$ if $m < 0$. Formally, we define the function $ZtS : \mathbb{Z} \to \{0,1\}^*$ as follows:

$$ZtS(m) = \begin{cases} 0\, NtS(m) & m \ge 0 \\ 1\, NtS(-m) & m < 0 \end{cases} \tag{2.2}$$

where $NtS$ is defined as in (2.1).

While the encoding function of a representation needs to be one-to-one, it does not have to be onto. For example, in the representation above there is no number that is represented by the empty string, but it is still a fine representation, since every integer is represented uniquely by some string.
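This encoding is easy to implement; the following is a minimal Python sketch (the helper names ZtS and StZ are ours, not the book's), reusing the NtS and StN functions from Remark 2.1:

def ZtS(m): # integer to string: a sign bit, then |m| in binary
    return ("0" if m >= 0 else "1") + NtS(abs(m))

def StZ(x): # string to integer
    return StN(x[1:]) if x[0] == "0" else -StN(x[1:])

print(ZtS(-5))       # 1101
print(StZ(ZtS(-5)))  # -5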

R Remark 2.3 — Interpretation and context. Given a string $y \in \{0,1\}^*$, how do we know if it's "supposed" to represent a (nonnegative) natural number or a (potentially negative) integer? For that matter, even if we know $y$ is "supposed" to be an integer, how do we know what representation scheme it uses? The short answer is that we do not necessarily know this information, unless it is supplied from the context. (In programming languages, the compiler or interpreter determines the representation of the sequence of bits corresponding to a variable based on the variable's type.) We can treat the same string $y$ as representing a natural number, an integer, a piece of text, an image, or a green gremlin. Whenever we say a sentence such as "let $n$ be the number represented by the string $y$," we will assume that we are fixing some canonical representation scheme such as the ones above. The choice of the particular representation scheme will rarely matter, except that we want to make sure to stick with the same one for consistency.

2.2.2 Two's complement representation (optional)
Section 2.2.1's approach of representing an integer using a specific "sign bit" is known as the Signed Magnitude Representation and was used in some early computers. However, the two's complement representation is much more common in practice. The two's complement representation of an integer $k$ in the set $\{-2^n, -2^n + 1, \ldots, 2^n - 1\}$ is the string $ZtS_n(k)$ of length $n+1$ defined as follows:

$$ZtS_n(k) = \begin{cases} NtS_{n+1}(k) & 0 \le k \le 2^n - 1 \\ NtS_{n+1}(2^{n+1} + k) & -2^n \le k \le -1 \end{cases} \tag{2.3}$$

where $NtS_\ell(m)$ denotes the standard binary representation of a number $m \in \{0, \ldots, 2^\ell\}$ as a string of length $\ell$, padded with leading zeros as needed. For example, if $n = 3$ then $ZtS_3(1) = NtS_4(1) = 0001$, $ZtS_3(2) = NtS_4(2) = 0010$, $ZtS_3(-1) = NtS_4(16-1) = 1111$, and $ZtS_3(-8) = NtS_4(16-8) = 1000$. If $k$ is a negative number larger than or equal to $-2^n$, then $2^{n+1} + k$ is a number between $2^n$ and $2^{n+1} - 1$. Hence the two's complement representation of such a number $k$ is a string of length $n+1$ with its first digit equal to $1$.

Another way to say this is that we represent a potentially negative number $k \in \{-2^n, \ldots, 2^n - 1\}$ as the non-negative number $k \bmod 2^{n+1}$ (see also Fig. 2.4). This means that if two (potentially negative) numbers $k$ and $k'$ are not too large (i.e., $|k| + |k'| < 2^{n+1}$), then we can compute the representation of $k + k'$ by adding modulo $2^{n+1}$ the representations of $k$ and $k'$ as if they were non-negative integers. This property of the two's complement representation is its main attraction since, depending on their architectures, microprocessors can often perform arithmetic operations modulo $2^w$ very efficiently (for certain values of $w$ such as $32$ and $64$). Many systems leave it to the programmer to check that values are not too large and will carry out this modular arithmetic regardless of the size of the numbers involved. For this reason, in some systems adding two large positive numbers can result in a negative number (e.g., adding $2^n - 100$ and $2^n - 200$ might result in $-300$, since $(2^{n+1} - 300) \bmod 2^{n+1}$ is precisely the two's complement representation of $-300$; see also Fig. 2.4).

Figure 2.4: In the two's complement representation we represent a potentially negative integer $k \in \{-2^n, \ldots, 2^n - 1\}$ as an $n+1$ length string using the binary representation of the integer $k \bmod 2^{n+1}$. On the lefthand side: this representation for $n = 3$ (the red integers are the numbers being represented by the blue binary strings). If a microprocessor does not check for overflows, adding the two positive numbers $6$ and $5$ might result in the negative number $-5$ (since $-5 \bmod 16 = 11$). The righthand side is a C program that will on some 32 bit architecture print a negative number after adding two positive numbers. (Integer overflow in C is considered undefined behavior which means the result of this program, including whether it runs or crashes, could differ depending on the architecture, compiler, and even compiler options and version.)
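A minimal Python sketch of this encoding (the function names ZtS2 and StZ2 are ours):

def ZtS2(k, n): # two's complement: k in {-2**n, ..., 2**n - 1} as n+1 bits
    assert -2**n <= k <= 2**n - 1
    return format(k % 2**(n+1), "0" + str(n+1) + "b")  # k mod 2^(n+1), zero-padded

def StZ2(x): # interpret a length-(n+1) string as a two's complement number
    n = len(x) - 1
    m = int(x, 2)
    return m if m <= 2**n - 1 else m - 2**(n+1)

print(ZtS2(-1, 3))        # 1111
print(ZtS2(-8, 3))        # 1000
print(StZ2(ZtS2(-5, 3)))  # -5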

2.2.3 Rational numbers, and representing pairs of strings
We can represent a rational number of the form $a/b$ by representing the two numbers $a$ and $b$. However, merely concatenating the representations of $a$ and $b$ will not work. For example, the binary representation of $4$ is $100$ and the binary representation of $43$ is $101011$, but the concatenation $100101011$ of these strings is also the concatenation of the representation $10010$ of $18$ and the representation $1011$ of $11$. Hence, if we used such simple concatenation then we would not be able to tell if the string $100101011$ is supposed to represent $4/43$ or $18/11$.

We tackle this by giving a general representation for pairs of strings. If we were using a pen and paper, we would just use a separator symbol such as ‖ to represent, for example, the pair consisting of the numbers represented by $01$ and $110001$ as the length-$9$ string "$01‖110001$". In other words, there is a one-to-one map $F$ from pairs of strings $x, y \in \{0,1\}^*$ into a single string $z$ over the alphabet $\Sigma = \{0, 1, ‖\}$ (in other words, $z \in \Sigma^*$). Using such separators is similar to the way we use spaces and punctuation to separate words in English. By adding a little redundancy, we achieve the same effect in the digital domain. We can map the three-element set $\Sigma$ to the three-element set $\{00, 11, 01\} \subset \{0,1\}^2$ in a one-to-one fashion, and hence encode a length $n$ string $z \in \Sigma^*$ as a length $2n$ string $w \in \{0,1\}^{2n}$.

Our final representation for rational numbers is obtained by composing the following steps:

1. Representing a non-negative rational number as a pair of natural numbers.

2. Representing a natural number by a string via the binary represen- tation.

3. Combining 1 and 2 to obtain a representation of a rational number as a pair of strings.

4. Representing a pair of strings over $\{0,1\}$ as a single string over $\Sigma = \{0, 1, ‖\}$.

5. Representing a string over $\Sigma$ as a longer string over $\{0,1\}$.

■ Example 2.4 — Representing a rational number as a string. Consider the rational number $r = -5/8$. We represent $-5$ as $1101$ and $+8$ as $01000$, and so we can represent $r$ as the pair of strings $(1101, 01000)$ and represent this pair as the length-$10$ string $1101‖01000$ over the alphabet $\{0, 1, ‖\}$. Now, applying the map $0 \mapsto 00$, $1 \mapsto 11$, $‖ \mapsto 01$, we can represent the latter string as the length-$20$ string $s = 11110011010011000000$ over the alphabet $\{0,1\}$.

The same idea can be used to represent triples of strings, quadruples, and so on as a string. Indeed, this is one instance of a very general principle that we use time and again in both the theory and practice of computer science (for example, in Object Oriented programming):

 Big Idea 1 If we can represent objects of type $T$ as strings, then we can represent tuples of objects of type $T$ as strings as well.

Repeating the same idea, once we can represent objects of type $T$, we can also represent lists of lists of such objects, and even lists of lists of lists and so on and so forth. We will come back to this point when we discuss prefix-free encoding in Section 2.5.2.
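As an illustration, here is a short Python sketch (the helper name pair_encode is ours) of the encoding used in Example 2.4:

def pair_encode(x, y):
    # join the two strings with the separator, then map each of the
    # three symbols to a fixed two-bit string
    table = {"0": "00", "1": "11", "|": "01"}
    return "".join(table[c] for c in x + "|" + y)

print(pair_encode("1101", "01000"))  # 11110011010011000000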

2.3 REPRESENTING REAL NUMBERS

The set $\mathbb{R}$ of real numbers contains all numbers including positive, negative, and fractional, as well as irrational numbers such as $\pi$ or $e$. Every real number can be approximated by a rational number, and thus we can represent every real number $x$ by a rational number $a/b$ that is very close to $x$. For example, we can represent $\pi$ by $22/7$ within an error of about $10^{-3}$. If we want a smaller error (e.g., about $10^{-4}$) then we can use $311/99$, and so on and so forth.

Figure 2.5: The floating point representation of a real number $x \in \mathbb{R}$ is its approximation as a number of the form $\sigma b \cdot 2^e$ where $\sigma \in \{\pm 1\}$, $e$ is a (potentially negative) integer, and $b$ is a rational number between $1$ and $2$ expressed as a binary fraction $1.b_1 b_2 \ldots b_k$ for some $b_1, \ldots, b_k \in \{0,1\}$ (that is $b = 1 + b_1/2 + b_2/4 + \ldots + b_k/2^k$). Commonly-used floating point representations fix the numbers $\ell$ and $k$ of bits to represent $e$ and $b$ respectively. In the example above, assuming we use two's complement representation for $e$, the number represented is $-1 \times 2^5 \times (1 + 1/2 + 1/4 + 1/64 + 1/512) = -56.5625$.

The above representation of real numbers via rational numbers that approximate them is a fine choice for a representation scheme. However, typically in computing applications, it is more common to use the floating point representation scheme (see Fig. 2.5) to represent real numbers. In the floating point representation scheme we represent $x \in \mathbb{R}$ by the pair $(b, e)$ of (positive or negative) integers of some prescribed sizes (determined by the desired accuracy) such that $b \times 2^e$ is closest to $x$. Floating point representation is the base-two version of scientific notation, where one represents a number $y \in \mathbb{R}$ as its approximation of the form $b \times 10^e$ for $b, e$. It is called "floating point" because we can think of the number $b$ as specifying a sequence of binary digits, and $e$ as describing the location of the "binary point" within this sequence. The use of floating point representation is the reason why in many programming systems, printing the expression 0.1+0.2 will result in 0.30000000000000004 and not 0.3, see here, here and here for more.

The reader might be (rightly) worried about the fact that the floating point representation (or the rational number one) can only approximately represent real numbers. In many (though not all) computational applications, one can make the accuracy tight enough so that this does not affect the final result, though sometimes we do need to be careful. Indeed, floating-point bugs can sometimes be no joking matter. For example, floating point rounding errors have been implicated in the failure of a U.S. Patriot missile to intercept an Iraqi Scud missile, costing 28 lives, as well as a 100 million pound error in computing payouts to British pensioners.

Figure 2.6: XKCD cartoon on floating-point arithmetic.

2.4 CANTOR’S THEOREM, COUNTABLE SETS, AND STRING REP- RESENTATIONS OF THE REAL NUMBERS “For any collection of fruits, we can make more fruit salads than there are fruits. If not, we could label each salad with a different fruit, and consider the salad of all fruits not in their salad. The label of this salad is in it if and only if it is not.”, Martha Storey.

Given the issues with floating point approximations for real num- bers, a natural question is whether it is possible to represent real num- bers exactly as strings. Unfortunately, the following theorem shows that this cannot be done:

Theorem 2.5 — Cantor's Theorem. There does not exist a one-to-one function $RtS : \mathbb{R} \to \{0,1\}^*$. ($RtS$ stands for "real numbers to strings".)

Countable sets. We say that a set $S$ is countable if there is an onto map $C : \mathbb{N} \to S$, or in other words, we can write $S$ as the sequence $C(0), C(1), C(2), \ldots$. Since the binary representation yields an onto map from $\{0,1\}^*$ to $\mathbb{N}$, and the composition of two onto maps is onto, a set $S$ is countable iff there is an onto map from $\{0,1\}^*$ to $S$. Using the basic properties of functions (see Section 1.4.3), a set $S$ is countable if and only if there is a one-to-one function from $S$ to $\{0,1\}^*$. Hence, we can rephrase Theorem 2.5 as follows:

Theorem 2.6 — Cantor's Theorem (equivalent statement). The reals are uncountable. That is, there does not exist an onto function $NtR : \mathbb{N} \to \mathbb{R}$.

Theorem 2.6 was proven by Georg Cantor in 1874. This result (and the theory around it) was quite shocking to mathematicians at the time. By showing that there is no one-to-one map from $\mathbb{R}$ to $\{0,1\}^*$ (or $\mathbb{N}$), Cantor showed that these two infinite sets have "different forms of infinity" and that the set of real numbers $\mathbb{R}$ is in some sense "bigger" than the infinite set $\{0,1\}^*$. The notion that there are "shades of infinity" was deeply disturbing to mathematicians and philosophers at the time. The philosopher Ludwig Wittgenstein (whom we mentioned before) called Cantor's results "utter nonsense" and "laughable." Others thought they were even worse than that. Leopold Kronecker called Cantor a "corrupter of youth," while Henri Poincaré said that Cantor's ideas "should be banished from mathematics once and for all." The tide eventually turned, and these days Cantor's work is universally accepted as the cornerstone of set theory and the foundations of mathematics. As David Hilbert said in 1925, "No one shall expel us from the paradise which Cantor has created for us." As we will see later in this

book, Cantor’s ideas also play a huge role in the theory of computa- tion. Now that we have discussed Theorem 2.5’s importance, let us see the proof. It is achieved in two steps:

1. Define some infinite set $\mathcal{X}$ for which it is easier for us to prove that $\mathcal{X}$ is not countable (namely, it's easier for us to prove that there is no one-to-one function from $\mathcal{X}$ to $\{0,1\}^*$).

2. Prove that there is a one-to-one function $G$ mapping $\mathcal{X}$ to $\mathbb{R}$.

We can use a proof by contradiction to show that these two facts together imply Theorem 2.5. Specifically, if we assume (towards the sake of contradiction) that there exists some one-to-one function $F$ mapping $\mathbb{R}$ to $\{0,1\}^*$, then the function $x \mapsto F(G(x))$ obtained by composing $F$ with the function $G$ from Step 2 above would be a one-to-one function from $\mathcal{X}$ to $\{0,1\}^*$, which contradicts what we proved in Step 1!

To turn this idea into a full proof of Theorem 2.5 we need to:

• Define the set $\mathcal{X}$.

• Prove that there is no one-to-one function from $\mathcal{X}$ to $\{0,1\}^*$.

• Prove that there is a one-to-one function from $\mathcal{X}$ to $\mathbb{R}$.

We now proceed to do precisely that. That is, we will define the set $\{0,1\}^\infty$, which will play the role of $\mathcal{X}$, and then state and prove two lemmas that show that this set satisfies our two desired properties.

Definition 2.7 We denote by $\{0,1\}^\infty$ the set $\{f \mid f : \mathbb{N} \to \{0,1\}\}$.

That is, $\{0,1\}^\infty$ is a set of functions, and a function $f$ is in $\{0,1\}^\infty$ iff its domain is $\mathbb{N}$ and its codomain is $\{0,1\}$. We can also think of $\{0,1\}^\infty$ as the set of all infinite sequences of bits, since a function $f : \mathbb{N} \to \{0,1\}$ can be identified with the sequence $(f(0), f(1), f(2), \ldots)$. The following two lemmas show that $\{0,1\}^\infty$ can play the role of $\mathcal{X}$ to establish Theorem 2.5.

Lemma 2.8 There does not exist a one-to-one map $FtS : \{0,1\}^\infty \to \{0,1\}^*$. ($FtS$ stands for "functions to strings".)

Lemma 2.9 There does exist a one-to-one map $FtR : \{0,1\}^\infty \to \mathbb{R}$. ($FtR$ stands for "functions to reals".)

As we've seen above, Lemma 2.8 and Lemma 2.9 together imply Theorem 2.5. To repeat the argument more formally, suppose, for the sake of contradiction, that there did exist a one-to-one function $RtS : \mathbb{R} \to \{0,1\}^*$. By Lemma 2.9, there exists a one-to-one function $FtR : \{0,1\}^\infty \to \mathbb{R}$. Thus, under this assumption, since the composition of two one-to-one functions is one-to-one (see Exercise 2.12), the

function $FtS : \{0,1\}^\infty \to \{0,1\}^*$ defined as $FtS(f) = RtS(FtR(f))$ will be one-to-one, contradicting Lemma 2.8. See Fig. 2.7 for a graphical illustration of this argument.

Figure 2.7: We prove Theorem 2.5 by combining Lemma 2.8 and Lemma 2.9. Lemma 2.9, which uses standard calculus tools, shows the existence of a one-to-one map $FtR$ from the set $\{0,1\}^\infty$ to the real numbers. So, if a hypothetical one-to-one map $RtS : \mathbb{R} \to \{0,1\}^*$ existed, then we could compose them to get a one-to-one map $FtS : \{0,1\}^\infty \to \{0,1\}^*$. Yet this contradicts Lemma 2.8, the heart of the proof, which rules out the existence of such a map.

Now all that is left is to prove these two lemmas. We start by prov- ing Lemma 2.8 which is really the heart of Theorem 2.5.

Figure 2.8: We construct a function $d$ such that $d \ne StF(x)$ for every $x \in \{0,1\}^*$ by ensuring that $d(n(x)) \ne StF(x)(n(x))$ for every $x \in \{0,1\}^*$, where $n(x)$ denotes the position of $x$ in lexicographic order. We can think of this as building a table where the columns correspond to numbers $m \in \mathbb{N}$ and the rows correspond to $x \in \{0,1\}^*$ (sorted according to $n(x)$). If the entry in the $x$-th row and the $m$-th column corresponds to $g(m)$ where $g = StF(x)$, then $d$ is obtained by going over the "diagonal" elements in this table (the entries corresponding to the $x$-th row and $n(x)$-th column) and ensuring that $d(n(x)) \ne StF(x)(n(x))$.

Warm-up: "Baby Cantor". The proof of Lemma 2.8 is rather subtle. One way to get intuition for it is to consider the following finite statement: "there is no onto function $f : \{0, \ldots, 99\} \to \{0,1\}^{100}$". Of course we know it's true since the set $\{0,1\}^{100}$ is bigger than the set $[100]$, but let's see a direct proof. For every $f : \{0, \ldots, 99\} \to \{0,1\}^{100}$, we can define the string $d \in \{0,1\}^{100}$ as follows: $d = (1 - f(0)_0, 1 - f(1)_1, \ldots, 1 - f(99)_{99})$. If $f$ was onto, then there would exist some $n \in [100]$ such that $f(n) = d$, but we claim that no such $n$ exists.

Indeed, if there was such $n$, then the $n$-th coordinate of $d$ would equal $f(n)_n$, but by definition this coordinate equals $1 - f(n)_n$. See also a "proof by code" of this statement.

Proof of Lemma 2.8. We will prove that there does not exist an onto function $StF : \{0,1\}^* \to \{0,1\}^\infty$. This implies the lemma since for every two sets $A$ and $B$, there exists an onto function from $A$ to $B$ if and only if there exists a one-to-one function from $B$ to $A$ (see Lemma 1.2).

The technique of this proof is known as the "diagonal argument" and is illustrated in Fig. 2.8. We assume, towards a contradiction, that there exists such a function $StF : \{0,1\}^* \to \{0,1\}^\infty$. We will show that $StF$ is not onto by demonstrating a function $d \in \{0,1\}^\infty$ such that $d \ne StF(x)$ for every $x \in \{0,1\}^*$. Consider the lexicographic ordering of binary strings (i.e., "", $0$, $1$, $00$, $01$, $\ldots$). For every $n \in \mathbb{N}$, we let $x_n$ be the $n$-th string in this order. That is $x_0 =$ "", $x_1 = 0$, $x_2 = 1$, and so on and so forth. We define the function $d \in \{0,1\}^\infty$ as follows:

$$d(n) = 1 - StF(x_n)(n) \tag{2.4}$$

for every $n \in \mathbb{N}$. That is, to compute $d$ on input $n$, we first compute $g = StF(x_n)$, where $x_n \in \{0,1\}^*$ is the $n$-th string in the lexicographical ordering. Since $g \in \{0,1\}^\infty$, it is a function mapping $\mathbb{N}$ to $\{0,1\}$. The value $d(n)$ is defined to be the negation of $g(n)$.

The definition of the function $d$ is a bit subtle. One way to think about it is to imagine the function $StF$ as being specified by an infinitely long table, in which every row corresponds to a string $x \in \{0,1\}^*$ (with strings sorted in lexicographic order), and contains the sequence $StF(x)(0), StF(x)(1), StF(x)(2), \ldots$. The diagonal elements in this table are the values

"" (2.5)

which correspond to the elements $StF(x_n)(n)$ in the $n$-th row and $n$-th column of this table for $n = 0, 1, 2, \ldots$. The function $d$ we defined above maps every $n \in \mathbb{N}$ to the negation of the $n$-th diagonal value.

To complete the proof that $StF$ is not onto we need to show that $d \ne StF(x)$ for every $x \in \{0,1\}^*$. Indeed, let $x \in \{0,1\}^*$ be some string and let $g = StF(x)$. If $n$ is the position of $x$ in the lexicographical order then by construction $d(n) = 1 - g(n) \ne g(n)$, which means that $g \ne d$, which is what we wanted to prove. ■
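The "Baby Cantor" statement above can indeed be verified by a short "proof by code"; the following Python sketch (our own illustration) constructs the diagonal string d for a randomly chosen f and checks that it escapes the image of f:

import random

# f maps {0,...,99} to 100-bit strings (here: lists of bits)
f = {n: [random.randint(0, 1) for _ in range(100)] for n in range(100)}

# flip the n-th bit of f(n) to build the diagonal string d
d = [1 - f[n][n] for n in range(100)]

# d differs from every f(n) in coordinate n, so f cannot be onto
print(all(f[n] != d for n in range(100)))  # True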

R Remark 2.10 — Generalizing beyond strings and reals. Lemma 2.8 doesn't really have much to do with the natural numbers or the strings. An examination of the proof shows that it really shows that for every set $S$, there is no one-to-one map $F : \{0,1\}^S \to S$, where $\{0,1\}^S$ denotes the set $\{f \mid f : S \to \{0,1\}\}$ of all Boolean functions with domain $S$. Since we can identify a subset $V \subseteq S$ with its characteristic function $f = 1_V$ (i.e., $1_V(x) = 1$ iff $x \in V$), we can think of $\{0,1\}^S$ also as the set of all subsets of $S$. This set is sometimes called the power set of $S$ and denoted by $\mathcal{P}(S)$ or $2^S$.

The proof of Lemma 2.8 can be generalized to show that there is no one-to-one map between a set and its power set. In particular, it means that the set $\{0,1\}^\mathbb{R}$ is "even bigger" than $\mathbb{R}$. Cantor used these ideas to construct an infinite hierarchy of shades of infinity. The number of such shades turns out to be much larger than $|\mathbb{N}|$ or even $|\mathbb{R}|$. He denoted the cardinality of $\mathbb{N}$ by $\aleph_0$ and denoted the next largest infinite number by $\aleph_1$. ($\aleph$ is the first letter in the Hebrew alphabet.) Cantor also made the continuum hypothesis that $|\mathbb{R}| = \aleph_1$. We will come back to the fascinating story of this hypothesis later on in this book. This lecture of Aaronson mentions some of these issues (see also this Berkeley CS 70 lecture).

To complete the proof of Theorem 2.5, we need to show Lemma 2.9. This requires some calculus background but is otherwise straightfor- ward. If you have not had much experience with limits of a real series before, then the formal proof below might be a little hard to follow. This part is not the core of Cantor’s argument, nor are such limits important to the remainder of this book, so you can feel free to take Lemma 2.9 on faith and skip the proof.

Proof Idea: We define $FtR(f)$ to be the number between $0$ and $2$ whose decimal expansion is $f(0).f(1)f(2)\ldots$, or in other words $FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i}$. If $f$ and $g$ are two distinct functions in $\{0,1\}^\infty$, then there must be some input in which they disagree. If we take the minimum such input $k$, then the numbers $f(0).f(1)f(2)\ldots f(k-1)f(k)\ldots$ and $g(0).g(1)g(2)\ldots g(k)\ldots$ agree with each other all the way up to the $k-1$-th digit after the decimal point, and disagree on the $k$-th digit. But then these numbers must be distinct. Concretely, if $f(k) = 1$ and $g(k) = 0$ then the first number is larger than the second, and otherwise ($f(k) = 0$ and $g(k) = 1$) the first number is smaller than the second. In the proof we have to be a little careful since these are numbers with infinite expansions. For example, the number one half has two decimal expansions $0.5$ and $0.49999\cdots$. However, this issue does not come up here, since we restrict attention only to numbers with decimal expansions that do not involve the digit $9$.

Proof of Lemma 2.9. For every $f \in \{0,1\}^\infty$, we define $FtR(f)$ to be the number whose decimal expansion is $f(0).f(1)f(2)f(3)\ldots$. Formally,

$$FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i} \tag{2.6}$$

It is a known result in calculus (whose proof we will not repeat here) that the series on the righthand side of (2.6) converges to a definite limit in $\mathbb{R}$.

We now prove that $FtR$ is one to one. Let $f, g$ be two distinct functions in $\{0,1\}^\infty$. Since $f$ and $g$ are distinct, there must be some input on which they differ, and we define $k$ to be the smallest such input and assume without loss of generality that $f(k) = 0$ and $g(k) = 1$. (Otherwise, if $f(k) = 1$ and $g(k) = 0$, then we can simply switch the roles of $f$ and $g$.) The numbers $FtR(f)$ and $FtR(g)$ agree with each other up to the $k-1$-th digit after the decimal point. Since this digit equals $0$ for $FtR(f)$ and equals $1$ for $FtR(g)$, we claim that $FtR(g)$ is bigger than $FtR(f)$ by at least $0.5 \cdot 10^{-k}$. To see this, note that the difference $FtR(g) - FtR(f)$ will be minimized if $g(\ell) = 0$ for every $\ell > k$ and $f(\ell) = 1$ for every $\ell > k$, in which case (since $f$ and $g$ agree up to the $k-1$-th digit)

$$FtR(g) - FtR(f) = 10^{-k} - 10^{-k-1} - 10^{-k-2} - 10^{-k-3} - \cdots \tag{2.7}$$

Since the infinite series $\sum_{j=0}^{\infty} 10^{-j}$ converges to $10/9$, it follows that for every such $f$ and $g$, $FtR(g) - FtR(f) \ge 10^{-k} - 10^{-k-1} \cdot (10/9) > 0$. In particular we see that for every distinct $f, g \in \{0,1\}^\infty$, $FtR(f) \ne FtR(g)$, implying that the function $FtR$ is one to one. ■

R Remark 2.11 — Using decimal expansion (optional). In the proof above we used the fact that $1 + 1/10 + 1/100 + \cdots$ converges to $10/9$, which plugging into (2.7) yields that the difference between $FtR(g)$ and $FtR(f)$ is at least $10^{-k} - 10^{-k-1} \cdot (10/9) > 0$. While the choice of the decimal representation for $FtR$ was arbitrary, we could not have used the binary representation in its place. Had we used the binary expansion instead of decimal, the corresponding sequence $1 + 1/2 + 1/4 + \cdots$ converges to $2$, and since $2^{-k} = 2 \cdot 2^{-k-1}$, we could not have deduced that $FtR$ is one to one. Indeed there do exist pairs of distinct sequences $f, g \in \{0,1\}^\infty$ such that $\sum_{i=0}^{\infty} f(i) 2^{-i} = \sum_{i=0}^{\infty} g(i) 2^{-i}$. (For example, the sequence $1, 0, 0, 0, \ldots$ and the sequence $0, 1, 1, 1, \ldots$ have this property.)

2.4.1 Corollary: Boolean functions are uncountable
Cantor's Theorem yields the following corollary that we will use several times in this book: the set of all Boolean functions (mapping $\{0,1\}^*$ to $\{0,1\}$) is not countable:

Theorem 2.12 — Boolean functions are uncountable. Let ALL be the set of all functions $F : \{0,1\}^* \to \{0,1\}$. Then ALL is uncountable. Equivalently, there does not exist an onto map $StALL : \{0,1\}^* \to$ ALL.

Proof Idea: This is a direct consequence of Lemma 2.8, since we can use the binary representation to show a one-to-one map from $\{0,1\}^\infty$ to ALL. Hence the uncountability of $\{0,1\}^\infty$ implies the uncountability of ALL.

Proof of Theorem 2.12. Since $\{0,1\}^\infty$ is uncountable, the result will follow by showing a one-to-one map from $\{0,1\}^\infty$ to ALL. The reason is that the existence of such a map implies that if ALL was countable, and hence there was a one-to-one map from ALL to $\mathbb{N}$, then there would have been a one-to-one map from $\{0,1\}^\infty$ to $\mathbb{N}$, contradicting Lemma 2.8.

We now show this one-to-one map. We simply map a function $f \in \{0,1\}^\infty$ to the function $F : \{0,1\}^* \to \{0,1\}$ as follows. We let $F(0) = f(0)$, $F(1) = f(1)$, $F(10) = f(2)$, $F(11) = f(3)$, and so on and so forth. That is, for every $x \in \{0,1\}^*$ that represents a natural number $n$ in the binary basis, we define $F(x) = f(n)$. If $x$ does not represent such a number (e.g., it has a leading zero), then we set $F(x) = 0$.

This map is one-to-one since if $f \ne g$ are two distinct elements in $\{0,1\}^\infty$, then there must be some input $n \in \mathbb{N}$ on which $f(n) \ne g(n)$. But then if $x \in \{0,1\}^*$ is the string representing $n$, we see that $F(x) \ne G(x)$ where $F$ is the function in ALL that $f$ is mapped to, and $G$ is the function that $g$ is mapped to. ■
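The map in this proof can be sketched in Python (the helper name seq_to_func is ours; we represent an element of $\{0,1\}^\infty$ as a Python function on the naturals):

def seq_to_func(f):
    def F(x):
        # x represents a natural number n if it is a binary string
        # with no leading zeros (except for the string "0" itself)
        if x and (x == "0" or x[0] != "0") and all(c in "01" for c in x):
            return f(int(x, 2))
        return 0  # strings that represent no number map to 0
    return F

f = lambda n: n % 2        # the infinite sequence 0,1,0,1,...
F = seq_to_func(f)
print(F("11"), F("10"), F("011"))  # 1 0 0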

2.4.2 Equivalent conditions for countability
The results above establish many equivalent ways to phrase the fact that a set $S$ is countable. Specifically, the following statements are all equivalent:

1. The set $S$ is countable.

2. There exists an onto map from $\mathbb{N}$ to $S$.

3. There exists an onto map from $\{0,1\}^*$ to $S$.

4. There exists a one-to-one map from $S$ to $\mathbb{N}$.

5. There exists a one-to-one map from $S$ to $\{0,1\}^*$.

6. There exists an onto map from some countable set $T$ to $S$.

7. There exists a one-to-one map from $S$ to some countable set $T$.

P Make sure you know how to prove the equivalence of all the results above.

2.5 REPRESENTING OBJECTS BEYOND NUMBERS

Numbers are of course by no means the only objects that we can represent as binary strings. A representation scheme for representing objects from some set $\mathcal{O}$ consists of an encoding function that maps an object in $\mathcal{O}$ to a string, and a decoding function that decodes a string back to an object in $\mathcal{O}$. Formally, we make the following definition:

Definition 2.13 — String representation. Let $\mathcal{O}$ be any set. A representation scheme for $\mathcal{O}$ is a pair of functions $E, D$ where $E : \mathcal{O} \to \{0,1\}^*$ is a total one-to-one function, $D : \{0,1\}^* \to \mathcal{O}$ is a (possibly partial) function, and such that $E$ and $D$ satisfy that $D(E(o)) = o$ for every $o \in \mathcal{O}$. $E$ is known as the encoding function and $D$ is known as the decoding function.

Note that the condition $D(E(o)) = o$ for every $o \in \mathcal{O}$ implies that $D$ is onto (can you see why?). It turns out that to construct a representation scheme we only need to find an encoding function. That is, every one-to-one encoding function has a corresponding decoding function, as shown in the following lemma:

Lemma 2.14 Suppose that $E : \mathcal{O} \to \{0,1\}^*$ is one-to-one. Then there exists a function $D : \{0,1\}^* \to \mathcal{O}$ such that $D(E(o)) = o$ for every $o \in \mathcal{O}$.

Proof. Let $o_0$ be some arbitrary element of $\mathcal{O}$. For every $x \in \{0,1\}^*$, there exists either zero or a single $o \in \mathcal{O}$ such that $E(o) = x$ (otherwise $E$ would not be one-to-one). We will define $D(x)$ to equal $o_0$ in the first case and this single object $o$ in the second case. By definition $D(E(o)) = o$ for every $o \in \mathcal{O}$. ■
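For a finite (or enumerable) collection of objects, this proof translates directly into code; here is a minimal sketch (the helper name make_decoder is ours) that builds a decoder by inverting the encoder's input/output table:

def make_decoder(encode, objects, default):
    table = {encode(o): o for o in objects}  # well-defined since E is one-to-one
    # D maps E(o) back to o, and any other string to the arbitrary default o0
    return lambda x: table.get(x, default)

D = make_decoder(NtS, range(100), 0)  # NtS from Remark 2.1
print(D(NtS(77)))  # 77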

R Remark 2.15 — Total decoding functions. While the decoding function of a representation scheme can in general be a partial function, the proof of Lemma 2.14 implies that every representation scheme has a total decoding function. This observation can sometimes be useful.

2.5.1 Finite representations
If $\mathcal{O}$ is finite, then we can represent every object in $\mathcal{O}$ as a string of length at most some number $n$. What is the value of $n$? Let us denote by $\{0,1\}^{\le n}$ the set $\{x \in \{0,1\}^* : |x| \le n\}$ of strings of length at most $n$. The size of $\{0,1\}^{\le n}$ is equal to

$$|\{0,1\}^0| + |\{0,1\}^1| + |\{0,1\}^2| + \cdots + |\{0,1\}^n| = \sum_{i=0}^{n} 2^i = 2^{n+1} - 1 \tag{2.8}$$

using the standard formula for summing a geometric progression. To obtain a representation of objects in $\mathcal{O}$ as strings of length at most $n$ we need to come up with a one-to-one function from $\mathcal{O}$ to $\{0,1\}^{\le n}$. We can do so, if and only if $|\mathcal{O}| \le 2^{n+1} - 1$, as is implied by the following lemma:

Lemma 2.16 For every two finite sets $S, T$, there exists a one-to-one $E : S \to T$ if and only if $|S| \le |T|$.

Proof. Let $k = |S|$ and $m = |T|$ and so write the elements of $S$ and $T$ as $S = \{s_0, s_1, \ldots, s_{k-1}\}$ and $T = \{t_0, t_1, \ldots, t_{m-1}\}$. We need to show that there is a one-to-one function $E : S \to T$ iff $k \le m$. For the "if" direction, if $k \le m$ we can simply define $E(s_i) = t_i$ for every $i \in [k]$. Clearly for $i \ne j$, $t_i = E(s_i) \ne E(s_j) = t_j$, and hence this function is one-to-one. In the other direction, suppose that $k > m$ and $E : S \to T$ is some function. Then $E$ cannot be one-to-one. Indeed, for $i = 0, 1, \ldots, m-1$ let us "mark" the element $t_j = E(s_i)$ in $T$. If $t_j$ was marked before, then we have found two objects in $S$ mapping to the same element $t_j$. Otherwise, since $T$ has $m$ elements, when we get to $i = m-1$ we mark all the objects in $T$. Hence, in this case, $E(s_m)$ must map to an element that was already marked before. (This observation is sometimes known as the "Pigeon Hole Principle": the principle that if you have a pigeon coop with $m$ holes, and $k > m$ pigeons, then there must be two pigeons in the same hole.) ■

2.5.2 Prefix-free encoding
When showing a representation scheme for rational numbers, we used the "hack" of encoding the alphabet $\{0, 1, ‖\}$ to represent tuples of strings as a single string. This is a special case of the general paradigm of prefix-free encoding. The idea is the following: if our representation has the property that no string $x$ representing an object $o$ is a prefix (i.e., an initial substring) of a string $y$ representing a different object $o'$, then we can represent a list of objects by merely concatenating the representations of all the list members. For example, because in English every sentence ends with a punctuation mark such as a period, exclamation, or question mark, no sentence can be a prefix of another and so we can represent a list of sentences by merely concatenating the sentences one after the other. (English has some complications such as periods used for abbreviations (e.g., "e.g.") or sentence quotes containing punctuation, but the high level point of a prefix-free representation for sentences still holds.)

It turns out that we can transform every representation to a prefix-free form. This justifies Big Idea 1, and allows us to transform a representation scheme for objects of a type $T$ to a representation scheme of lists of objects of the type $T$. By repeating the same technique, we can also represent lists of lists of objects of type $T$, and so on and so forth. But first let us formally define prefix-freeness:

Definition 2.17 — Prefix free encoding. For two strings $y, y'$, we say that $y$ is a prefix of $y'$ if $|y| \le |y'|$ and for every $i < |y|$, $y'_i = y_i$. Let $\mathcal{O}$ be a non-empty set and $E : \mathcal{O} \to \{0,1\}^*$ be a function. We say that $E$ is prefix-free if $E(o)$ is non-empty for every $o \in \mathcal{O}$ and there does not exist a distinct pair of objects $o, o' \in \mathcal{O}$ such that $E(o)$ is a prefix of $E(o')$.

Recall that for every set $\mathcal{O}$, the set $\mathcal{O}^*$ consists of all finite length tuples (i.e., lists) of elements in $\mathcal{O}$. The following theorem shows that if $E$ is a prefix-free encoding of $\mathcal{O}$ then by concatenating encodings we can obtain a valid (i.e., one-to-one) representation of $\mathcal{O}^*$:

Theorem 2.18 — Prefix-free implies tuple encoding. Suppose that $E : \mathcal{O} \to \{0,1\}^*$ is prefix-free. Then the following map $\overline{E} : \mathcal{O}^* \to \{0,1\}^*$ is one to one: for every $o_0, \ldots, o_{k-1} \in \mathcal{O}^*$, we define

$$\overline{E}(o_0, \ldots, o_{k-1}) = E(o_0) E(o_1) \cdots E(o_{k-1}). \tag{2.9}$$

P Theorem 2.18 is an example of a theorem that is a little hard to parse, but in fact is fairly straightforward to prove once you understand what it means. Therefore, I highly recommend that you pause here to make sure you understand the statement of this theorem. You should also try to prove it on your own before proceeding further.

Proof Idea: The idea behind the proof is simple. Suppose that for example we want to decode a triple $(o_0, o_1, o_2)$ from its representation $x = \overline{E}(o_0, o_1, o_2) = E(o_0)E(o_1)E(o_2)$. We will do so by first finding the first prefix $x_0$ of $x$ that is a representation of some object. Then we will decode this object, remove $x_0$ from $x$ to obtain a new string $x'$, and continue onwards to find the first prefix $x_1$ of $x'$ and so on and so forth (see Exercise 2.9). The prefix-freeness property of $E$ will ensure that $x_0$ will in fact be $E(o_0)$, $x_1$ will be $E(o_1)$, etc.

Figure 2.9: If we have a prefix-free representation of each object then we can concatenate the representations of $k$ objects to obtain a representation for the tuple $(o_0, \ldots, o_{k-1})$.

Proof of Theorem 2.18. We now show the formal proof. Suppose, towards the sake of contradiction, that there exist two distinct tuples $(o_0, \ldots, o_{k-1})$ and $(o'_0, \ldots, o'_{k'-1})$ such that

$$\overline{E}(o_0, \ldots, o_{k-1}) = \overline{E}(o'_0, \ldots, o'_{k'-1}). \tag{2.10}$$

We will denote the string $\overline{E}(o_0, \ldots, o_{k-1})$ by $x$.

Let $i$ be the first index such that $o_i \ne o'_i$. (If $o_i = o'_i$ for all $i$ then, since we assume the two tuples are distinct, one of them must be larger than the other. In this case we assume without loss of generality that $k' > k$ and let $i = k$.) In the case that $i < k$, we see that the string $x$ can be written in two different ways:

$$x = \overline{E}(o_0, \ldots, o_{k-1}) = x_0 \cdots x_{i-1} E(o_i) E(o_{i+1}) \cdots E(o_{k-1}) \tag{2.11}$$

and

$$x = \overline{E}(o'_0, \ldots, o'_{k'-1}) = x_0 \cdots x_{i-1} E(o'_i) E(o'_{i+1}) \cdots E(o'_{k'-1}) \tag{2.12}$$

where $x_j = E(o_j) = E(o'_j)$ for all $j < i$. Let $y$ be the string obtained after removing the prefix $x_0 \cdots x_{i-1}$ from $x$. We see that $y$ can be written as both $y = E(o_i)s$ for some string $s \in \{0,1\}^*$ and as $y = E(o'_i)s'$ for some $s' \in \{0,1\}^*$. But this means that one of $E(o_i)$ and $E(o'_i)$ must be a prefix of the other, contradicting the prefix-freeness of $E$.

In the case that $i = k$ and $k' > k$, we get a contradiction in the following way. In this case

$$x = E(o_0) \cdots E(o_{k-1}) = E(o'_0) \cdots E(o'_{k-1}) E(o'_k) \cdots E(o'_{k'-1}) \tag{2.13}$$

which means that $E(o'_k) \cdots E(o'_{k'-1})$ must correspond to the empty string "". But in such a case $E(o'_k)$ must be the empty string, which in particular is the prefix of any other string, contradicting the prefix-freeness of $E$. ■

R Remark 2.19 — Prefix freeness of list representation. Even if the representation $E$ of objects in $\mathcal{O}$ is prefix free, this does not mean that our representation $\overline{E}$ of lists of such objects will be prefix free as well. In fact, it won't be: for every three objects $o, o', o''$ the representation of the list $(o, o')$ will be a prefix of the representation of the list $(o, o', o'')$. However, as we see in Lemma 2.20 below, we can transform every representation into prefix-free form, and so will be able to use that transformation if needed to represent lists of lists, lists of lists of lists, and so on and so forth.

2.5.3 Making representations prefix-free
Some natural representations are prefix-free. For example, every fixed output length representation (i.e., one-to-one function $E : \mathcal{O} \to \{0,1\}^n$) is automatically prefix-free, since a string $x$ can only be a prefix of an equal-length $x'$ if $x$ and $x'$ are identical. Moreover, the approach we used for representing rational numbers can be used to show the following:

Lemma 2.20 Let $E : \mathcal{O} \to \{0,1\}^*$ be a one-to-one function. Then there is a one-to-one prefix-free encoding $\overline{E}$ such that $|\overline{E}(o)| \le 2|E(o)| + 2$ for every $o \in \mathcal{O}$.

P For the sake of completeness, we will include the proof below, but it is a good idea for you to pause here and try to prove it on your own, using the same technique we used for representing rational numbers.

Proof of Lemma 2.20. The idea behind the proof is to use the map $0 \mapsto 00$, $1 \mapsto 11$ to "double" every bit in the string $x$ and then mark the end of the string by concatenating to it the pair $01$. If we encode a string $x$ in this way, it ensures that the encoding of $x$ is never a prefix of the encoding of a distinct string $x'$. Formally, we define the function $PF : \{0,1\}^* \to \{0,1\}^*$ as follows:

$$PF(x) = x_0 x_0 x_1 x_1 \cdots x_{n-1} x_{n-1} 01 \tag{2.14}$$

for every $x \in \{0,1\}^n$. If $E : \mathcal{O} \to \{0,1\}^*$ is the (potentially not prefix-free) representation for $\mathcal{O}$, then we transform it into a prefix-free representation $\overline{E} : \mathcal{O} \to \{0,1\}^*$ by defining $\overline{E}(o) = PF(E(o))$.

To prove the lemma we need to show that (1) $\overline{E}$ is one-to-one and (2) $\overline{E}$ is prefix-free. In fact, prefix freeness is a stronger condition than one-to-one-ness (if two strings are equal then in particular one of them is a prefix of the other) and hence it suffices to prove (2), which we now do.

Let $o \ne o'$ in $\mathcal{O}$ be two distinct objects. We will prove that $\overline{E}(o)$ is not a prefix of $\overline{E}(o')$. Define $x = E(o)$ and $x' = E(o')$. Since $E$ is one-to-one, $x \ne x'$. If $|x| < |x'|$ then the two bits in positions $2|x|, 2|x|+1$ in $PF(x)$ have the value $01$, but the corresponding bits in $PF(x')$ will equal either $00$ or $11$ (depending on the $|x|$-th bit of $x'$), and hence $PF(x)$ cannot be a prefix of $PF(x')$. If $|x| = |x'|$ then, since $x \ne x'$, there must be a coordinate $i$ in which they differ, meaning that the strings $PF(x)$ and $PF(x')$ differ in the coordinates $2i, 2i+1$, which again means that $PF(x)$ cannot be a prefix of $PF(x')$. If $|x| > |x'|$ then $|PF(x)| = 2|x| + 2 > |PF(x')| = 2|x'| + 2$ and hence $PF(x)$ is longer than (and cannot be a prefix of) $PF(x')$. In all cases we see that $PF(x) = \overline{E}(o)$ is not a prefix of $PF(x') = \overline{E}(o')$, hence completing the proof. ■

The proof of Lemma 2.20 is not the only or even the best way to transform an arbitrary representation into prefix-free form. Exercise 2.10 asks you to construct a more efficient prefix-free transformation satisfying $|\overline{E}(o)| \le |E(o)| + O(\log |E(o)|)$.

2.5.4 "Proof by Python" (optional)
The proofs of Theorem 2.18 and Lemma 2.20 are constructive in the sense that they give us:

• A way to transform the encoding and decoding functions of any representation of an object $O$ to encoding and decoding functions that are prefix-free, and

• A way to extend prefix-free encoding and decoding of single objects to encoding and decoding of lists of objects by concatenation.

Specifically, we could transform any pair of Python functions en- code and decode to functions pfencode and pfdecode that correspond to a prefix-free encoding and decoding. Similarly, given pfencode and pfdecode for single objects, we can extend them to encoding of lists. Let us show how this works for the case of the NtS and StN functions we defined above. We start with the “Python proof” of Lemma 2.20: a way to trans- form an arbitrary representation into one that is prefix free. The func- tion prefixfree below takes as input a pair of encoding and decoding functions, and returns a triple of functions containing prefix-free encod- ing and decoding functions, as well as a function that checks whether a string is a valid encoding of an object.

# takes functions encode and decode mapping
# objects to binary strings and vice versa,
# and returns functions pfencode and pfdecode that
# map objects to binary strings and vice versa
# in a prefix-free way.
# Also returns a function pfvalid that says
# whether a string is a valid encoding
def prefixfree(encode, decode):
    def pfencode(o):
        L = encode(o)
        # double every bit and append "01" as an end marker
        return "".join([L[i//2] for i in range(2*len(L))]) + "01"
    def pfdecode(L):
        # drop the "01" marker and take every other bit
        return decode("".join([L[j] for j in range(0, len(L)-2, 2)]))
    def pfvalid(L):
        return (len(L) % 2 == 0) and \
               all(L[2*i] == L[2*i+1] for i in range((len(L)-2)//2)) and \
               L[-2:] == "01"
    return pfencode, pfdecode, pfvalid

pfNtS, pfStN, pfvalidN = prefixfree(NtS, StN)

NtS(234)             # 11101010
pfNtS(234)           # 111111001100110001
pfStN(pfNtS(234))    # 234
pfvalidN(pfNtS(234)) # True

P Note that the Python function prefixfree above takes two Python functions as input and outputs three Python functions as output. (When it’s not too awkward, we use the term “Python function” or “subroutine” to distinguish between such snippets of Python programs and mathematical functions.) You don’t have to know Python in this course, but you do need to get comfortable with the idea of functions as mathematical objects in their own right, that can be used as inputs and outputs of other functions.

We now show a "Python proof" of Theorem 2.18. Namely, we show a function represlists that takes as input a prefix-free representation scheme (implemented via encoding, decoding, and validity testing functions) and outputs a representation scheme for lists of such objects. If we want to make this representation prefix-free then we could fit it into the function prefixfree above.

def represlists(pfencode, pfdecode, pfvalid):
    """
    Takes functions pfencode, pfdecode and pfvalid,
    and returns functions encodelist, decodelist that can
    encode and decode lists of the objects respectively.
    """

    def encodelist(L):
        """Gets a list of objects, encodes it as a binary string"""
        return "".join([pfencode(obj) for obj in L])

    def decodelist(S):
        """Gets a binary string, returns a list of objects"""
        i = 0; j = 1; res = []
        while j <= len(S):
            if pfvalid(S[i:j]):
                res += [pfdecode(S[i:j])]
                i = j
            j += 1
        return res

    return encodelist, decodelist

LtS, StL = represlists(pfNtS, pfStN, pfvalidN)

LtS([234,12,5])      # 111111001100110001111100000111001101
StL(LtS([234,12,5])) # [234, 12, 5]

2.5.5 Representing letters and text
We can represent a letter or symbol by a string, and then if this representation is prefix-free, we can represent a sequence of symbols by merely concatenating the representation of each symbol. One such representation is the ASCII representation, which represents $128$ letters and symbols as strings of $7$ bits. Since the ASCII representation is fixed-length, it is automatically prefix-free (can you see why?). Unicode is a representation of (at the time of this writing) about 128,000 symbols as numbers (known as code points) between $0$ and $1,114,111$. There are several types of prefix-free representations of the code points, a popular one being UTF-8 that encodes every codepoint into a string of length between $8$ and $32$.

■ Example 2.21 — The Braille representation. The Braille system is another way to encode letters and other symbols as binary strings. Specifically, in Braille, every letter is encoded as a string in $\{0,1\}^6$, which is written using indented dots arranged in two columns and three rows, see Fig. 2.10. (Some symbols require more than one six-bit string to encode, and so Braille uses a more general prefix-free encoding.) The Braille system was invented in 1821 by Louis Braille when he was just 12 years old (though he continued working on it and improving it throughout his life). Braille was a French boy who lost his eyesight at the age of 5 as the result of an accident.

Figure 2.10: The word "Binary" in "Grade 1" or "uncontracted" Unified English Braille. This word is encoded using seven symbols since the first one is a modifier indicating that the first letter is capitalized.

■ Example 2.22 — Representing objects in C (optional). We can use programming languages to probe how our computing environment represents various values. This is easiest to do in "unsafe" programming languages such as C that allow direct access to the memory.

Using a simple C program we have produced the following representations of various values. One can see that for integers, multiplying by 2 corresponds to a "left shift" inside each byte. In contrast, for floating point numbers, multiplying by two corresponds to adding one to the exponent part of the representation. In the architecture we used, a negative number is represented using the two's complement approach. C represents strings in a prefix-free form by ensuring that a zero byte is at their end.

int       2 : 00000010 00000000 00000000 00000000
int       4 : 00000100 00000000 00000000 00000000
int     513 : 00000001 00000010 00000000 00000000
long    513 : 00000001 00000010 00000000 00000000 00000000 00000000 00000000 00000000
int      -1 : 11111111 11111111 11111111 11111111
int      -2 : 11111110 11111111 11111111 11111111
string Hello: 01001000 01100101 01101100 01101100 01101111 00000000
string abcd : 01100001 01100010 01100011 01100100 00000000
float  33.0 : 00000000 00000000 00000100 01000010
float  66.0 : 00000000 00000000 10000100 01000010
float 132.0 : 00000000 00000000 00000100 01000011
double 132.0: 00000000 00000000 00000000 00000000 00000000 10000000 01100000 01000000

2.5.6 Representing vectors, matrices, images
Once we can represent numbers and lists of numbers, then we can also represent vectors (which are just lists of numbers). Similarly, we can represent lists of lists, and thus, in particular, can represent matrices. To represent an image, we can represent the color at each pixel by a list of three numbers corresponding to the intensity of Red, Green and Blue. (We can restrict to three primary colors since most humans only have three types of cones in their retinas; we would have needed 16 primary colors to represent colors visible to the Mantis Shrimp.) Thus an image of $n$ pixels would be represented by a list of $n$ such length-three lists. A video can be represented as a list of images. Of course these representations are rather wasteful and much more compact representations are typically used for images and videos, though this will not be our concern in this book.

2.5.7 Representing graphs
A graph $G$ on $n$ vertices can be represented as an $n \times n$ adjacency matrix $A$ whose $(i,j)^{th}$ entry is equal to $1$ if the edge $(i,j)$ is present and is equal to $0$ otherwise. That is, we can represent an $n$ vertex directed graph $G = (V, E)$ as a string $A \in \{0,1\}^{n^2}$ such that $A_{i,j} = 1$ iff the edge $\overrightarrow{i\,j} \in E$. We can transform an undirected graph to a directed graph by replacing every edge $\{i, j\}$ with both edges $\overrightarrow{i\,j}$ and $\overleftarrow{i\,j}$.

Another representation for graphs is the adjacency list representation. That is, we identify the vertex set $V$ of a graph with the set $[n]$ where $n = |V|$, and represent the graph $G = (V, E)$ as a list of $n$ lists, where the $i$-th list consists of the out-neighbors of vertex $i$. The difference between these representations can be significant for some applications, though for us it would typically be immaterial.

Figure 2.11: Representing the graph $G = (\{0,1,2,3,4\}, \{(1,0),(4,0),(1,4),(4,1),(2,1),(3,2),(4,3)\})$ in the adjacency matrix and adjacency list representations.
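A small Python sketch (our own) showing both representations for the graph of Fig. 2.11:

edges = [(1,0), (4,0), (1,4), (4,1), (2,1), (3,2), (4,3)]
n = 5

# adjacency matrix: A[i][j] = 1 iff the edge from i to j is present
A = [[0]*n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1

# adjacency list: the i-th list holds the out-neighbors of vertex i
adj_list = [[j for j in range(n) if A[i][j]] for i in range(n)]
print(adj_list)  # [[], [0, 4], [1], [2], [0, 1, 3]]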

2.5.8 Representing lists and nested lists
If we have a way of representing objects from a set $\mathcal{O}$ as binary strings, then we can represent lists of these objects by applying a prefix-free transformation. Moreover, we can use a trick similar to the above to handle nested lists. The idea is that if we have some representation $E : \mathcal{O} \to \{0,1\}^*$, then we can represent nested lists of items from $\mathcal{O}$ using strings over the five element alphabet $\Sigma = \{$ 0, 1, [, ], , $\}$. For example, if $o_1$ is represented by 0011, $o_2$ is represented by 10011, and $o_3$ is represented by 00111, then we can represent the nested list $(o_1, (o_2, o_3))$ as the string "[0011,[10011,00111]]" over the alphabet $\Sigma$. By encoding every element of $\Sigma$ itself as a three-bit string, we can transform any representation for objects $\mathcal{O}$ into a representation that enables representing (potentially nested) lists of these objects.
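Here is a minimal Python sketch of this idea (the helpers nested_to_str and to_bits, and the particular three-bit code, are ours), reusing NtS from Remark 2.1:

SYM = {"0": "000", "1": "001", "[": "010", "]": "011", ",": "100"}

def nested_to_str(L, E):
    # render a (possibly nested) list over the alphabet {0,1,[,],,}
    if isinstance(L, list):
        return "[" + ",".join(nested_to_str(o, E) for o in L) + "]"
    return E(L)

def to_bits(s):
    # encode each of the five symbols as a fixed three-bit string
    return "".join(SYM[c] for c in s)

print(nested_to_str([1, [2, 3]], NtS))           # [1,[10,11]]
print(to_bits(nested_to_str([1, [2, 3]], NtS)))  # 010001100010...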

2.5.9 Notation
We will typically identify an object with its representation as a string. For example, if $F : \{0,1\}^* \to \{0,1\}^*$ is some function that maps strings to strings and $n$ is an integer, we might make statements such as "$F(n) + 1$ is prime" to mean that if we represent $n$ as a string $x$, then the integer $m$ represented by the string $F(x)$ satisfies that $m + 1$ is prime. (You can see how this convention of identifying objects with their representation can save us a lot of cumbersome formalism.) Similarly, if $x, y$ are some objects and $F$ is a function that takes strings as inputs, then by $F(x, y)$ we will mean the result of applying $F$ to the representation of the ordered pair $(x, y)$. We use the same notation to invoke functions on $k$-tuples of objects for every $k$.

This convention of identifying an object with its representation as a string is one that we humans follow all the time. For example, when people say a statement such as "$17$ is a prime number", what they really mean is that the integer whose decimal representation is the string "17", is prime.

When we say

"$A$ is an algorithm that computes the multiplication function on natural numbers."

what we really mean is that $A$ is an algorithm that computes the function $F : \{0,1\}^* \to \{0,1\}^*$ such that for every pair $a, b \in \mathbb{N}$, if $x \in \{0,1\}^*$ is a string representing the pair $(a, b)$, then $F(x)$ will be a string representing their product $a \cdot b$.

2.6 DEFINING COMPUTATIONAL TASKS AS MATHEMATICAL FUNCTIONS

Abstractly, a computational process is some process that takes an input which is a string of bits and produces an output which is a string of bits. This transformation of input to output can be done using a modern computer, a person following instructions, the evolution of some natural system, or any other means. In future chapters, we will turn to mathematically defining com- putational processes, but, as we discussed above, at the moment we focus on computational tasks. That is, we focus on the specification and not the implementation. Again, at an abstract level, a computational task can specify any relation that the output needs to have with the in- put. However, for most of this book, we will focus on the simplest and most common task of computing a function. Here are some examples:

• Given (a representation of) two integers $x, y$, compute the product $x \times y$. Using our representation above, this corresponds to computing a function from $\{0,1\}^*$ to $\{0,1\}^*$. We have seen that there is more than one way to solve this computational task, and in fact, we still do not know the best algorithm for this problem.

• Given (a representation of) an integer $z > 1$, compute its factorization; i.e., the list of primes $p_1 \le \cdots \le p_k$ such that $z = p_1 \cdots p_k$. This again corresponds to computing a function from $\{0,1\}^*$ to $\{0,1\}^*$. The gaps in our knowledge of the complexity of this problem are even larger.

• Given (a representation of) a graph $G$ and two vertices $s$ and $t$, compute the length of the shortest path in $G$ between $s$ and $t$, or do the same for the longest path (with no repeated vertices) between $s$ and $t$. Both these tasks correspond to computing a function from $\{0,1\}^*$ to $\{0,1\}^*$, though it turns out that there is a vast difference in their computational difficulty.

• Given the code of a Python program, determine whether there is an input that would force it into an infinite loop. This task corresponds to computing a partial function from $\{0,1\}^*$ to $\{0,1\}$ since not every string corresponds to a syntactically valid Python program. We will see that we do understand the computational status of this problem, but the answer is quite surprising.

• Given (a representation of) an image $I$, decide if $I$ is a photo of a cat or a dog. This corresponds to computing some (partial) function from $\{0,1\}^*$ to $\{0,1\}$.

R Remark 2.23 — Boolean functions and languages. An important special case of computational tasks corresponds to computing Boolean functions, whose output is a single bit in $\{0,1\}$. Computing such functions corresponds to answering a YES/NO question, and hence this task is also known as a decision problem. Given any function $F : \{0,1\}^* \to \{0,1\}$ and $x \in \{0,1\}^*$, the task of computing $F(x)$ corresponds to the task of deciding whether or not $x \in L$, where $L = \{x : F(x) = 1\}$ is known as the language that corresponds to the function $F$. (The language terminology is due to historical connections between the theory of computation and formal linguistics as developed by Noam Chomsky.) Hence many texts refer to such a computational task as deciding a language.

For every particular function $F$, there can be several possible algorithms to compute $F$. We will be interested in questions such as:

• For a given function $F$, can it be the case that there is no algorithm to compute $F$?

• If there is an algorithm, what is the best one? Could it be that $F$ is "effectively uncomputable" in the sense that every algorithm for computing $F$ requires a prohibitively large amount of resources?

• If we cannot answer this question, can we show equivalence between different functions $F$ and $F'$ in the sense that either they are both easy (i.e., have fast algorithms) or they are both hard?

• Can a function being hard to compute ever be a good thing? Can we use it for applications in areas such as cryptography?

In order to do that, we will need to mathematically define the notion of an algorithm, which is what we will do in Chapter 3.

Figure 2.12: A subset $L \subseteq \{0,1\}^*$ can be identified with the function $F \colon \{0,1\}^* \to \{0,1\}$ such that $F(x) = 1$ if $x \in L$ and $F(x) = 0$ if $x \notin L$. Functions with a single bit of output are called Boolean functions, while subsets of strings are called languages. The above shows that the two are essentially the same object, and we can identify the task of deciding membership in $L$ (known as deciding a language in the literature) with the task of computing the function $F$.

2.6.1 Distinguish functions from programs!
You should always watch out for potential confusions between specifications and implementations, or equivalently between mathematical functions and algorithms/programs. It does not help that programming languages (Python included) use the term "functions" to denote (parts of) programs. This confusion also stems from thousands of years of mathematical history, where people typically defined functions by means of a way to compute them.

For example, consider the multiplication function on natural numbers. This is the function $\mathrm{MULT} \colon \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ that maps a pair $(x,y)$ of natural numbers to the number $x \cdot y$. As we mentioned, it can be implemented in more than one way:

def mult1(x,y):
    res = 0
    while y>0:
        res += x
        y -= 1
    return res

def mult2(x,y):
    a = str(x) # represent x as string in decimal notation
    b = str(y) # represent y as string in decimal notation
    res = 0
    for i in range(len(a)):
        for j in range(len(b)):
            res += int(a[len(a)-i-1])*int(b[len(b)-j-1])*(10**(i+j))
    return res

print(mult1(12,7)) # 84
print(mult2(12,7)) # 84

Both mult1 and mult2 produce the same output given the same pair of natural number inputs. (Though mult1 will take far longer to do so when the numbers become large.) Hence, even though these are two different programs, they compute the same mathematical function. This distinction between a program or algorithm $A$, and the function $F$ that $A$ computes will be absolutely crucial for us in this course (see also Fig. 2.13).

Big Idea 2: A function is not the same as a program. A program computes a function.

Distinguishing functions from programs (or other ways for computing, including circuits and machines) is a crucial theme for this course. For this reason, this is often a running theme in questions that I (and many other instructors) assign in homework and exams (hint, hint).

Figure 2.13: A function is a mapping of inputs to outputs. A program is a set of instructions on how to obtain an output given an input. A program computes a function, but it is not the same as a function, popular programming language terminology notwithstanding.

R Remark 2.24 — Computation beyond functions (advanced, optional). Functions capture quite a lot of computational tasks, but one can consider more general settings as well. For starters, we can and will talk about partial functions, that are not defined on all inputs. When computing a partial function $F$, we only need to worry about the inputs on which the function is defined. Another way to say it is that we can design an algorithm for a partial function $F$ under the assumption that someone "promised" us that all inputs $x$ would be such that $F(x)$ is defined (as otherwise, we do not care about the result). Hence such tasks are also known as promise problems.

Another generalization is to consider relations that may have more than one possible admissible output. For example, consider the task of finding any solution for a given set of equations. A relation $R$ maps a string $x \in \{0,1\}^*$ into a set of strings $R(x)$ (for example, $x$ might describe a set of equations, in which case $R(x)$ would correspond to the set of all solutions to $x$). We can also identify a relation $R$ with the set of pairs of strings $(x,y)$ where $y \in R(x)$. A computational process solves a relation if for every $x \in \{0,1\}^*$, it outputs some string $y \in R(x)$.

Later in this book, we will consider even more general tasks, including interactive tasks, such as finding a good strategy in a game, tasks defined using probabilistic notions, and others. However, for much of this book, we will focus on the task of computing a function, and often even a Boolean function, that has only a single bit of output. It turns out that a great deal of the theory of computation can be studied in the context of this task, and the insights learned are applicable in the more general settings.
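As a concrete illustration of solving a relation rather than a function, consider the task of finding some nontrivial factor of a composite number: any valid factor is an admissible output, so different correct algorithms may return different answers. The code below is my own hedged sketch, not part of the book:

def solve_factor_relation(z: int) -> int:
    # For composite z, return some y in R(z) = { y : 1 < y < z and y divides z }.
    # Any element of R(z) is an acceptable answer.
    for y in range(2, z):
        if z % y == 0:
            return y
    raise ValueError("z is prime, so R(z) is empty")

print(solve_factor_relation(15))  # prints 3, though 5 would also be admissible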

Chapter Recap

• We can represent objects we want to compute on using binary strings.
• A representation scheme for a set of objects $\mathcal{O}$ is a one-to-one map from $\mathcal{O}$ to $\{0,1\}^*$.
• We can use prefix-free encoding to "boost" a representation for a set $\mathcal{O}$ into representations of lists of elements in $\mathcal{O}$.
• A basic computational task is the task of computing a function $F \colon \{0,1\}^* \to \{0,1\}^*$. This task encompasses not just arithmetical computations such as multiplication, factoring, etc. but a great many other tasks arising in areas as diverse as scientific computing, machine learning, image processing, and many many more.
• We will study the question of finding (or at least giving bounds on) what is the best algorithm for computing $F$ for various interesting functions $F$.


2.7 EXERCISES

Exercise 2.1 Which one of these objects can be represented by a binary string?
a. An integer $x$
b. An undirected graph $G$.
c. A directed graph $H$
d. All of the above.

Exercise 2.2 — Binary representation. a. Prove that the function $NtS \colon \mathbb{N} \to \{0,1\}^*$ of the binary representation defined in (2.1) satisfies that for every $n \in \mathbb{N}$, if $x = NtS(n)$ then $|x| = 1 + \max(0, \lfloor \log_2 n \rfloor)$ and $x_i = \lfloor n / 2^{\lfloor \log_2 n \rfloor - i} \rfloor \mod 2$.

b. Prove that $NtS$ is a one to one function by coming up with a function $StN \colon \{0,1\}^* \to \mathbb{N}$ such that $StN(NtS(n)) = n$ for every $n \in \mathbb{N}$. ■
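For readers who want to experiment with the claims in part a., here is a small Python sketch (my own code; it assumes $NtS$ agrees with Python's built-in bin for positive integers):

from math import floor, log2

def NtS(n: int) -> str:
    return bin(n)[2:]   # binary representation, most significant bit first

# numerically check the claimed length and digit formulas for small n
for n in range(1, 1000):
    x = NtS(n)
    assert len(x) == 1 + max(0, floor(log2(n)))
    assert all(int(x[i]) == (n // 2**(floor(log2(n)) - i)) % 2 for i in range(len(x)))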

Exercise 2.3 — More compact than ASCII representation. The ASCII encoding can be used to encode a string of $n$ English letters as a $7n$ bit binary string, but in this exercise, we ask about finding a more compact representation for strings of English lowercase letters.

1. Prove that there exists a representation scheme $(E, D)$ for strings over the 26-letter alphabet $\{a, b, c, \ldots, z\}$ as binary strings such that for every $n > 0$ and length-$n$ string $x \in \{a, b, \ldots, z\}^n$, the representation $E(x)$ is a binary string of length at most $4.8n + 1000$. In other words, prove that for every $n$, there exists a one-to-one function $E \colon \{a, b, \ldots, z\}^n \to \{0,1\}^{\lfloor 4.8n + 1000 \rfloor}$.

2. Prove that there exists no representation scheme for strings over the alphabet $\{a, b, \ldots, z\}$ as binary strings such that for every length-$n$ string $x \in \{a, b, \ldots, z\}^n$, the representation $E(x)$ is a binary string of length $\lfloor 4.6n + 1000 \rfloor$. In other words, prove that there exists some $n > 0$ such that there is no one-to-one function $E \colon \{a, b, \ldots, z\}^n \to \{0,1\}^{\lfloor 4.6n + 1000 \rfloor}$.

3. Python's bz2.compress function is a mapping from strings to strings, which uses the lossless (and hence one to one) bzip2 algorithm for compression. After converting to lowercase, and truncating spaces and numbers, the text of Tolstoy's "War and Peace" contains $n = 2,517,262$ letters. Yet, if we run bz2.compress on the string of the text of "War and Peace" we get a string of length $k = 6,274,768$ bits, which is only $2.49n$ (and in particular much smaller than $4.6n$). Explain why this does not contradict your answer to the previous question.

4. Interestingly, if we try to apply bz2.compress on a random string, we get much worse performance. In my experiments, I got a ratio of about $4.78$ between the number of bits in the output and the number of characters in the input. However, one could imagine that one could do better and that there exists a company called "Pied Piper" with an algorithm that can losslessly compress a string of $n$ random lowercase letters to fewer than $4.6n$ bits.⁵ Show that this is not the case by proving that for every $n > 100$ and one to one function $Encode \colon \{a, \ldots, z\}^n \to \{0,1\}^*$, if we let $Z$ be the random variable $|Encode(x)|$ (i.e., the length of $Encode(x)$) for $x$ chosen uniformly at random from the set $\{a, \ldots, z\}^n$, then the expected value of $Z$ is at least $4.6n$. ■

⁵ Actually that particular fictional company uses a metric that focuses more on compression speed than ratio, see here and here.
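The experiment mentioned in part 4. is easy to reproduce; the following sketch (my own code, using only the standard bz2, random, and string modules) measures the bits-per-letter ratio of bzip2 on a random lowercase string:

import bz2, random, string

text = "".join(random.choice(string.ascii_lowercase) for _ in range(100000))
compressed = bz2.compress(text.encode("ascii"))
# output bits per input letter; compare with the ratio of about 4.78 reported above
print(8 * len(compressed) / len(text))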

Exercise 2.4 — Representing graphs: upper bound. Show that there is a string representation of directed graphs with vertex set $[n]$ and degree at most $10$ that uses at most $1000 n \log n$ bits. More formally, show the following: Suppose we define for every $n \in \mathbb{N}$, the set $G_n$ as the set containing all directed graphs (with no self loops) over the vertex set $[n]$ where every vertex has degree at most $10$. Then, prove that for every sufficiently large $n$, there exists a one-to-one function $E \colon G_n \to \{0,1\}^{\lfloor 1000 n \log n \rfloor}$. ■

Exercise 2.5 — Representing graphs: lower bound. 1. Define $S_n$ to be the set of one-to-one and onto functions mapping $[n]$ to $[n]$. Prove that there is a one-to-one mapping from $S_n$ to $G_{2n}$, where $G_{2n}$ is the set defined in Exercise 2.4 above.

2. In this question you will show that one cannot improve the representation of Exercise 2.4 to length $o(n \log n)$. Specifically, prove that for every sufficiently large $n \in \mathbb{N}$ there is no one-to-one function $E \colon G_n \to \{0,1\}^{\lfloor 0.001 n \log n \rfloor + 1000}$. ■

Exercise 2.6 — Multiplying in different representation. Recall that the grade-school algorithm for multiplying two numbers requires $O(n^2)$ operations. Suppose that instead of using decimal representation, we use one of the following representations $R(x)$ to represent a number $x$ between $0$ and $10^n - 1$. For which one of these representations can you still multiply the numbers in $O(n^2)$ operations?

a. The standard binary representation: $B(x) = (x_0, \ldots, x_k)$ where $x = \sum_{i=0}^{k} x_i 2^i$ and $k$ is the largest number s.t. $x \geq 2^k$.

b. The reverse binary representation: $B(x) = (x_k, \ldots, x_0)$ where $x_i$ is defined as above for $i = 0, \ldots, k - 1$.

c. Binary coded decimal representation: $B(x) = (y_0, \ldots, y_{n-1})$ where $y_i \in \{0,1\}^4$ represents the $i^{th}$ decimal digit of $x$, mapping $0$ to $0000$, $1$ to $0001$, $2$ to $0010$, etc. (i.e. it maps $9$ to $1001$).

d. All of the above. ■

Exercise 2.7 Suppose that $R \colon \mathbb{N} \to \{0,1\}^*$ corresponds to representing a number $x$ as a string of $x$ $1$'s (e.g., $R(4) = 1111$, $R(7) = 1111111$, etc.). If $x, y$ are numbers between $0$ and $10^n - 1$, can we still multiply $x$ and $y$ using $O(n^2)$ operations if we are given them in the representation $R(\cdot)$? ■

Exercise 2.8 Recall that if $F$ is a one-to-one and onto function mapping elements of a finite set $U$ into a finite set $V$ then the sizes of $U$ and $V$ are the same. Let $B \colon \mathbb{N} \to \{0,1\}^*$ be the function such that for every $x \in \mathbb{N}$, $B(x)$ is the binary representation of $x$.

1. Prove that $x < 2^k$ if and only if $|B(x)| \leq k$.

2. Use 1. to compute the size of the set $\{y \in \{0,1\}^* \colon |y| \leq k\}$ where $|y|$ denotes the length of the string $y$.

3. Use 1. and 2. to prove that $2^k - 1 = 1 + 2 + 4 + \cdots + 2^{k-1}$. ■

Exercise 2.9 — Prefix-free encoding of tuples. Suppose that $F \colon \mathbb{N} \to \{0,1\}^*$ is a one-to-one function that is prefix-free in the sense that there is no $a \neq b$ s.t. $F(a)$ is a prefix of $F(b)$.

a. Prove that $F_2 \colon \mathbb{N} \times \mathbb{N} \to \{0,1\}^*$, defined as $F_2(a,b) = F(a)F(b)$ (i.e., the concatenation of $F(a)$ and $F(b)$) is a one-to-one function.

b. Prove that $F^* \colon \mathbb{N}^* \to \{0,1\}^*$ defined as $F^*(a_1, \ldots, a_k) = F(a_1) \cdots F(a_k)$ is a one-to-one function, where $\mathbb{N}^*$ denotes the set of all finite-length lists of natural numbers. ■

Exercise 2.10 — More efficient prefix-free transformation. Suppose that $F \colon O \to \{0,1\}^*$ is some (not necessarily prefix-free) representation of the objects in the set $O$, and $G \colon \mathbb{N} \to \{0,1\}^*$ is a prefix-free representation of the natural numbers. Define $F'(o) = G(|F(o)|)F(o)$ (i.e., the concatenation of the representation of the length of $F(o)$ and $F(o)$ itself).

a. Prove that $F'$ is a prefix-free representation of $O$.

b. Show that we can transform any representation to a prefix-free one by a modification that takes a $k$ bit string into a string of length at most $k + O(\log k)$.

c. Show that we can transform any representation to a prefix-free one by a modification that takes a $k$ bit string into a string of length at most $k + \log k + O(\log \log k)$.⁶ ■

⁶ Hint: Think recursively how to represent the length of the string.

Exercise 2.11 — Kraft's Inequality. Suppose that $S \subseteq \{0,1\}^n$ is some finite prefix-free set.

a. For every $k \leq n$ and length-$k$ string $x \in S$, let $L(x) \subseteq \{0,1\}^n$ denote all the length-$n$ strings whose first $k$ bits are $x_0, \ldots, x_{k-1}$. Prove that (1) $|L(x)| = 2^{n - |x|}$ and (2) if $x \neq x'$ then $L(x)$ is disjoint from $L(x')$.

b. Prove that $\sum_{x \in S} 2^{-|x|} \leq 1$.

c. Prove that there is no prefix-free encoding of strings with less than logarithmic overhead. That is, prove that there is no function $PF \colon \{0,1\}^* \to \{0,1\}^*$ s.t. $|PF(x)| \leq |x| + 0.9 \log |x|$ for every $x \in \{0,1\}^*$ and such that the set $\{PF(x) \colon x \in \{0,1\}^*\}$ is prefix-free. The factor $0.9$ is arbitrary; all that matters is that it is less than $1$. ■

Exercise 2.12 — Composition of one-to-one functions. Prove that for every two one-to-one functions $F \colon S \to T$ and $G \colon T \to U$, the function $H \colon S \to U$ defined as $H(x) = G(F(x))$ is one to one. ■

Exercise 2.13 — Natural numbers and strings. 1. We have shown that the natural numbers can be represented as strings. Prove that the other direction holds as well: that there is a one-to-one map $StN \colon \{0,1\}^* \to \mathbb{N}$. ($StN$ stands for "strings to numbers.")

2. Recall that Cantor proved that there is no one-to-one map $RtN \colon \mathbb{R} \to \mathbb{N}$. Show that Cantor's result implies Theorem 2.5. ■

Exercise 2.14 — Map lists of integers to a number. Recall that for every set $S$, the set $S^*$ is defined as the set of all finite sequences of members of $S$ (i.e., $S^* = \{(x_0, \ldots, x_{n-1}) \mid n \in \mathbb{N}, \forall_{i \in [n]} x_i \in S\}$). Prove that there is a one-to-one map from $\mathbb{Z}^*$ to $\mathbb{N}$ where $\mathbb{Z}$ is the set $\{\ldots, -3, -2, -1, 0, +1, +2, +3, \ldots\}$ of all integers. ■

2.8 BIBLIOGRAPHICAL NOTES

The study of representing data as strings, including issues such as compression and error correction, falls under the purview of information theory, as covered in the classic textbook of Cover and Thomas [CT06]. Representations are also studied in the field of data structures design, as covered in texts such as [Cor+09].

The question of whether to represent integers with the most significant digit first or last is known as Big Endian vs. Little Endian representation. This terminology comes from Cohen's [Coh81] entertaining and informative paper about the conflict between adherents of both schools, which he compared to the warring tribes in Jonathan Swift's "Gulliver's Travels". The two's complement representation of signed integers was suggested in von Neumann's classic report [Neu45] that detailed the design approaches for a stored-program computer, though similar representations have been used even earlier in mechanical computation devices.

The idea that we should separate the definition or specification of a function from its implementation or computation might seem "obvious," but it took quite a lot of time for mathematicians to arrive at this viewpoint. Historically, a function $F$ was identified by rules or formulas showing how to derive the output from the input. As we discuss in greater depth in Chapter 9, in the 1800s this somewhat informal notion of a function started "breaking at the seams," and eventually mathematicians arrived at the more rigorous definition of a function as an arbitrary assignment of inputs to outputs. While many functions may be described (or computed) by one or more formulas, today we do not consider that to be an essential property of functions, and also allow functions that do not correspond to any "nice" formula.

We have mentioned that all representations of the real numbers are inherently approximate. Thus an important endeavor is to understand what guarantees we can offer on the approximation quality of the output of an algorithm, as a function of the approximation quality of the inputs. This question is known as the question of determining the numerical stability of given equations. The Floating Point Guide website contains an extensive description of the floating point representation, as well as the many ways in which it could subtly fail; see also the website 0.30000000000000004.com.

Dauben [Dau90] gives a biography of Cantor with emphasis on the development of his mathematical ideas. [Hal60] is a classic textbook on set theory, including also Cantor's theorem. Cantor's Theorem is also covered in many texts on discrete mathematics, including [LLM18; LZ19].

The adjacency matrix representation of graphs is not merely a convenient way to map a graph into a binary string, but it turns out that many natural notions and operations on matrices are useful for graphs as well. (For example, Google's PageRank algorithm relies on this viewpoint.) The notes of Spielman's course are an excellent source for this area, known as spectral graph theory. We will return to this view much later in this book when we talk about random walks.

I FINITE COMPUTATION

Learning Objectives:
• See that computation can be precisely modeled.
• Learn the computational model of Boolean circuits / straight-line programs.
• Equivalence of circuits and straight-line programs.
• Equivalence of AND/OR/NOT and NAND.
• Examples of computing in the physical world.

3 Defining computation

"there is no reason why mental as well as bodily labor should not be economized by the aid of machinery", Charles Babbage, 1852

"If, unwarned by my example, any man shall undertake and shall succeed in constructing an engine embodying in itself the whole of the executive department of mathematical analysis upon different principles or by simpler mechanical means, I have no fear of leaving my reputation in his charge, for he alone will be fully able to appreciate the nature of my efforts and the value of their results.", Charles Babbage, 1864

"To understand a program you must become both the machine and the program.", Alan Perlis, 1982

People have been computing for thousands of years, with aids that include not just pen and paper, but also abacus, slide rules, various mechanical devices, and modern electronic computers. A priori, the notion of computation seems to be tied to the particular mechanism that you use. You might think that the "best" algorithm for multiplying numbers will differ if you implement it in Python on a modern laptop than if you use pen and paper. However, as we saw in the introduction (Chapter 0), an algorithm that is asymptotically better would eventually beat a worse one regardless of the underlying technology. This gives us hope for a technology independent way of defining computation. This is what we do in this chapter. We will define the notion of computing an output from an input by applying a sequence of basic operations (see Fig. 3.3). Using this, we will be able to precisely define statements such as "function $f$ can be computed by model $X$" or "function $f$ can be computed by model $X$ using $s$ operations".

Figure 3.1: Calculating wheels by Charles Babbage. Image taken from the Mark I 'operating manual'.

Figure 3.2: A 1944 Popular Mechanics article on the Harvard Mark I computer.

This chapter: A non-mathy overview
The main takeaways from this chapter are:


Figure 3.3: A function mapping strings to strings specifies a computational task, i.e., describes what the desired relation between the input and the output is. In this chapter we define models for implementing computational processes that achieve the desired relation, i.e., describe how to compute the output from the input. We will see several examples of such models using both Boolean circuits and straight-line programming languages.

• We can use logical operations such as AND, OR, and NOT to compute an output from an input (see Section 3.2).

• A Boolean circuit is a way to compose the basic logical operations to compute a more complex function (see Section 3.3). We can think of Boolean circuits as both a mathematical model (which is based on directed acyclic graphs) as well as physical devices we can construct in the real world in a variety of ways, including not just silicon-based semiconductors but also mechanical and even biological mechanisms (see Section 3.4).

• We can describe Boolean circuits also as straight-line programs, which are programs that do not have any looping constructs (i.e., no while / for / do .. until etc.), see Section 3.3.2.

• It is possible to implement the AND, OR, and NOT operations using the NAND operation (as well as vice versa). This means that circuits with AND/OR/NOT gates can compute the same functions (i.e., are equivalent in power) to circuits with NAND gates, and we can use either model to describe computation based on our convenience, see Section 3.5. To give out a "spoiler", we will see in Chapter 4 that such circuits can compute all finite functions.

One "big idea" of this chapter is the notion of equivalence between models (Big Idea 3). Two computational models are equivalent if they can compute the same set of functions.

Boolean circuits with AND/OR/NOT gates are equivalent to circuits with NAND gates, but this is just one example of the more general phenomenon that we will see many times in this book.

3.1 DEFINING COMPUTATION

The name "algorithm" is derived from the Latin transliteration of Muhammad ibn Musa al-Khwarizmi's name. Al-Khwarizmi was a Persian scholar during the 9th century whose books introduced the western world to the decimal positional numeral system, as well as to the solutions of linear and quadratic equations (see Fig. 3.4). However Al-Khwarizmi's descriptions of algorithms were rather informal by today's standards. Rather than use "variables" such as $x, y$, he used concrete numbers such as 10 and 39, and trusted the reader to be able to extrapolate from these examples, much as algorithms are still taught to children today. Here is how Al-Khwarizmi described the algorithm for solving an equation of the form $x^2 + bx = c$:

[How to solve an equation of the form "roots and squares are equal to numbers":] For instance "one square, and ten roots of the same, amount to thirty-nine dirhems"; that is to say, what must be the square which, when increased by ten of its own root, amounts to thirty-nine? The solution is this: you halve the number of the roots, which in the present instance yields five. This you multiply by itself; the product is twenty-five. Add this to thirty-nine; the sum is sixty-four. Now take the root of this, which is eight, and subtract from it half the number of roots, which is five; the remainder is three. This is the root of the square which you sought for; the square itself is nine.

For the purposes of this book, we will need a much more precise way to describe algorithms. Fortunately (or is it unfortunately?), at least at the moment, computers lag far behind school-age children in learning from examples. Hence in the 20th century, people came up with exact formalisms for describing algorithms, namely programming languages. Here is al-Khwarizmi's quadratic equation solving algorithm described in the Python programming language:

Figure 3.4: Text pages from Algebra manuscript with geometrical solutions to two quadratic equations. Shelfmark: MS. Huntington 214 fol. 004v-005r

from math import sqrt
# Pythonspeak to enable use of the sqrt function to compute square roots.

def solve_eq(b,c):
    # return solution of x^2 + bx = c following Al Khwarizmi's instructions
    # Al Kwarizmi demonstrates this for the case b=10 and c=39
    val1 = b / 2.0        # "halve the number of the roots"
    val2 = val1 * val1    # "this you multiply by itself"
    val3 = val2 + c       # "Add this to thirty-nine"
    val4 = sqrt(val3)     # "take the root of this"
    val5 = val4 - val1    # "subtract from it half the number of roots"
    return val5           # "This is the root of the square which you sought for"

# Test: solve x^2 + 10*x = 39
print(solve_eq(10,39))
# 3.0

Figure 3.5: An explanation for children of the two digit addition algorithm

We can define algorithms informally as follows:

Informal definition of an algorithm: An algorithm is a set of instructions for how to compute an output from an input by following a sequence of "elementary steps". An algorithm $A$ computes a function $F$ if for every input $x$, if we follow the instructions of $A$ on the input $x$, we obtain the output $F(x)$.

In this chapter we will make this informal definition precise using the model of Boolean Circuits. We will show that Boolean Circuits are equivalent in power to straight line programs that are written in "ultra simple" programming languages that do not even have loops. We will also see that the particular choice of elementary operations is immaterial and many different choices yield models with equivalent power (see Fig. 3.6). However, it will take us some time to get there. We will start by discussing what are "elementary operations" and how we map a description of an algorithm into an actual physical process that produces an output from an input in the real world.

Figure 3.6: An overview of the computational models defined in this chapter. We will show several equivalent ways to represent a recipe for performing a finite computation. Specifically we will show that we can model such a computation using either a Boolean circuit or a straight line program, and these two representations are equivalent to one another. We will also show that we can choose as our basic operations either the set {AND, OR, NOT} or the set {NAND}, and these two choices are equivalent in power. By making the choice of whether to use circuits or programs, and whether to use {AND, OR, NOT} or {NAND}, we obtain four equivalent ways of modeling finite computation. Moreover, there are many other choices of sets of basic operations that are equivalent in power.

3.2 COMPUTING USING AND, OR, AND NOT.

An algorithm breaks down a complex calculation into a series of simpler steps. These steps can be executed in a variety of different ways, including:

• Writing down symbols on a piece of paper.

• Modifying the current flowing on electrical wires.

• Binding a protein to a strand of DNA.

• Responding to a stimulus by a member of a collection (e.g., a bee in a colony, a trader in a market).

To formally define algorithms, let us try to "err on the side of simplicity" and model our "basic steps" as truly minimal. For example, here are some very simple functions:

• $\mathrm{OR} \colon \{0,1\}^2 \to \{0,1\}$ defined as

$$\mathrm{OR}(a,b) = \begin{cases} 0 & a = b = 0 \\ 1 & \text{otherwise} \end{cases} \tag{3.1}$$

• $\mathrm{AND} \colon \{0,1\}^2 \to \{0,1\}$ defined as

$$\mathrm{AND}(a,b) = \begin{cases} 1 & a = b = 1 \\ 0 & \text{otherwise} \end{cases} \tag{3.2}$$

• $\mathrm{NOT} \colon \{0,1\} \to \{0,1\}$ defined as

$$\mathrm{NOT}(a) = \begin{cases} 0 & a = 1 \\ 1 & a = 0 \end{cases} \tag{3.3}$$

The functions AND, OR and NOT are the basic logical operators used in logic and many computer systems. In the context of logic, it is common to use the notation $a \wedge b$ for $\mathrm{AND}(a,b)$, $a \vee b$ for $\mathrm{OR}(a,b)$, and $\neg a$ for $\mathrm{NOT}(a)$, and we will use this notation as well. Each one of the functions AND, OR, NOT takes either one or two single bits as input, and produces a single bit as output. Clearly, it cannot get much more basic than that. However, the power of computation comes from composing such simple building blocks together.

■ Example 3.1 — Majority from AND, OR and NOT. Consider the function $\mathrm{MAJ} \colon \{0,1\}^3 \to \{0,1\}$ that is defined as follows:

$$\mathrm{MAJ}(x) = \begin{cases} 1 & x_0 + x_1 + x_2 \geq 2 \\ 0 & \text{otherwise} \end{cases} \tag{3.4}$$

That is, for every $x \in \{0,1\}^3$, $\mathrm{MAJ}(x) = 1$ if and only if the majority (i.e., at least two out of the three) of $x$'s elements are equal to $1$. Can you come up with a formula involving AND, OR and NOT to compute MAJ? (It would be useful for you to pause at this point and work out the formula for yourself. As a hint, although the NOT operator is needed to compute some functions, you will not need to use it to compute MAJ.)

Let us first try to rephrase $\mathrm{MAJ}(x)$ in words: "$\mathrm{MAJ}(x) = 1$ if and only if there exists some pair of distinct elements $i, j$ such that both $x_i$ and $x_j$ are equal to $1$." In other words it means that $\mathrm{MAJ}(x) = 1$ iff either both $x_0 = 1$ and $x_1 = 1$, or both $x_1 = 1$ and $x_2 = 1$, or both $x_0 = 1$ and $x_2 = 1$. Since the OR of three conditions $c_0, c_1, c_2$ can be written as $\mathrm{OR}(c_0, \mathrm{OR}(c_1, c_2))$, we can now translate this into a formula as follows:

$$\mathrm{MAJ}(x_0, x_1, x_2) = \mathrm{OR}\bigl(\mathrm{AND}(x_0, x_1), \mathrm{OR}\bigl(\mathrm{AND}(x_1, x_2), \mathrm{AND}(x_0, x_2)\bigr)\bigr) \tag{3.5}$$

Recall that we can also write $a \vee b$ for $\mathrm{OR}(a,b)$ and $a \wedge b$ for $\mathrm{AND}(a,b)$. With this notation, (3.5) can also be written as

$$\mathrm{MAJ}(x_0, x_1, x_2) = ((x_0 \wedge x_1) \vee (x_1 \wedge x_2)) \vee (x_0 \wedge x_2). \tag{3.6}$$

We can also write (3.5) in a "programming language" form, expressing it as a set of instructions for computing MAJ given the basic operations AND, OR, NOT:

def MAJ(X[0],X[1],X[2]):
    firstpair  = AND(X[0],X[1])
    secondpair = AND(X[1],X[2])
    thirdpair  = AND(X[0],X[2])
    temp = OR(secondpair,thirdpair)
    return OR(firstpair,temp)
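It is easy to check this program against the definition in (3.4). Here is a quick sanity check in Python (my own sketch, with AND and OR implemented arithmetically as in Section 3.2.2 below):

from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)

def MAJ(x0, x1, x2):
    return OR(AND(x0, x1), OR(AND(x1, x2), AND(x0, x2)))

for x in product([0, 1], repeat=3):
    assert MAJ(*x) == (1 if sum(x) >= 2 else 0)  # matches the definition (3.4)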

3.2.1 Some properties of AND and OR
Like standard addition and multiplication, the functions AND and OR satisfy the properties of commutativity: $a \vee b = b \vee a$ and $a \wedge b = b \wedge a$, and associativity: $(a \vee b) \vee c = a \vee (b \vee c)$ and $(a \wedge b) \wedge c = a \wedge (b \wedge c)$. As in

the case of addition and multiplication, we often drop the parenthesis and write $a \vee b \vee c \vee d$ for $((a \vee b) \vee c) \vee d$, and similarly for OR's and AND's of more terms. They also satisfy a variant of the distributive law:

Solved Exercise 3.1 — Distributive law for AND and OR. Prove that for every $a, b, c \in \{0,1\}$, $a \wedge (b \vee c) = (a \wedge b) \vee (a \wedge c)$. ■

Solution: We can prove this by enumerating over all the $8$ possible values for $a, b, c \in \{0,1\}$, but it also follows from the standard distributive law. Suppose that we identify any positive integer with "true" and the value zero with "false". Then for every numbers $u, v \in \mathbb{N}$, $u + v$ is positive if and only if $u \vee v$ is true and $u \cdot v$ is positive if and only if $u \wedge v$ is true. This means that for every $a, b, c \in \{0,1\}$, the expression $a \wedge (b \vee c)$ is true if and only if $a \cdot (b + c)$ is positive, and the expression $(a \wedge b) \vee (a \wedge c)$ is true if and only if $a \cdot b + a \cdot c$ is positive. But by the standard distributive law $a \cdot (b + c) = a \cdot b + a \cdot c$ and hence the former expression is true if and only if the latter one is. ■
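The "enumerating over all the 8 possible values" approach from the solution takes just a few lines of Python (my own sketch, again using the arithmetic implementations of AND and OR):

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)

for a in [0, 1]:
    for b in [0, 1]:
        for c in [0, 1]:
            # check a AND (b OR c) == (a AND b) OR (a AND c)
            assert AND(a, OR(b, c)) == OR(AND(a, b), AND(a, c))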

3.2.2 Extended example: Computing XOR from AND, OR, and NOT
Let us see how we can obtain a different function from the same building blocks. Define $\mathrm{XOR} \colon \{0,1\}^2 \to \{0,1\}$ to be the function $\mathrm{XOR}(a,b) = a + b \mod 2$. That is, $\mathrm{XOR}(0,0) = \mathrm{XOR}(1,1) = 0$ and $\mathrm{XOR}(1,0) = \mathrm{XOR}(0,1) = 1$. We claim that we can construct XOR using only AND, OR, and NOT.

P As usual, it is a good exercise to try to work out the algorithm for XOR using AND, OR and NOT on your own before reading further.

The following algorithm computes XOR using AND, OR, and NOT:

Algorithm 3.2 — XOR from AND/OR/NOT.
Input: $a, b \in \{0,1\}$.
Output: $\mathrm{XOR}(a,b)$
1: $w1 \leftarrow \mathrm{AND}(a, b)$
2: $w2 \leftarrow \mathrm{NOT}(w1)$
3: $w3 \leftarrow \mathrm{OR}(a, b)$
4: return $\mathrm{AND}(w2, w3)$

Lemma 3.3 For every $a, b \in \{0,1\}$, on input $a, b$, Algorithm 3.2 outputs $a + b \mod 2$.

Proof. For every $a, b$, $\mathrm{XOR}(a,b) = 1$ if and only if $a$ is different from $b$. On input $a, b \in \{0,1\}$, Algorithm 3.2 outputs $\mathrm{AND}(w2, w3)$ where $w2 = \mathrm{NOT}(\mathrm{AND}(a,b))$ and $w3 = \mathrm{OR}(a,b)$.

• If $a = b = 0$ then $w3 = \mathrm{OR}(a,b) = 0$ and so the output will be $0$.

• If $a = b = 1$ then $\mathrm{AND}(a,b) = 1$ and so $w2 = \mathrm{NOT}(\mathrm{AND}(a,b)) = 0$ and the output will be $0$.

• If $a = 1$ and $b = 0$ (or vice versa) then both $w3 = \mathrm{OR}(a,b) = 1$ and $w1 = \mathrm{AND}(a,b) = 0$, in which case the algorithm will output $\mathrm{AND}(\mathrm{NOT}(w1), w3) = 1$. ■

We can also express Algorithm 3.2 using a programming language. Specifically, the following is a Python program that computes the XOR function:

def AND(a,b): return a*b
def OR(a,b): return 1-(1-a)*(1-b)
def NOT(a): return 1-a

def XOR(a,b):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    return AND(w2,w3)

# Test out the code
print([f"XOR({a},{b})={XOR(a,b)}" for a in [0,1] for b in [0,1]])
# ['XOR(0,0)=0', 'XOR(0,1)=1', 'XOR(1,0)=1', 'XOR(1,1)=0']

Solved Exercise 3.2 — Compute XOR on three bits of input. Let $\mathrm{XOR}_3 \colon \{0,1\}^3 \to \{0,1\}$ be the function defined as $\mathrm{XOR}_3(a,b,c) = a + b + c \mod 2$. That is, $\mathrm{XOR}_3(a,b,c) = 1$ if $a + b + c$ is odd, and $\mathrm{XOR}_3(a,b,c) = 0$ otherwise. Show that you can compute $\mathrm{XOR}_3$ using AND, OR, and NOT. You can express it as a formula, use a programming language such as Python, or use a Boolean circuit. ■

Solution: Addition modulo two satisfies the same properties of associativity ($(a+b)+c = a+(b+c)$) and commutativity ($a+b = b+a$) as standard addition. This means that, if we define $a \oplus b$ to equal $a + b \mod 2$, then

$$\mathrm{XOR}_3(a,b,c) = (a \oplus b) \oplus c \tag{3.7}$$

or in other words

$$\mathrm{XOR}_3(a,b,c) = \mathrm{XOR}(\mathrm{XOR}(a,b), c). \tag{3.8}$$

Since we know how to compute XOR using AND, OR, and NOT, we can compose this to compute $\mathrm{XOR}_3$ using the same building blocks. In Python this corresponds to the following program:

def XOR3(a,b,c):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    w4 = AND(w2,w3)
    w5 = AND(w4,c)
    w6 = NOT(w5)
    w7 = OR(w4,c)
    return AND(w6,w7)

# Let's test this out
print([f"XOR3({a},{b},{c})={XOR3(a,b,c)}" for a in [0,1] for b in [0,1] for c in [0,1]])
# ['XOR3(0,0,0)=0', 'XOR3(0,0,1)=1', 'XOR3(0,1,0)=1', 'XOR3(0,1,1)=0', 'XOR3(1,0,0)=1', 'XOR3(1,0,1)=0', 'XOR3(1,1,0)=0', 'XOR3(1,1,1)=1']

■

P Try to generalize the above examples to obtain a way to compute $\mathrm{XOR}_n \colon \{0,1\}^n \to \{0,1\}$ for every $n$ using at most $4n$ basic steps involving applications of a function in {AND, OR, NOT} to outputs or previously computed values.
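One possible way to carry out this generalization (my own sketch of a solution, so you may want to try it yourself first) is to fold the two-bit XOR construction of Algorithm 3.2 over the input, using four basic operations per additional bit:

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)
def NOT(a): return 1 - a

def XORn(bits):
    res = bits[0]
    for b in bits[1:]:
        w1 = AND(res, b)                # these four operations compute
        res = AND(NOT(w1), OR(res, b))  # XOR(res, b) as in Algorithm 3.2
    return res

assert XORn([1, 0, 1, 1]) == 1   # 1 + 0 + 1 + 1 = 3 is odd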

3.2.3 Informally defining "basic operations" and "algorithms"
We have seen that we can obtain at least some examples of interesting functions by composing together applications of AND, OR, and NOT. This suggests that we can use AND, OR, and NOT as our "basic operations", hence obtaining the following definition of an "algorithm":

Semi-formal definition of an algorithm: An algorithm consists of a sequence of steps of the form "compute a new value by applying AND, OR, or NOT to previously computed values".

An algorithm $A$ computes a function $F$ if for every input $x$ to $F$, if we feed $x$ as input to the algorithm, the value computed in its last step is $F(x)$.

There are several concerns that are raised by this definition:

1. First and foremost, this definition is indeed too informal. We do not specify exactly what each step does, nor what it means to "feed $x$ as input".

2. Second, the choice of AND, OR or NOT seems rather arbitrary. Why not XOR and MAJ? Why not allow operations like addition and multiplication? What about any other logical constructions such as if/then or while?

3. Third, do we even know that this definition has anything to do with actual computing? If someone gave us a description of such an algorithm, could we use it to actually compute the function in the real world?

P These concerns will to a large extent guide us in the upcoming chapters. Thus you would be well advised to re-read the above informal definition and see what you think about these issues.

A large part of this book will be devoted to addressing the above issues. We will see that:

1. We can make the definition of an algorithm fully formal, and so give a precise mathematical meaning to statements such as "Algorithm $A$ computes function $f$".

2. While the choice of AND/OR/NOT is arbitrary, and we could just as well have chosen other functions, we will also see this choice does not matter much. We will see that we would obtain the same computational power if we instead used addition and multiplication, and essentially every other operation that could be reasonably thought of as a basic step.

3. It turns out that we can and do compute such "AND/OR/NOT based algorithms" in the real world. First of all, such an algorithm is clearly well specified, and so can be executed by a human with a pen and paper. Second, there are a variety of ways to mechanize this computation. We've already seen that we can write Python code that corresponds to following such a list of instructions. But in fact we can directly implement operations such as AND, OR, and NOT via electronic signals using components known as transistors. This is how modern electronic computers operate.

In the remainder of this chapter, and the rest of this book, we will begin to answer some of these questions. We will see more examples of the power of simple operations to compute more complex operations including addition, multiplication, sorting and more. We will also discuss how to physically implement simple operations such as AND, OR and NOT using a variety of technologies.

3.3 BOOLEAN CIRCUITS

Boolean circuits provide a precise notion of "composing basic operations together". A Boolean circuit (see Fig. 3.9) is composed of gates and inputs that are connected by wires. The wires carry a signal that represents either the value $0$ or $1$. Each gate corresponds to either the OR, AND, or NOT operation. An OR gate has two incoming wires, and one or more outgoing wires. If these two incoming wires carry the signals $a$ and $b$ (for $a, b \in \{0,1\}$), then the signal on the outgoing wires will be $\mathrm{OR}(a,b)$. AND and NOT gates are defined similarly. The inputs have only outgoing wires. If we set a certain input to a value $a \in \{0,1\}$, then this value is propagated on all the wires outgoing from it. We also designate some gates as output gates, and their value corresponds to the result of evaluating the circuit. For example, Fig. 3.8 gives such a circuit for the XOR function, following Section 3.2.2. We evaluate an $n$-input Boolean circuit $C$ on an input $x \in \{0,1\}^n$ by placing the bits of $x$ on the inputs, and then propagating the values on the wires until we reach an output, see Fig. 3.9.

Figure 3.7: Standard symbols for the logical operations or "gates" of AND, OR, NOT, as well as the operation NAND discussed in Section 3.5.

Figure 3.8: A circuit with AND, OR and NOT gates for computing the XOR function.

R Remark 3.4 — Physical realization of Boolean circuits. Boolean circuits are a mathematical model that does not necessarily correspond to a physical object, but they can be implemented physically. In physical implementations of circuits, the signal is often implemented by electric potential, or voltage, on a wire, where for example voltage above a certain level is interpreted as a logical value of $1$, and below a certain level is interpreted as a logical value of $0$. Section 3.4 discusses physical implementation of Boolean circuits (with examples including using electrical signals such as in silicon-based circuits, but also biological and mechanical implementations as well).

Solved Exercise 3.3 — All equal function. Define $\mathrm{ALLEQ} \colon \{0,1\}^4 \to \{0,1\}$ to be the function that on input $x \in \{0,1\}^4$ outputs $1$ if and only if $x_0 = x_1 = x_2 = x_3$. Give a Boolean circuit for computing $\mathrm{ALLEQ}$.

Figure 3.9: A Boolean Circuit consists of gates that are connected by wires to one another and the inputs. The left side depicts a circuit with $2$ inputs and $5$ gates, one of which is designated the output gate. The right side depicts the evaluation of this circuit on the input $x \in \{0,1\}^2$ with $x_0 = 1$ and $x_1 = 0$. The value of every gate is obtained by applying the corresponding function (AND, OR, or NOT) to values on the wire(s) that enter it. The output of the circuit on a given input is the value of the output gate(s). In this case, the circuit computes the XOR function and hence it outputs $1$ on the input $10$.

Solution: Another way to describe the function $\mathrm{ALLEQ}$ is that it outputs $1$ on an input $x \in \{0,1\}^4$ if and only if $x = 0^4$ or $x = 1^4$. We can phrase the condition $x = 1^4$ as $x_0 \wedge x_1 \wedge x_2 \wedge x_3$, which can be computed using three AND gates. Similarly we can phrase the condition $x = 0^4$ as $\mathrm{NOT}(x_0) \wedge \mathrm{NOT}(x_1) \wedge \mathrm{NOT}(x_2) \wedge \mathrm{NOT}(x_3)$, which can be computed using four NOT gates and three AND gates. The output of $\mathrm{ALLEQ}$ is the OR of these two conditions, which results in the circuit of 4 NOT gates, 6 AND gates, and one OR gate presented in Fig. 3.10. ■

3.3.1 Boolean circuits: a formal definition We defined Boolean circuits informally as obtained by connecting AND, OR, and NOT gates via wires so as to produce an output from an input. However, to be able to prove theorems about the existence or non-existence of Boolean circuits for computing various functions we need to:

1. Formally define a Boolean circuit as a mathematical object.

2. Formally define what it means for a circuit $C$ to compute a function $f$.

Figure 3.10: A Boolean circuit for computing the all equal function $\mathrm{ALLEQ} \colon \{0,1\}^4 \to \{0,1\}$ that outputs $1$ on $x \in \{0,1\}^4$ if and only if $x_0 = x_1 = x_2 = x_3$.

We now proceed to do so. We will define a Boolean circuit as a labeled Directed Acyclic Graph (DAG). The vertices of the graph correspond to the gates and inputs of the circuit, and the edges of the graph correspond to the wires. A wire from an input or gate $u$ to a gate $v$ in the circuit corresponds to a directed edge between the corresponding vertices. The inputs are vertices with no incoming edges, while each gate has the appropriate number of incoming edges based on the function it computes. (That is, AND and OR gates have two in-neighbors, while NOT gates have one in-neighbor.) The formal definition is as follows (see also Fig. 3.11):

Figure 3.11: A Boolean Circuit is a labeled directed acyclic graph (DAG). It has $n$ input vertices, which are marked with X[$0$], $\ldots$, X[$n-1$] and have no incoming edges, and the rest of the vertices are gates. AND, OR, and NOT gates have two, two, and one incoming edges, respectively. If the circuit has $m$ outputs, then $m$ of the gates are known as outputs and are marked with Y[$0$], $\ldots$, Y[$m-1$]. When we evaluate a circuit $C$ on an input $x \in \{0,1\}^n$, we start by setting the value of the input vertices to $x_0, \ldots, x_{n-1}$ and then propagate the values, assigning to each gate $g$ the result of applying the operation of $g$ to the values of $g$'s in-neighbors. The output of the circuit is the value assigned to the output gates.

Definition 3.5 — Boolean Circuits. Let $n, m, s$ be positive integers with $s \geq m$. A Boolean circuit with $n$ inputs, $m$ outputs, and $s$ gates, is a labeled directed acyclic graph (DAG) $G = (V, E)$ with $s + n$ vertices satisfying the following properties:

• Exactly $n$ of the vertices have no in-neighbors. These vertices are known as inputs and are labeled with the $n$ labels X[$0$], $\ldots$, X[$n-1$]. Each input has at least one out-neighbor.

• The other $s$ vertices are known as gates. Each gate is labeled with $\wedge$, $\vee$ or $\neg$. Gates labeled with $\wedge$ (AND) or $\vee$ (OR) have two in-neighbors. Gates labeled with $\neg$ (NOT) have one in-neighbor. We will allow parallel edges.¹

• Exactly $m$ of the gates are also labeled with the $m$ labels Y[$0$], $\ldots$, Y[$m-1$] (in addition to their label $\wedge$/$\vee$/$\neg$). These are known as outputs.

The size of a Boolean circuit is the number of gates it contains.

¹ Having parallel edges means an AND or OR gate $v$ can have both its in-neighbors be the same gate $u$. Since $\mathrm{AND}(a,a) = \mathrm{OR}(a,a) = a$ for every $a \in \{0,1\}$, such parallel edges don't help in computing new values in circuits with AND/OR/NOT gates. However, we will see circuits with more general sets of gates later on.

P This is a non-trivial mathematical definition, so it is worth taking the time to read it slowly and carefully. As in all mathematical definitions, we are using a known mathematical object, a directed acyclic graph (DAG), to define a new object, a Boolean circuit. This might be a good time to review some of the basic

properties of DAGs and in particular the fact that they can be topologically sorted, see Section 1.6.

If $C$ is a circuit with $n$ inputs and $m$ outputs, and $x \in \{0,1\}^n$, then we can compute the output of $C$ on the input $x$ in the natural way: assign the input vertices X[$0$], $\ldots$, X[$n-1$] the values $x_0, \ldots, x_{n-1}$, apply each gate on the values of its in-neighbors, and then output the values that correspond to the output vertices. Formally, this is defined as follows:

Definition 3.6 — Computing a function via a Boolean circuit. Let $C$ be a Boolean circuit with $n$ inputs and $m$ outputs. For every $x \in \{0,1\}^n$, the output of $C$ on the input $x$, denoted by $C(x)$, is defined as the result of the following process:

We let $h \colon V \to \mathbb{N}$ be the minimal layering of $C$ (aka topological sorting, see Theorem 1.26). We let $L$ be the maximum layer of $h$, and for $\ell = 0, 1, \ldots, L$ we do the following:

• For every $v$ in the $\ell$-th layer (i.e., $v$ such that $h(v) = \ell$) do:

– If $v$ is an input vertex labeled with X[$i$] for some $i \in [n]$, then we assign to $v$ the value $x_i$.

– If $v$ is a gate vertex labeled with $\wedge$ and with two in-neighbors $u, w$ then we assign to $v$ the AND of the values assigned to $u$ and $w$. (Since $u$ and $w$ are in-neighbors of $v$, they are in a lower layer than $v$, and hence their values have already been assigned.)

– If $v$ is a gate vertex labeled with $\vee$ and with two in-neighbors $u, w$ then we assign to $v$ the OR of the values assigned to $u$ and $w$.

– If $v$ is a gate vertex labeled with $\neg$ and with one in-neighbor $u$ then we assign to $v$ the negation of the value assigned to $u$.

• The result of this process is the value $y \in \{0,1\}^m$ such that for every $j \in [m]$, $y_j$ is the value assigned to the vertex with label Y[$j$].

Let $f \colon \{0,1\}^n \to \{0,1\}^m$. We say that the circuit $C$ computes $f$ if for every $x \in \{0,1\}^n$, $C(x) = f(x)$.
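Definition 3.6 is essentially an algorithm, and it can help to see it as executable code. The following is a minimal Python sketch (my own, not the book's) that evaluates a circuit given as a dictionary whose keys are listed in topological order, mapping each gate to its label and in-neighbors:

def eval_circuit(gates, outputs, x):
    # gates: dict in topological order, gate name -> (label, in-neighbors)
    val = {f"X[{i}]": xi for i, xi in enumerate(x)}   # layer 0: the inputs
    for g, (label, ins) in gates.items():
        if label == "AND": val[g] = val[ins[0]] & val[ins[1]]
        elif label == "OR": val[g] = val[ins[0]] | val[ins[1]]
        elif label == "NOT": val[g] = 1 - val[ins[0]]
    return [val[g] for g in outputs]

# the XOR circuit of Fig. 3.9 in this encoding
xor = {"w1": ("AND", ["X[0]", "X[1]"]),
       "w2": ("NOT", ["w1"]),
       "w3": ("OR", ["X[0]", "X[1]"]),
       "y0": ("AND", ["w2", "w3"])}
print(eval_circuit(xor, ["y0"], [1, 0]))  # [1]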

R Remark 3.7 — Boolean circuits nitpicks (optional). In phrasing Definition 3.5, we’ve made some technical defining computation 137

choices that are not very important, but will be convenient for us later on. Having parallel edges means an AND or OR gate $v$ can have both its in-neighbors be the same gate $u$. Since $\mathrm{AND}(a,a) = \mathrm{OR}(a,a) = a$ for every $a \in \{0,1\}$, such parallel edges don't help in computing new values in circuits with AND/OR/NOT gates. However, we will see circuits with more general sets of gates later on. The condition that every input vertex has at least one out-neighbor is also not very important because we can always add "dummy gates" that touch these inputs. However, it is convenient since it guarantees (since every gate has at most two in-neighbors) that the number of inputs in a circuit is never larger than twice its size.

3.3.2 Equivalence of circuits and straight-line programs
We have seen two ways to describe how to compute a function $f$ using AND, OR and NOT:

• A Boolean circuit, defined in Definition 3.5, computes $f$ by connecting via wires AND, OR, and NOT gates to the inputs.

• We can also describe such a computation using a straight-line program that has lines of the form foo = AND(bar,blah), foo = OR(bar,blah) and foo = NOT(bar) where foo, bar and blah are variable names. (We call this a straight-line program since it contains no loops or branching (e.g., if/then) statements.)

We now formally define the AON-CIRC programming language (“AON” stands for AND/OR/NOT; “CIRC” stands for circuit) which has the above operations, and show that it is equivalent to Boolean circuits.

Definition 3.8 — AON-CIRC Programming language. An AON-CIRC program is a string of lines of the form foo = AND(bar,blah), foo = OR(bar,blah) and foo = NOT(bar) where foo, bar and blah are variable names.²

Variables of the form X[$i$] are known as input variables, and variables of the form Y[$j$] are known as output variables. In every line, the variables on the righthand side of the assignment operators must either be input variables or variables that have already been assigned a value.

A valid AON-CIRC program $P$ includes input variables of the form X[$0$], $\ldots$, X[$n-1$] and output variables of the form Y[$0$], $\ldots$, Y[$m-1$] for some $n, m \geq 1$. If $P$ is a valid AON-CIRC program and $x \in \{0,1\}^n$, then we define the output of $P$ on input $x$, denoted by $P(x)$, to be the string $y \in \{0,1\}^m$ corresponding to the values of the output variables Y[$0$], $\ldots$, Y[$m-1$] in the execution of $P$ where we initialize the input variables X[$0$], $\ldots$, X[$n-1$] to the values $x_0, \ldots, x_{n-1}$. We say that such an AON-CIRC program $P$ computes a function $f \colon \{0,1\}^n \to \{0,1\}^m$ if $P(x) = f(x)$ for every $x \in \{0,1\}^n$.

² We follow the common programming languages convention of using names such as foo, bar, baz, blah as stand-ins for generic identifiers. A variable identifier in our programming language can be any combination of letters, numbers, underscores, and brackets. The appendix contains a full formal specification of our programming language.

AON-CIRC is not a practical programming language: it was designed for pedagogical purposes only, as a way to model computation as composition of AND, OR, and NOT. However, AON-CIRC can still be easily implemented on a computer. The following solved exercise gives an example of an AON-CIRC program.

Solved Exercise 3.4 Consider the function $\mathrm{CMP} \colon \{0,1\}^4 \to \{0,1\}$ that on four input bits $a, b, c, d \in \{0,1\}$, outputs $1$ iff the number represented by $(a,b)$ is larger than the number represented by $(c,d)$. That is, $\mathrm{CMP}(a,b,c,d) = 1$ iff $2a + b > 2c + d$.

Write an AON-CIRC program to compute CMP. ■

Solution: Writing such a program is tedious but not truly hard. To compare two numbers we first compare their most significant digit, and then go down to the next digit and so on and so forth. In this case where the numbers have just two binary digits, these comparisons are particularly simple. The number represented by $(a,b)$ is larger than the number represented by $(c,d)$ if and only if one of the following conditions happens:

2. The two most significant bits and are equal, but .

Another way to express the same푎 condition푐 is the following:푏 > 푑 the number is larger than iff OR ( AND ). For binary digits , the condition is simply that and (푎,or 푏)AND NOT (푐, 푑) ,푎 and > 푐 the condition푎 ≥ 푐 푏 >is 푑 sim- ply OR NOT 훼, 훽. Together these훼 observations > 훽 can be used훼 = to 1 give훽 the = following 0 (훼, AON-CIRC(훽)) program = 1 to compute CMP훼 ≥: 훽 (훼, (훽)) = 1 temp_1 = NOT(X[2]) temp_2 = AND(X[0],temp_1) temp_3 = OR(X[0],temp_1) temp_4 = NOT(X[3]) defining computation 139

temp_5 = AND(X[1],temp_4) temp_6 = AND(temp_5,temp_3) Y[0] = OR(temp_2,temp_6)
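To double-check the program, we can transcribe it into an ordinary Python function and compare it against the specification $2a + b > 2c + d$ on all $16$ inputs (a sketch of my own, not part of the book):

from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)
def NOT(a): return 1 - a

def CMP(a, b, c, d):
    temp_1 = NOT(c)
    temp_2 = AND(a, temp_1)          # a > c
    temp_3 = OR(a, temp_1)           # a >= c
    temp_4 = NOT(d)
    temp_5 = AND(b, temp_4)          # b > d
    temp_6 = AND(temp_5, temp_3)     # a >= c AND b > d
    return OR(temp_2, temp_6)

assert all(CMP(a, b, c, d) == (1 if 2*a + b > 2*c + d else 0)
           for a, b, c, d in product([0, 1], repeat=4))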

We can also present this 7-line program as a circuit with 7 gates, see Fig. 3.12. ■

It turns out that AON-CIRC programs and Boolean circuits have exactly the same power:

Theorem 3.9 — Equivalence of circuits and straight-line programs. Let $f \colon \{0,1\}^n \to \{0,1\}^m$ and $s \geq m$ be some number. Then $f$ is computable by a Boolean circuit of $s$ gates if and only if $f$ is computable by an AON-CIRC program of $s$ lines.

Figure 3.12: A circuit for computing the CMP function. The evaluation of this circuit on $(1,1,1,0)$ yields the output $1$, since the number $3$ (represented in binary as $11$) is larger than the number $2$ (represented in binary as $10$).

Proof Idea: The idea is simple: AON-CIRC programs and Boolean circuits are just different ways of describing the exact same computational process. For example, an AND gate in a Boolean circuit corresponds to computing the AND of two previously-computed values. In an AON-CIRC program this corresponds to the line that stores in a variable the AND of two previously-computed variables.

⋆

P This proof of Theorem 3.9 is simple at heart, but all the details it contains can make it a little cumbersome to read. You might be better off trying to work it out yourself before reading it. Our GitHub repository contains a "proof by Python" of Theorem 3.9: an implementation of functions circuit2prog and prog2circuits mapping Boolean circuits to AON-CIRC programs and vice versa.

Proof of Theorem 3.9. Let $f \colon \{0,1\}^n \to \{0,1\}^m$. Since the theorem is an "if and only if" statement, to prove it we need to show both directions: translating an AON-CIRC program that computes $f$ into a circuit that computes $f$, and translating a circuit that computes $f$ into an AON-CIRC program that does so.

We start with the first direction. Let $P$ be an $s$-line AON-CIRC program that computes $f$. We define a circuit $C$ as follows: the circuit will have $n$ inputs and $s$ gates. For every $i \in [s]$, if the $i$-th line has the form foo = AND(bar,blah) then the $i$-th gate in the circuit will be an AND gate that is connected to gates $j$ and $k$ where $j$ and $k$ correspond to the last lines before $i$ where the variables bar and blah (respectively) were written to. (For example, if $i = 57$ and the last line bar was written to is $35$ and the last line blah was written to is $17$ then the two in-neighbors of gate $57$ will be gates $35$ and $17$.) If either bar or blah is an input variable then we connect the gate to the corresponding input vertex instead. If foo is an output variable of the form Y[$j$] then we add the same label to the corresponding gate to mark it as an output gate. We do the analogous operations if the $i$-th line involves an OR or a NOT operation (except that we use the corresponding OR or NOT gate, and in the latter case have only one in-neighbor instead of two). For every input $x \in \{0,1\}^n$, if we run the program $P$ on $x$, then the value that is computed in the $i$-th line is exactly the value that will be assigned to the $i$-th gate if we evaluate the circuit $C$ on $x$. Hence $C(x) = P(x)$ for every $x \in \{0,1\}^n$.

For the other direction, let $C$ be a circuit of $s$ gates and $n$ inputs that computes the function $f$. We sort the gates according to a topological order and write them as $v_0, \ldots, v_{s-1}$. We now can create a program $P$ of $s$ lines as follows. For every $i \in [s]$, if $v_i$ is an AND gate with in-neighbors $v_j, v_k$ then we will add a line to $P$ of the form temp_$i$ = AND(temp_$j$,temp_$k$), unless one of the vertices is an input vertex or an output gate, in which case we change this to the form X[.] or Y[.] appropriately. Because we work in topological ordering, we are guaranteed that the in-neighbors $v_j$ and $v_k$ correspond to variables that have already been assigned a value. We do the same for OR and NOT gates. Once again, one can verify that for every input $x$, the value $P(x)$ will equal $C(x)$ and hence the program computes the same function as the circuit. (Note that since $C$ is a valid circuit, per Definition 3.5, every input vertex of $C$ has at least one out-neighbor and there are exactly $m$ output gates labeled $0, \ldots, m-1$; hence all the variables X[0], $\ldots$, X[$n-1$] and Y[0], $\ldots$, Y[$m-1$] will appear in the program $P$.) ■
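The first direction of the proof is mechanical enough that it fits in a few lines of Python. The following is my own sketch (the book's repository functions circuit2prog and prog2circuits are the real reference); it converts a straight-line program into a DAG by pointing each argument at the input vertex or at the gate that last wrote the corresponding variable:

def prog_to_circuit(program):
    # a hypothetical minimal parser: each line has the form var = OP(arg1,...)
    circuit, last_written = {}, {}
    for i, line in enumerate(program.splitlines()):
        var, rhs = [s.strip() for s in line.split("=")]
        op = rhs[:rhs.index("(")]
        args = rhs[rhs.index("(") + 1:-1].split(",")
        ins = [a if a.startswith("X[") else last_written[a] for a in args]
        circuit[f"gate_{i}"] = (op, ins)   # the i-th line becomes the i-th gate
        last_written[var] = f"gate_{i}"
    return circuit

print(prog_to_circuit("w = AND(X[0],X[1])\nY[0] = NOT(w)"))
# {'gate_0': ('AND', ['X[0]', 'X[1]']), 'gate_1': ('NOT', ['gate_0'])}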

3.4 PHYSICAL IMPLEMENTATIONS OF COMPUTING DEVICES (DIGRESSION)

Computation is an abstract notion that is distinct from its physical implementations. While most modern computing devices are obtained by mapping logical gates to silicon-based transistors, over history people have computed using a huge variety of mechanisms, including mechanical systems, gas and liquid (known as fluidics), biological and chemical processes, and even living creatures (e.g., see Fig. 3.14 or this video for how crabs or slime mold can be used to do computations).

Figure 3.13: Two equivalent descriptions of the same AND/OR/NOT computation as both an AON program and a Boolean circuit.

In this section we will review some of these implementations, both so you can get an appreciation of how it is possible to directly translate Boolean circuits to the physical world, without going through the entire stack of architecture, operating systems, and compilers, as well as to emphasize that silicon-based processors are by no means the only way to perform computation. Indeed, as we will see in Chapter 23, a very exciting recent line of work involves using different media for computation that would allow us to take advantage of quantum mechanical effects to enable different types of algorithms.

3.4.1 Transistors
A transistor can be thought of as an electric circuit with two inputs, known as the source and the gate, and an output, known as the sink. The gate controls whether current flows from the source to the sink. In a standard transistor, if the gate is "ON" then current can flow from the source to the sink, and if it is "OFF" then it can't. In a complementary transistor this is reversed: if the gate is "OFF" then current can flow from the source to the sink, and if it is "ON" then it can't.

Figure 3.14: Crab-based logic gates from the paper "Robust soldier-crab ball gate" by Gunji, Nishiyama and Adamatzky. This is an example of an AND gate that relies on the tendency of two swarms of crabs arriving from different directions to combine to a single swarm that continues in the average of the directions.

There are several ways to implement the logic of a transistor. For example, we can use faucets to implement it using water pressure (e.g., Fig. 3.15). This might seem as merely a curiosity, but there is a field known as fluidics concerned with implementing logical operations using liquids or gasses. Some of the motivations include operating in extreme environmental conditions such as in space or a battlefield, where standard electronic equipment would not survive. The standard implementations of transistors use electrical current.

One of the original implementations used vacuum tubes. As its name implies, a vacuum tube is a tube containing nothing (i.e., a vacuum) and where a priori electrons could freely flow from the source (a wire) to the sink (a plate). However, there is a gate (a grid) between the two, where modulating its voltage can block the flow of electrons.

Figure 3.15: We can implement the logic of transistors using water. The water pressure from the gate closes or opens a faucet between the source and the sink.

Early vacuum tubes were roughly the size of lightbulbs (and looked very much like them too). In the 1950's they were supplanted by transistors, which implement the same logic using semiconductors: materials that normally do not conduct electricity but whose conductivity can be modified and controlled by inserting impurities ("doping") and applying an external electric field (this is known as the field effect). In the 1960's computers started to be implemented using integrated circuits, which enabled much greater density. In 1965, Gordon Moore predicted that the number of transistors per integrated circuit would double every year (see Fig. 3.16), and that this would lead to "such wonders as home computers —or at least terminals connected to a central computer— automatic controls for automobiles, and personal portable communications equipment". Since then, (adjusted versions of) this so-called "Moore's law" have been running strong, though exponential growth cannot be sustained forever, and some physical limitations are already becoming apparent.

3.4.2 Logical gates from transistors
We can use transistors to implement various Boolean functions such as AND, OR, and NOT. For every two-input gate $G : \{0,1\}^2 \to \{0,1\}$, such an implementation would be a system with two input wires $x, y$ and one output wire $z$, such that if we identify high voltage with "$1$" and low voltage with "$0$", then the wire $z$ will equal "$1$" if and only if applying $G$ to the values of the wires $x$ and $y$ is $1$ (see Fig. 3.19 and Fig. 3.20). This means that if there exists an AND/OR/NOT circuit to compute a function $g : \{0,1\}^n \to \{0,1\}^m$, then we can compute $g$ in the physical world using transistors as well.

Figure 3.16: The number of transistors per integrated circuit from 1959 till 1965 and a prediction that exponential growth will continue for at least another decade. Figure taken from "Cramming More Components onto Integrated Circuits", Gordon Moore, 1965.

3.4.3 Biological computing
Computation can be based on biological or chemical systems. For example, the lac operon produces the enzymes needed to digest lactose only if the conditions $x \wedge (\neg y)$ hold, where $x$ is "lactose is present" and $y$ is "glucose is present". Researchers have managed to create transistors, and from them logic gates, based on DNA molecules (see also Fig. 3.21). One motivation for DNA computing is to achieve increased parallelism or storage density; another is to create "smart biological agents" that could perhaps be injected into bodies, replicate themselves, and fix or kill cells that were damaged by a disease such as cancer. Computing in biological systems is not restricted, of course, to DNA: even larger systems such as flocks of birds can be considered as computational processes.

Figure 3.17: Cartoon from Gordon Moore's article "predicting" the implications of radically improving transistor density.

3.4.4 Cellular automata and the game of life
Cellular automata is a model of a system composed of a sequence of cells, each of which can have a finite state. At each step, a cell updates its state based on the states of its neighboring cells and some simple rules. As we will discuss later in this book (see Section 8.4), cellular automata such as Conway's "Game of Life" can be used to simulate computational gates (e.g., Fig. 3.22 shows an AND gate).

Figure 3.18: The exponential growth in computing power over the last 120 years. Graph by Steve Jurvetson, extending a prior graph of Ray Kurzweil.

3.4.5 Neural networks
One computation device that we all carry with us is our own brain.

Figure 3.19: Implementing logical gates using transistors. Figure taken from Rory Mangles' website.

Brains have served humanity throughout history, doing computations that range from distinguishing prey from predators, through making scientific discoveries and artistic masterpieces, to composing witty 280-character messages. The exact working of the brain is still not fully understood, but one common mathematical model for it is a (very large) neural network.

A neural network can be thought of as a Boolean circuit that instead of AND/OR/NOT uses some other gates as the basic basis. For example, one particular basis we can use are threshold gates. For every vector $w = (w_0, \ldots, w_{k-1})$ of integers and integer $t$ (some or all of which could be negative), the threshold function corresponding to $w, t$ is the function $T_{w,t} : \{0,1\}^k \to \{0,1\}$ that maps $x \in \{0,1\}^k$ to $1$ if and only if $\sum_{i=0}^{k-1} w_i x_i \geq t$. For example, the threshold function $T_{w,t}$ corresponding to $w = (1,1,1,1,1)$ and $t = 3$ is simply the majority function MAJ$_5$ on $\{0,1\}^5$. Threshold gates can be thought of as an approximation for neuron cells that make up the core of human and animal brains. To a first approximation, a neuron has $k$ inputs and a single output, and the neuron "fires" or "turns on" its output when those signals pass some threshold.

Figure 3.20: Implementing a NAND gate (see Section 3.5) using transistors.

Many machine learning algorithms use artificial neural networks whose purpose is not to imitate biology but rather to perform some computational tasks, and hence are not restricted to threshold or other biologically-inspired gates. Generally, a neural network is often described as operating on signals that are real numbers, rather than $0/1$ values, and where the output of a gate on inputs $x_0, \ldots, x_{k-1}$ is obtained by applying $f(\sum_i w_i x_i)$, where $f : \mathbb{R} \to \mathbb{R}$ is an activation function such as rectified linear unit (ReLU), Sigmoid, or many others (see Fig. 3.23). However, for the purposes of our discussion, all of the above are equivalent (see also Exercise 3.13). In particular we can reduce the setting of real inputs to binary inputs by representing a real number in the binary basis, and multiplying the weight of the bit corresponding to the $i^{th}$ digit by $2^i$.

Figure 3.21: Performance of DNA-based logic gates. Figure taken from paper of Bonnet et al, Science, 2013.

Figure 3.22: An AND gate using a "Game of Life" configuration. Figure taken from Jean-Philippe Rennard's paper.

3.4.6 A computer made from marbles and pipes
We can implement computation using many other physical media, without any electronic, biological, or chemical components. Many suggestions for mechanical computers have been put forward, going back at least to Gottfried Leibniz' computing machines from the 1670s and Charles Babbage's 1837 plan for a mechanical "Analytical Engine". As one example, Fig. 3.24 shows a simple implementation of a NAND (negation of AND, see Section 3.5) gate using marbles going through pipes. We represent a logical value in $\{0,1\}$ by a pair of pipes, such that there is a marble flowing through exactly one of the pipes. We call one of the pipes the "$0$ pipe" and the other the "$1$ pipe", and so the identity of the pipe containing the marble determines the logical value. A NAND gate corresponds to a mechanical object with two pairs of incoming pipes and one pair of outgoing pipes, such that for every $a, b \in \{0,1\}$, if two marbles are rolling toward the object in the $a$ pipe of the first pair and the $b$ pipe of the second pair, then a marble will roll out of the object in the NAND$(a,b)$-pipe of the outgoing pair. In fact, there is even a commercially-available educational game that uses marbles as a basis of computing, see Fig. 3.26.
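Returning to the threshold gates defined in Section 3.4.5, their definition is short enough to state directly in code. The following sketch is ours (not code from the book); it implements $T_{w,t}$ and checks the majority example given above:

# Sketch (ours): the threshold function T_{w,t} from Section 3.4.5.
def T(w, t, x):
    """Output 1 iff the weighted sum of the bits of x is at least t."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

# With w = (1,1,1,1,1) and t = 3 this is the majority function on 5 bits:
print(T([1, 1, 1, 1, 1], 3, [1, 0, 1, 1, 0]))   # 1 (three out of five bits are 1)
print(T([1, 1, 1, 1, 1], 3, [1, 0, 0, 1, 0]))   # 0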

3.5 THE NAND FUNCTION

The NAND function is another simple function that is extremely useful for defining computation. It is the function mapping $\{0,1\}^2$ to $\{0,1\}$ defined by:

$$\mathrm{NAND}(a,b) = \begin{cases} 0 & a = b = 1 \\ 1 & \text{otherwise} \end{cases} \qquad (3.9)$$

Figure 3.23: Common activation functions used in Neural Networks, including rectified linear units (ReLU), sigmoids, and hyperbolic tangent. All of those can be thought of as continuous approximations to the simple step function. All of these can be used to compute the NAND gate (see Exercise 3.13). This property enables neural networks to (approximately) compute any function that can be computed by a Boolean circuit.

As its name implies, NAND is the NOT of AND (i.e., $\mathrm{NAND}(a,b) = \mathrm{NOT}(\mathrm{AND}(a,b))$), and so we can clearly compute NAND using AND and NOT. Interestingly, the opposite direction holds as well:

Theorem 3.10 — NAND computes AND, OR, NOT. We can compute AND, OR, and NOT by composing only the NAND function.

Proof. We start with the following observation. For every $a \in \{0,1\}$, $\mathrm{AND}(a,a) = a$. Hence, $\mathrm{NAND}(a,a) = \mathrm{NOT}(\mathrm{AND}(a,a)) = \mathrm{NOT}(a)$. This means that NAND can compute NOT. By the principle of "double negation", $\mathrm{AND}(a,b) = \mathrm{NOT}(\mathrm{NOT}(\mathrm{AND}(a,b)))$, and hence we can use NAND to compute AND as well. Once we can compute AND and NOT, we can compute OR using "De Morgan's Law": $\mathrm{OR}(a,b) = \mathrm{NOT}(\mathrm{AND}(\mathrm{NOT}(a), \mathrm{NOT}(b)))$ (which can also be written as $a \vee b = \overline{\overline{a} \wedge \overline{b}}$) for every $a, b \in \{0,1\}$. ■

Figure 3.24: A physical implementation of a NAND gate using marbles. Each wire in a Boolean circuit is modeled by a pair of pipes representing the values $0$ and $1$ respectively, and hence a gate has four input pipes (two for each logical input) and two output pipes. If one of the input pipes representing the value $0$ has a marble in it then that marble will flow to the output pipe representing the value $1$. (The dashed line represents a gadget that will ensure that at most one marble is allowed to flow onward in the pipe.) If both the input pipes representing the value $1$ have marbles in them, then the first marble will be stuck but the second one will flow onwards to the output pipe representing the value $0$.

P
Theorem 3.10's proof is very simple, but you should make sure that (i) you understand the statement of the theorem, and (ii) you follow its proof. In particular, you should make sure you understand why De Morgan's law is true.
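Since the proof boils down to three Boolean identities, they can also be checked mechanically by enumeration. Here is a short verification sketch of ours (Python, not from the book):

# Sketch (ours): verifying the identities from the proof of Theorem 3.10.
NAND = lambda a, b: 1 - (a & b)

for a in [0, 1]:
    assert NAND(a, a) == 1 - a                            # NOT(a)
    for b in [0, 1]:
        assert NAND(NAND(a, b), NAND(a, b)) == (a & b)    # AND(a,b)
        assert NAND(NAND(a, a), NAND(b, b)) == (a | b)    # OR(a,b), by De Morgan
print("all identities hold")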

We can use NAND to compute many other functions, as demonstrated in the following exercise.

Solved Exercise 3.5 — Compute majority with NAND. Let $\mathrm{MAJ} : \{0,1\}^3 \to \{0,1\}$ be the function that on input $a, b, c$ outputs $1$ iff $a + b + c \geq 2$. Show how to compute MAJ using a composition of NAND's. ■

Figure 3.25: A "gadget" in a pipe that ensures that at most one marble can pass through it. The first marble that passes causes the barrier to lift and block new ones.

Solution: Recall that (3.5) states that

$$\mathrm{MAJ}(x_0, x_1, x_2) = \mathrm{OR}(\mathrm{AND}(x_0, x_1), \mathrm{OR}(\mathrm{AND}(x_1, x_2), \mathrm{AND}(x_0, x_2))) \qquad (3.10)$$

We can use Theorem 3.10 to replace all the occurrences of AND and OR with NAND's. Specifically, we can use the equivalences $\mathrm{AND}(a,b) = \mathrm{NOT}(\mathrm{NAND}(a,b))$, $\mathrm{OR}(a,b) = \mathrm{NAND}(\mathrm{NOT}(a), \mathrm{NOT}(b))$, and $\mathrm{NOT}(a) = \mathrm{NAND}(a,a)$ to replace the righthand side of (3.10) with an expression involving only NAND, yielding that $\mathrm{MAJ}(a,b,c)$ is equivalent to the (somewhat unwieldy) expression

$$\mathrm{NAND}(\mathrm{NAND}(\mathrm{NAND}(\mathrm{NAND}(a,b), \mathrm{NAND}(a,c)), \mathrm{NAND}(\mathrm{NAND}(a,b), \mathrm{NAND}(a,c))), \mathrm{NAND}(b,c)) \qquad (3.11)$$

Figure 3.26: The game "Turing Tumble" contains an implementation of logical gates using marbles.

The same formula can also be expressed as a circuit with NAND gates, see Fig. 3.27. ■
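Precisely because (3.11) is unwieldy, it is worth checking mechanically that it really computes MAJ. The following is our own verification sketch:

# Sketch (ours): checking that the expression (3.11) computes MAJ.
NAND = lambda a, b: 1 - (a & b)

def MAJ_from_NAND(a, b, c):
    u = NAND(NAND(a, b), NAND(a, c))        # equals OR(AND(a,b), AND(a,c))
    return NAND(NAND(u, u), NAND(b, c))     # equals OR(u, AND(b,c))

for a in [0, 1]:
    for b in [0, 1]:
        for c in [0, 1]:
            assert MAJ_from_NAND(a, b, c) == (1 if a + b + c >= 2 else 0)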

3.5.1 NAND Circuits
We define NAND circuits as circuits in which all the gates are NAND operations. Such a circuit again corresponds to a directed acyclic graph (DAG). Since all the gates correspond to the same function (i.e., NAND), we do not even need to label them, and all gates have in-degree exactly two. Despite their simplicity, NAND circuits can be quite powerful.

Figure 3.27: A circuit with NAND gates to compute the Majority function on three bits.

■ Example 3.11 — NAND circuit for XOR. Recall the XOR function which maps $x_0, x_1 \in \{0,1\}$ to $x_0 + x_1 \mod 2$. We have seen in Section 3.2.2 that we can compute XOR using AND, OR, and NOT, and so by Theorem 3.10 we can compute it using only NAND's. However, the following is a direct construction of computing XOR by a sequence of NAND operations:

1. Let $u = \mathrm{NAND}(x_0, x_1)$.
2. Let $v = \mathrm{NAND}(x_0, u)$.
3. Let $w = \mathrm{NAND}(x_1, u)$.
4. The XOR of $x_0$ and $x_1$ is $y_0 = \mathrm{NAND}(v, w)$.

One can verify that this algorithm does indeed compute XOR by enumerating all the four choices for $x_0, x_1 \in \{0,1\}$. We can also represent this algorithm graphically as a circuit, see Fig. 3.28.
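This enumeration takes only a few lines of code; the following sketch of ours carries it out:

# Sketch (ours): the four-step NAND construction of XOR from Example 3.11.
NAND = lambda a, b: 1 - (a & b)

def XOR_from_NAND(x0, x1):
    u = NAND(x0, x1)
    v = NAND(x0, u)
    w = NAND(x1, u)
    return NAND(v, w)

for x0 in [0, 1]:
    for x1 in [0, 1]:
        assert XOR_from_NAND(x0, x1) == (x0 + x1) % 2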

In fact, we can show the following theorem:

Theorem 3.12 — NAND is a universal operation. For every Boolean circuit $C$ of $s$ gates, there exists a NAND circuit $C'$ of at most $3s$ gates that computes the same function as $C$.

Figure 3.28: A circuit with NAND gates to compute the XOR of two bits.

Proof Idea:
The idea of the proof is to just replace every AND, OR and NOT gate with their NAND implementation following the proof of Theorem 3.10. ⋆

Proof of Theorem 3.12. If $C$ is a Boolean circuit, then since, as we've seen in the proof of Theorem 3.10, for every $a, b \in \{0,1\}$

• $\mathrm{NOT}(a) = \mathrm{NAND}(a,a)$

• $\mathrm{AND}(a,b) = \mathrm{NAND}(\mathrm{NAND}(a,b), \mathrm{NAND}(a,b))$

• $\mathrm{OR}(a,b) = \mathrm{NAND}(\mathrm{NAND}(a,a), \mathrm{NAND}(b,b))$

we can replace every gate of $C$ with at most three NAND gates to obtain an equivalent circuit $C'$. The resulting circuit will have at most $3s$ gates. ■

 Big Idea 3 Two models are equivalent in power if they can be used to compute the same set of functions.

3.5.2 More examples of NAND circuits (optional)
Here are some more sophisticated examples of NAND circuits:

Incrementing integers. Consider the task of computing, given as input a string $x \in \{0,1\}^n$ that represents a natural number $X \in \mathbb{N}$, the representation of $X + 1$. That is, we want to compute the function $\mathrm{INC}_n : \{0,1\}^n \to \{0,1\}^{n+1}$ such that for every $x_0, \ldots, x_{n-1}$, $\mathrm{INC}_n(x) = y$ which satisfies $\sum_{i=0}^{n} y_i 2^i = \left(\sum_{i=0}^{n-1} x_i 2^i\right) + 1$. (For simplicity of notation, in this example we use the representation where the least significant digit is first rather than last.)

The increment operation can be very informally described as follows: "Add $1$ to the least significant bit and propagate the carry". A little more precisely, in the case of the binary representation, to obtain the increment of $x$, we scan $x$ from the least significant bit onwards, and flip all $1$'s to $0$'s until we encounter a bit equal to $0$, in which case we flip it to $1$ and stop. Thus we can compute the increment of $x_0, \ldots, x_{n-1}$ by doing the following:

Algorithm 3.13 — Compute Increment Function.
Input: $x_0, x_1, \ldots, x_{n-1}$ representing the number $\sum_{i=0}^{n-1} x_i \cdot 2^i$  (we use LSB-first representation)
Output: $y \in \{0,1\}^{n+1}$ such that $\sum_{i=0}^{n} y_i \cdot 2^i = \sum_{i=0}^{n-1} x_i \cdot 2^i + 1$
1: Let $c_0 \leftarrow 1$  (we pretend we have a "carry" of $1$ initially)
2: for $i = 0, \ldots, n-1$ do
3:   Let $y_i \leftarrow \mathrm{XOR}(x_i, c_i)$.
4:   if $c_i = x_i = 1$ then
5:     $c_{i+1} = 1$
6:   else
7:     $c_{i+1} = 0$
8:   end if
9: end for
10: Let $y_n \leftarrow c_n$.

Algorithm 3.13 describes precisely how to compute the increment operation, and can be easily transformed into Python code that performs the same computation, but it does not seem to directly yield a NAND circuit to compute this. However, we can transform this algorithm line by line to a NAND circuit. For example, since for every $a$, $\mathrm{NAND}(a, \mathrm{NOT}(a)) = 1$, we can replace the initial statement $c_0 = 1$ with $c_0 = \mathrm{NAND}(x_0, \mathrm{NAND}(x_0, x_0))$. We already know how to compute XOR using NAND and so we can use this to implement the operation $y_i \leftarrow \mathrm{XOR}(x_i, c_i)$. Similarly, we can write the "if" statement as saying $c_{i+1} \leftarrow \mathrm{AND}(c_i, x_i)$, or in other words $c_{i+1} \leftarrow \mathrm{NAND}(\mathrm{NAND}(c_i, x_i), \mathrm{NAND}(c_i, x_i))$. Finally, the assignment $y_n = c_n$ can be written as $y_n = \mathrm{NAND}(\mathrm{NAND}(c_n, c_n), \mathrm{NAND}(c_n, c_n))$. Combining these observations yields, for every $n \in \mathbb{N}$, a NAND circuit to compute $\mathrm{INC}_n$. For example, Fig. 3.29 shows what this circuit looks like for $n = 4$.
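As noted above, Algorithm 3.13 transcribes directly into Python. Here is one such transcription (a sketch of ours, representing numbers as LSB-first lists of bits):

# Sketch (ours): direct transcription of Algorithm 3.13, LSB-first bits.
def INC(x):
    n = len(x)
    y = [0] * (n + 1)
    c = 1                       # the initial "carry" of 1
    for i in range(n):
        y[i] = x[i] ^ c         # y_i = XOR(x_i, c_i)
        c = 1 if (c == 1 and x[i] == 1) else 0
    y[n] = c
    return y

print(INC([1, 1, 0]))           # 3 + 1 = 4, i.e. [0, 0, 1, 0]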

Figure 3.29: NAND circuit computing the increment function on $4$ bits.

From increment to addition. Once we have the increment operation, we can certainly compute addition by repeatedly incrementing (i.e., compute $x + y$ by performing $\mathrm{INC}(x)$ $y$ times). However, that would be quite inefficient and unnecessary. With the same idea of keeping track of carries we can implement the "grade-school" addition algorithm and compute the function $\mathrm{ADD}_n : \{0,1\}^{2n} \to \{0,1\}^{n+1}$ that on input $x \in \{0,1\}^{2n}$ outputs the binary representation of the sum of the numbers represented by $x_0, \ldots, x_{n-1}$ and $x_n, \ldots, x_{2n-1}$:

Algorithm 3.14 — Addition using NAND.
Input: $u \in \{0,1\}^n$, $v \in \{0,1\}^n$ representing numbers in LSB-first binary representation.
Output: LSB-first binary representation of $u + v$.
1: Let $c_0 \leftarrow 0$
2: for $i = 0, \ldots, n-1$ do
3:   Let $y_i \leftarrow u_i + v_i + c_i \mod 2$
4:   if $u_i + v_i + c_i \geq 2$ then
5:     $c_{i+1} \leftarrow 1$
6:   else
7:     $c_{i+1} \leftarrow 0$
8:   end if
9: end for
10: Let $y_n \leftarrow c_n$

Once again, Algorithm 3.14 can be translated into a NAND circuit. The crucial observation is that the "if/then" statement simply corresponds to $c_{i+1} \leftarrow \mathrm{MAJ}_3(u_i, v_i, c_i)$, and we have seen in Solved Exercise 3.5 that the function $\mathrm{MAJ}_3 : \{0,1\}^3 \to \{0,1\}$ can be computed using NANDs.

3.5.3 The NAND-CIRC Programming language
Just like we did for Boolean circuits, we can define a programming-language analog of NAND circuits. It is even simpler than the AON-CIRC language since we only have a single operation. We define the NAND-CIRC Programming Language to be a programming language where every line has the following form:

foo = NAND(bar,blah)

where foo, bar and blah are variable identifiers.

■ Example 3.15 — Our first NAND-CIRC program. Here is an example of a NAND-CIRC program:

u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)

P Do you know what function this program computes? Hint: you have seen it before.
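One way to answer the question above is to simply run the program on all four inputs. The following interpreter is a sketch of ours (it assumes every line has exactly the form foo = NAND(bar,blah)):

# Sketch (ours): a tiny interpreter for NAND-CIRC source code.
import re

def eval_nand_circ(code, n, m, x):
    env = {f"X[{i}]": x[i] for i in range(n)}
    for line in code.strip().splitlines():
        target, a, b = re.match(r"(\S+) = NAND\((\S+),(\S+)\)",
                                line.strip()).groups()
        env[target] = 1 - (env[a] & env[b])
    return [env[f"Y[{j}]"] for j in range(m)]

prog = """u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)"""

for x0 in [0, 1]:
    for x1 in [0, 1]:
        print(x0, x1, eval_nand_circ(prog, 2, 1, [x0, x1]))  # the truth table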

Formally, just like we did in Definition 3.8 for AON-CIRC, we can define the notion of computation by a NAND-CIRC program in the natural way:

Definition 3.16 — Computing by a NAND-CIRC program. Let $f : \{0,1\}^n \to \{0,1\}^m$ be some function, and let $P$ be a NAND-CIRC program. We say that $P$ computes the function $f$ if:

1. $P$ has $n$ input variables X[$0$], $\ldots$, X[$n-1$] and $m$ output variables Y[$0$], $\ldots$, Y[$m-1$].

2. For every $x \in \{0,1\}^n$, if we execute $P$ when we assign to X[$0$], $\ldots$, X[$n-1$] the values $x_0, \ldots, x_{n-1}$, then at the end of the execution, the output variables Y[$0$], $\ldots$, Y[$m-1$] have the values $y_0, \ldots, y_{m-1}$ where $y = f(x)$.

As before we can show that NAND circuits are equivalent to NAND-CIRC programs (see Fig. 3.30):

Theorem 3.17 — NAND circuits and straight-line program equivalence. For every $f : \{0,1\}^n \to \{0,1\}^m$ and $s \geq m$, $f$ is computable by a NAND-CIRC program of $s$ lines if and only if $f$ is computable by a NAND circuit of $s$ gates.

Figure 3.30: A NAND program and the corresponding circuit. Note how every line in the program corresponds to a gate in the circuit.

We omit the proof of Theorem 3.17 since it follows along exactly the same lines as the equivalence of Boolean circuits and AON-CIRC programs (Theorem 3.9). Given Theorem 3.17 and Theorem 3.12, we know that we can translate every $s$-line AON-CIRC program $P$ into an equivalent NAND-CIRC program of at most $3s$ lines. In fact, this translation can be easily done by replacing every line of the form foo = AND(bar,blah), foo = OR(bar,blah) or foo = NOT(bar) with the equivalent 1-3 lines that use the NAND operation. Our GitHub repository contains a "proof by code": a simple Python program AON2NAND that transforms an AON-CIRC into an equivalent NAND-CIRC program.
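We do not reproduce the repository's AON2NAND program here, but the following sketch of ours conveys the idea, replacing each AND/OR/NOT line by the NAND equivalents from the proof of Theorem 3.12 (it assumes the input uses only AND/OR/NOT lines):

# Sketch (ours, not the repository's AON2NAND): line-by-line translation.
import re

def aon_to_nand(code):
    out = []
    for line in code.strip().splitlines():
        t, op, args = re.match(r"(\S+) = (AND|OR|NOT)\((.*)\)",
                               line.strip()).groups()
        if op == "NOT":                      # NOT(a) = NAND(a,a)
            out.append(f"{t} = NAND({args},{args})")
        elif op == "AND":                    # AND(a,b) = NAND(NAND(a,b),NAND(a,b))
            a, b = args.split(",")
            out += [f"{t}_tmp = NAND({a},{b})",
                    f"{t} = NAND({t}_tmp,{t}_tmp)"]
        else:                                # OR(a,b) = NAND(NAND(a,a),NAND(b,b))
            a, b = args.split(",")
            out += [f"{t}_na = NAND({a},{a})",
                    f"{t}_nb = NAND({b},{b})",
                    f"{t} = NAND({t}_na,{t}_nb)"]
    return "\n".join(out)

Each original line becomes at most three NAND lines, matching the $3s$ bound of Theorem 3.12.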

R Remark 3.18 — Is the NAND-CIRC programming language Turing Complete? (optional note). You might have heard of a term called "Turing Complete" that is sometimes used to describe programming languages. (If you haven't, feel free to ignore the rest of this remark: we define this term precisely in Chapter 8.) If so, you might wonder if the NAND-CIRC programming language has this property. The answer is no, or perhaps more accurately, the term "Turing Completeness" is not really applicable for the NAND-CIRC programming language. The reason is that, by design, the NAND-CIRC programming language can only compute finite functions $F : \{0,1\}^n \to \{0,1\}^m$ that take a fixed number $n$ of input bits and produce a fixed number $m$ of output bits. The term "Turing Complete" is only applicable to programming languages for infinite functions that can take inputs of arbitrary length. We will come back to this distinction later on in this book.

3.6 EQUIVALENCE OF ALL THESE MODELS

If we put together Theorem 3.9, Theorem 3.12, and Theorem 3.17, we obtain the following result:

Theorem 3.19 — Equivalence between models of finite computation. For every sufficiently large $s, n, m$ and $f : \{0,1\}^n \to \{0,1\}^m$, the following conditions are all equivalent to one another:

• $f$ can be computed by a Boolean circuit (with $\wedge, \vee, \neg$ gates) of at most $O(s)$ gates.

• $f$ can be computed by an AON-CIRC straight-line program of at most $O(s)$ lines.

• $f$ can be computed by a NAND circuit of at most $O(s)$ gates.

• $f$ can be computed by a NAND-CIRC straight-line program of at most $O(s)$ lines.

By "$O(s)$" we mean that the bound is at most $c \cdot s$ where $c$ is a constant that is independent of $n$. For example, if $f$ can be computed by a Boolean circuit of $s$ gates, then it can be computed by a NAND-CIRC program of at most $3s$ lines, and if $f$ can be computed by a NAND circuit of $s$ gates, then it can be computed by an AON-CIRC program of at most $2s$ lines.

Proof Idea:
We omit the formal proof, which is obtained by combining Theorem 3.9, Theorem 3.12, and Theorem 3.17. The key observation is that the results we have seen allow us to translate a program/circuit that computes $f$ in one of the above models into a program/circuit that computes $f$ in another model by increasing the lines/gates by at most a constant factor (in fact this constant factor is at most $3$). ⋆

Theorem 3.9 is a special case of a more general result. We can consider even more general models of computation, where instead of AND/OR/NOT or NAND, we use other operations (see Section 3.6.1 below). It turns out that Boolean circuits are equivalent in power to such models as well. The fact that all these different ways to define computation lead to equivalent models shows that we are "on the right track". It justifies the seemingly arbitrary choices that we've made of using AND/OR/NOT or NAND as our basic operations, since these choices do not affect the power of our computational model. Equivalence results such as Theorem 3.19 mean that we can easily translate between Boolean circuits, NAND circuits, NAND-CIRC programs and the like. We will use this ability later on in this book, often shifting to the most convenient formulation without making a big deal about it. Hence we will not worry too much about the distinction between, for example, Boolean circuits and NAND-CIRC programs.

In contrast, we will continue to take special care to distinguish between circuits/programs and functions (recall Big Idea 2). A function corresponds to a specification of a computational task, and it is a fundamentally different object than a program or a circuit, which corresponds to the implementation of the task.

3.6.1 Circuits with other gate sets
There is nothing special about AND/OR/NOT or NAND. For every set of functions $\mathcal{G} = \{G_0, \ldots, G_{k-1}\}$, we can define a notion of circuits that use elements of $\mathcal{G}$ as gates, and a notion of a "$\mathcal{G}$ programming language" where every line involves assigning to a variable foo the result of applying some $G_i \in \mathcal{G}$ to previously defined or input variables. Specifically, we can make the following definition:

Definition 3.20 — General straight-line programs. Let $\mathcal{F} = \{f_0, \ldots, f_{t-1}\}$ be a finite collection of Boolean functions, such that $f_i : \{0,1\}^{k_i} \to \{0,1\}$ for some $k_i \in \mathbb{N}$. An $\mathcal{F}$ program is a sequence of lines, each of which assigns to some variable the result of applying some $f_i \in \mathcal{F}$ to $k_i$ other variables. As above, we use X[$i$] and Y[$j$] to denote the input and output variables.

We say that $\mathcal{F}$ is a universal set of operations (also known as a universal gate set) if there exists an $\mathcal{F}$ program to compute the function NAND.

AON-CIRC programs correspond to $\{\mathrm{AND}, \mathrm{OR}, \mathrm{NOT}\}$ programs, NAND-CIRC programs correspond to $\mathcal{F}$ programs for the set $\mathcal{F}$ that only contains the NAND function, but we can also define $\{\mathrm{IF}, \mathrm{ZERO}, \mathrm{ONE}\}$ programs (see below), or use any other set.

We can also define $\mathcal{F}$ circuits, which will be directed acyclic graphs in which each gate corresponds to applying a function $f_i \in \mathcal{F}$, and will each have $k_i$ incoming wires and a single outgoing wire. (If the function $f_i$ is not symmetric, in the sense that the order of its input matters, then we need to label each wire entering a gate as to which parameter of the function it corresponds to.) As in Theorem 3.9, we can show that $\mathcal{F}$ circuits and $\mathcal{F}$ programs are equivalent. We have seen that for $\mathcal{F} = \{\mathrm{AND}, \mathrm{OR}, \mathrm{NOT}\}$, the resulting circuits/programs are equivalent in power to the NAND-CIRC programming language, as we can compute NAND using AND/OR/NOT and vice versa. This turns out to be a special case of a general phenomenon — the universality of NAND and other gate sets — that we will explore more in depth later in this book.

■ Example 3.21 — IF,ZERO,ONE circuits. Let $\mathcal{F} = \{\mathrm{IF}, \mathrm{ZERO}, \mathrm{ONE}\}$ where $\mathrm{ZERO} : \{0,1\} \to \{0\}$ and $\mathrm{ONE} : \{0,1\} \to \{1\}$ are the constant zero and one functions (one can also define these functions as taking a length-zero input; this makes no difference for the computational power of the model), and $\mathrm{IF} : \{0,1\}^3 \to \{0,1\}$ is the function that on input $(a,b,c)$ outputs $b$ if $a = 1$ and $c$ otherwise. Then $\mathcal{F}$ is universal. Indeed, we can demonstrate that $\{\mathrm{IF}, \mathrm{ZERO}, \mathrm{ONE}\}$ is universal using the following formula for NAND:

$$\mathrm{NAND}(a,b) = \mathrm{IF}(a, \mathrm{IF}(b, \mathrm{ZERO}, \mathrm{ONE}), \mathrm{ONE}). \qquad (3.12)$$

There are also some sets $\mathcal{F}$ that are more restricted in power. For example it can be shown that if we use only AND or OR gates (without NOT) then we do not get an equivalent model of computation. The exercises cover several examples of universal and non-universal gate sets.
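It is easy to check formula (3.12) by enumeration; the sketch below is ours, and it treats the constant functions as plain constants, as the parenthetical remark above permits:

# Sketch (ours): checking formula (3.12) on all four inputs.
IF = lambda a, b, c: b if a == 1 else c
ZERO, ONE = 0, 1          # the constant functions, viewed as constants

def NAND_from_IF(a, b):
    return IF(a, IF(b, ZERO, ONE), ONE)

for a in [0, 1]:
    for b in [0, 1]:
        assert NAND_from_IF(a, b) == 1 - (a & b)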

3.6.2 Specification vs. implementation (again)
As we discussed in Section 2.6.1, one of the most important distinctions in this book is that of specification versus implementation, or separating "what" from "how" (see Fig. 3.31). A function corresponds to the specification of a computational task, that is what output should be produced for every particular input. A program (or circuit, or any other way to specify algorithms) corresponds to the implementation of how to compute the desired output from the input. That is, a program is a set of instructions on how to compute the output from the input. Even within the same computational model there can be many different ways to compute the same function.

Figure 3.31: It is crucial to distinguish between the specification of a computational task, namely what is the function that is to be computed, and the implementation of it, namely the algorithm, program, or circuit that contains the instructions defining how to map an input to an output. The same function could be computed in many different ways.

For example, there is more than one NAND-CIRC program that computes the majority function, more than one Boolean circuit to compute the addition function, and so on and so forth.

Confusing specification and implementation (or equivalently functions and programs) is a common mistake, and one that is unfortunately encouraged by the common programming-language terminology of referring to parts of programs as "functions". However, in both the theory and practice of computer science, it is important to maintain this distinction, and it is particularly important for us in this book.

Chapter Recap

• An algorithm is a recipe for performing a computation as a sequence of "elementary" or "simple" operations.
• One candidate definition for "elementary" operations is the set AND, OR and NOT.
• Another candidate definition for an "elementary" operation is the NAND operation. It is an operation that is easily implementable in the physical world in a variety of methods including by electronic transistors.
• We can use NAND to compute many other functions, including majority, increment, and others.
• There are other equivalent choices, including the sets $\{\mathrm{AND}, \mathrm{OR}, \mathrm{NOT}\}$ and $\{\mathrm{IF}, \mathrm{ZERO}, \mathrm{ONE}\}$.
• We can formally define the notion of a function $F : \{0,1\}^n \to \{0,1\}^m$ being computable using the NAND-CIRC Programming language.
• For every set of basic operations, the notions of being computable by a circuit and being computable by a straight-line program are equivalent.

3.7 EXERCISES

Exercise 3.1 — Compare 4-bit numbers. Give a Boolean circuit (with AND/OR/NOT gates) that computes the function $\mathrm{CMP}_8 : \{0,1\}^8 \to \{0,1\}$ such that $\mathrm{CMP}_8(a_0, a_1, a_2, a_3, b_0, b_1, b_2, b_3) = 1$ if and only if the number represented by $a_0 a_1 a_2 a_3$ is larger than the number represented by $b_0 b_1 b_2 b_3$. ■

Exercise 3.2 — Compare $n$-bit numbers. Prove that there exists a constant $c$ such that for every $n$ there is a Boolean circuit (with AND/OR/NOT gates) $C$ of at most $c \cdot n$ gates that computes the function $\mathrm{CMP}_{2n} : \{0,1\}^{2n} \to \{0,1\}$ such that $\mathrm{CMP}_{2n}(a_0 \cdots a_{n-1} b_0 \cdots b_{n-1}) = 1$ if and only if the number represented by $a_0 \cdots a_{n-1}$ is larger than the number represented by $b_0 \cdots b_{n-1}$. ■

Exercise 3.3 — OR,NOT is universal. Prove that the set $\{\mathrm{OR}, \mathrm{NOT}\}$ is universal, in the sense that one can compute NAND using these gates. ■

Exercise 3.4 — AND,OR is not universal. Prove that for every $n$-bit input circuit $C$ that contains only AND and OR gates, as well as gates that compute the constant functions $0$ and $1$, $C$ is monotone, in the sense that if $x, x' \in \{0,1\}^n$ and $x_i \leq x'_i$ for every $i \in [n]$, then $C(x) \leq C(x')$. Conclude that the set $\{\mathrm{AND}, \mathrm{OR}, 0, 1\}$ is not universal. ■

Exercise 3.5 — XOR is not universal. Prove that for every $n$-bit input circuit $C$ that contains only XOR gates, as well as gates that compute the constant functions $0$ and $1$, $C$ is affine or linear modulo two, in the sense that there exists some $a \in \{0,1\}^n$ and $b \in \{0,1\}$ such that for every $x \in \{0,1\}^n$, $C(x) = \sum_{i=0}^{n-1} a_i x_i + b \mod 2$. Conclude that the set $\{\mathrm{XOR}, 0, 1\}$ is not universal. ■

Exercise 3.6 — MAJ,NOT,1 is universal. Let $\mathrm{MAJ} : \{0,1\}^3 \to \{0,1\}$ be the majority function. Prove that $\{\mathrm{MAJ}, \mathrm{NOT}, 1\}$ is a universal set of gates. ■

Exercise 3.7 — MAJ,NOT is not universal. Prove that $\{\mathrm{MAJ}, \mathrm{NOT}\}$ is not a universal set. (Hint: Use the fact that $\mathrm{MAJ}(\neg a, \neg b, \neg c) = \neg \mathrm{MAJ}(a, b, c)$ to prove that every $f : \{0,1\}^n \to \{0,1\}$ computable by a circuit with only MAJ and NOT gates satisfies $f(0, 0, \ldots, 0) \neq f(1, 1, \ldots, 1)$. Thanks to Nathan Brunelle and David Evans for suggesting this exercise.) ■

Exercise 3.8 — NOR is universal. Let $\mathrm{NOR} : \{0,1\}^2 \to \{0,1\}$ be defined as $\mathrm{NOR}(a,b) = \mathrm{NOT}(\mathrm{OR}(a,b))$. Prove that $\{\mathrm{NOR}\}$ is a universal set of gates. ■

Exercise 3.9 — Lookup is universal. Prove that $\{\mathrm{LOOKUP}_1, 0, 1\}$ is a universal set of gates where $0$ and $1$ are the constant functions and $\mathrm{LOOKUP}_1 : \{0,1\}^3 \to \{0,1\}$ satisfies: $\mathrm{LOOKUP}_1(a,b,c)$ equals $a$ if $c = 0$ and equals $b$ if $c = 1$. ■

Exercise 3.10 — Bound on universal basis size (challenge). Prove that for every subset $B$ of the functions from $\{0,1\}^k$ to $\{0,1\}$, if $B$ is universal then there is a $B$-circuit of at most $O(1)$ gates to compute the NAND function (you can start by showing that there is a $B$-circuit of at most $O(k^{16})$ gates). (Thanks to Alec Sun and Simon Fischer for comments on this problem.) ■

Exercise 3.11 — Size and inputs / outputs. Prove that for every NAND circuit of size $s$ with $n$ inputs and $m$ outputs, $s \geq \min\{n/2, m\}$. (Hint: Use the conditions of Definition 3.5 stipulating that every input vertex has at least one out-neighbor and there are exactly $m$ output gates. See also Remark 3.7.) ■

Exercise 3.12 — Threshold using NANDs. Prove that there is some constant $c$ such that for every $n > 1$, and integers $a_0, \ldots, a_{n-1}, b \in \{-2^n, -2^n + 1, \ldots, -1, 0, +1, \ldots, 2^n\}$, there is a NAND circuit with at most $c \cdot n^4$ gates that computes the threshold function $f_{a_0, \ldots, a_{n-1}, b} : \{0,1\}^n \to \{0,1\}$ that on input $x \in \{0,1\}^n$ outputs $1$ if and only if $\sum_{i=0}^{n-1} a_i x_i > b$. ■

Exercise 3.13 — NANDs from activation functions. We say that a function $f : \mathbb{R}^2 \to \mathbb{R}$ is a NAND approximator if it has the following property: for every $a, b \in \mathbb{R}$, if $\min\{|a|, |1-a|\} \leq 1/3$ and $\min\{|b|, |1-b|\} \leq 1/3$ then $|f(a,b) - \mathrm{NAND}(\lfloor a \rceil, \lfloor b \rceil)| \leq 1/3$, where we denote by $\lfloor x \rceil$ the integer closest to $x$. That is, if $a, b$ are within a distance $1/3$ to $\{0,1\}$ then we want $f(a,b)$ to equal the NAND of the values in $\{0,1\}$ that are closest to $a$ and $b$ respectively. Otherwise, we do not care what the output of $f$ is on $a$ and $b$.

In this exercise you will show that you can construct a NAND approximator from many common activation functions used in deep neural networks. As a corollary you will obtain that deep neural networks can simulate NAND circuits. Since NAND circuits can also simulate deep neural networks, these two computational models are equivalent to one another.

1. Show that there is a NAND approximator $f$ defined as $f(a,b) = L(ReLU(L'(a,b)))$ where $L' : \mathbb{R}^2 \to \mathbb{R}$ is an affine function (of the form $L'(a,b) = \alpha a + \beta b + \gamma$ for some $\alpha, \beta, \gamma \in \mathbb{R}$), $L$ is an affine function (of the form $L(y) = \alpha y + \beta$ for $\alpha, \beta \in \mathbb{R}$), and $ReLU : \mathbb{R} \to \mathbb{R}$ is the function defined as $ReLU(x) = \max\{0, x\}$.

2. Show that there is a NAND approximator $f$ defined as $f(a,b) = L(sigmoid(L'(a,b)))$ where $L, L'$ are affine as above and $sigmoid : \mathbb{R} \to \mathbb{R}$ is the function defined as $sigmoid(x) = e^x/(e^x + 1)$.

3. Show that there is a NAND approximator $f$ defined as $f(a,b) = L(tanh(L'(a,b)))$ where $L, L'$ are affine as above and $tanh : \mathbb{R} \to \mathbb{R}$ is the function defined as $tanh(x) = (e^x - e^{-x})/(e^x + e^{-x})$.

4. Prove that for every NAND-circuit $C$ with $n$ inputs and one output that computes a function $g : \{0,1\}^n \to \{0,1\}$, if we replace every gate of $C$ with a NAND-approximator and then invoke the resulting circuit on some $x \in \{0,1\}^n$, the output will be a number $y$ such that $|y - g(x)| \leq 1/3$. ■

Exercise 3.14 — Majority with NANDs efficiently. Prove that there is some constant $c$ such that for every $n > 1$, there is a NAND circuit of at most $c \cdot n$ gates that computes the majority function on $n$ input bits $\mathrm{MAJ}_n : \{0,1\}^n \to \{0,1\}$. That is, $\mathrm{MAJ}_n(x) = 1$ iff $\sum_{i=0}^{n-1} x_i > n/2$. (Hint: One approach to solve this is using recursion and analyzing it using the so-called "Master Theorem".) ■

Exercise 3.15 — Output at last layer. Prove that for every $f : \{0,1\}^n \to \{0,1\}$, if there is a Boolean circuit $C$ of $s$ gates that computes $f$ then there is a Boolean circuit $C'$ of at most $s$ gates such that in the minimal layering of $C'$, the output gate of $C'$ is placed in the last layer. (Hint: Vertices in layers beyond the output can be safely removed without changing the functionality of the circuit.) ■

3.8 BIBLIOGRAPHICAL NOTES

The excerpt from Al-Khwarizmi’s book is from “The Algebra of Ben- Musa”, Fredric Rosen, 1831. Charles Babbage (1791-1871) was a visionary scientist, mathemati- cian, and inventor (see [Swa02; CM00]). More than a century before the invention of modern electronic computers, Babbage realized that computation can be in principle mechanized. His first design for a was the difference engine that was designed to do polynomial interpolation. He then designed the analytical engine which was a much more general machine and the first prototype for a pro- grammable general purpose computer. Unfortunately, Babbage was never able to complete the design of his prototypes. One of the earliest people to realize the engine’s potential and far reaching implications was (see the notes for Chapter 7). Boolean algebra was first investigated by Boole and DeMorgan in the 1840’s [Boo47; De 47]. The definition of Boolean circuits and defining computation 157

connection to electrical relay circuits was given in Shannon’s Masters Thesis [Sha38]. (Howard Gardener called Shannon’s thesis “possibly the most important, and also the most famous, master’s thesis of the [20th] century”.) Savage’s book [Sav98], like this one, introduces the theory of computation starting with Boolean circuits as the first model. Jukna’s book [Juk12] contains a modern in-depth exposition of Boolean circuits, see also [Weg87]. The NAND function was shown to be universal by Sheffer [She13], though this also appears in the earlier work of Peirce, see [Bur78]. Whitehead and Russell used NAND as the basis for their logic in their magnum opus Principia Mathematica [WR12]. In her Ph.D thesis, Ernst [Ern09] investigates empirically the minimal NAND circuits for various functions. Nisan and Shocken’s book [NS05] builds a computing system starting from NAND gates and ending with high level programs and games (“NAND to Tetris”); see also the website nandtotetris.org.

Learning Objectives:
• Get comfortable with syntactic sugar or automatic translation of higher level logic to low level gates.
• Learn proof of major result: every finite function can be computed by a Boolean circuit.
• Start thinking quantitatively about number of lines required for computation.

4 Syntactic sugar, and computing every function

“[In 1951] I had a running compiler and nobody would touch it because, they carefully told me, computers could only do arithmetic; they could not do programs.”, Grace Murray Hopper, 1986.

“Syntactic sugar causes cancer of the semicolon.”, Alan Perlis, 1982.

The computational models we considered thus far are as "bare bones" as they come. For example, our NAND-CIRC "programming language" has only the single operation foo = NAND(bar,blah). In this chapter we will see that these simple models are actually equivalent to more sophisticated ones. The key observation is that we can implement more complex features using our basic building blocks, and then use these new features themselves as building blocks for even more sophisticated features. This is known as "syntactic sugar" in the field of programming language design since we are not modifying the underlying programming model itself, but rather we merely implement new features by syntactically transforming a program that uses such features into one that doesn't.

This chapter provides a "toolkit" that can be used to show that many functions can be computed by NAND-CIRC programs, and hence also by Boolean circuits. We will also use this toolkit to prove a fundamental theorem: every finite function $f : \{0,1\}^n \to \{0,1\}^m$ can be computed by a Boolean circuit, see Theorem 4.13 below. While the syntactic sugar toolkit is important in its own right, Theorem 4.13 can also be proven directly without using this toolkit. We present this alternative proof in Section 4.5. See Fig. 4.1 for an outline of the results of this chapter.

This chapter: A non-mathy overview
In this chapter we will see our first major result: every finite function can be computed by some Boolean circuit (see Theorem 4.13 and Big Idea 5).


Figure 4.1: An outline of the results of this chapter. In Section 4.1 we give a toolkit of "syntactic sugar" transformations showing how to implement features such as programmer-defined functions and conditional statements in NAND-CIRC. We use these tools in Section 4.3 to give a NAND-CIRC program (or alternatively a Boolean circuit) to compute the LOOKUP function. We then build on this result to show in Section 4.4 that NAND-CIRC programs (or equivalently, Boolean circuits) can compute every finite function. An alternative direct proof of the same result is given in Section 4.5.

This is sometimes known as the "universality" of AND, OR, and NOT (and, using the equivalence of Chapter 3, of NAND as well). Despite being an important result, Theorem 4.13 is actually not that hard to prove. Section 4.5 presents a relatively simple direct proof of this result. However, in Section 4.1 and Section 4.3 we derive this result using the concept of "syntactic sugar" (see Big Idea 4). This is an important concept for programming languages theory and practice. The idea behind "syntactic sugar" is that we can extend a programming language by implementing advanced features from its basic components. For example, we can take the AON-CIRC and NAND-CIRC programming languages we saw in Chapter 3, and extend them to achieve features such as user-defined functions (e.g., def Foo(...)), conditional statements (e.g., if blah ...), and more. Once we have these features, it is not that hard to show that we can take the "truth table" (table of all inputs and outputs) of any function, and use that to create an AON-CIRC or NAND-CIRC program that maps each input to its corresponding output.

We will also get our first glimpse of quantitative measures in this chapter. While Theorem 4.13 tells us that every function can be computed by some circuit, the number of gates in this circuit can be exponentially large. (We are not using here "exponentially" as some colloquial term for "very very big" but in a very precise mathematical sense, which also happens to coincide with being very very big.) It turns out that some functions (for example, integer addition and multiplication) can in fact be computed using far fewer gates. We will explore this issue of "gate complexity" more deeply in Chapter 5 and following chapters.

4.1 SOME EXAMPLES OF SYNTACTIC SUGAR

We now present some examples of "syntactic sugar" transformations that we can use in constructing straight-line programs or circuits. We focus on the straight-line programming language view of our computational models, and specifically (for the sake of concreteness) on the NAND-CIRC programming language. This is convenient because many of the syntactic sugar transformations we present are easiest to think about in terms of applying "search and replace" operations to the source code of a program. However, by Theorem 3.19, all of our results hold equally well for circuits, whether ones using NAND gates or Boolean circuits that use the AND, OR, and NOT operations. Enumerating the examples of such syntactic sugar transformations can be a little tedious, but we do it for two reasons:

1. To convince you that despite their seeming simplicity and limita- tions, simple models such as Boolean circuits or the NAND-CIRC programming language are actually quite powerful.

2. So you can realize how lucky you are to be taking a theory of com- putation course and not a compilers course… :)

4.1.1 User-defined procedures
One staple of almost any programming language is the ability to define and then execute procedures or subroutines. (These are often known as functions in some programming languages, but we prefer the name procedures to avoid confusion with the function that a program computes.) The NAND-CIRC programming language does not have this mechanism built in. However, we can achieve the same effect using the time honored technique of "copy and paste". Specifically, we can replace code which defines a procedure such as

def Proc(a,b):
    proc_code
    return c
some_code
f = Proc(d,e)
some_more_code

with the following code where we "paste" the code of Proc

some_code
proc_code'
some_more_code

and where proc_code' is obtained by replacing all occurrences of a with d, b with e, and c with f. When doing that we will need to ensure that all other variables appearing in proc_code' don’t interfere with other variables. We can always do so by renaming variables to new names that were not used before. The above reasoning leads to the proof of the following theorem:

Theorem 4.1 — Procedure definition syntactic sugar. Let NAND-CIRC-PROC be the programming language NAND-CIRC augmented with the syntax above for defining procedures. Then for every NAND-CIRC-PROC program $P$, there exists a standard (i.e., "sugar free") NAND-CIRC program $P'$ that computes the same function as $P$.

R Remark 4.2 — No recursive procedures. NAND-CIRC-PROC only allows non-recursive procedures. In particular, the code of a procedure Proc cannot call Proc but only use procedures that were defined before it. Without this restriction, the above "search and replace" procedure might never terminate and Theorem 4.1 would not be true.

Theorem 4.1 can be proven using the transformation above, but since the formal proof is somewhat long and tedious, we omit it here.

■ Example 4.3 — Computing Majority from NAND using syntactic sugar. Procedures allow us to express NAND-CIRC programs much more cleanly and succinctly. For example, because we can compute AND, OR, and NOT using NANDs, we can compute the Majority function as follows:

def NOT(a): return NAND(a,a)

def AND(a,b):
    temp = NAND(a,b)
    return NOT(temp)

def OR(a,b):
    temp1 = NOT(a)
    temp2 = NOT(b)
    return NAND(temp1,temp2)

def MAJ(a,b,c):
    and1 = AND(a,b)
    and2 = AND(a,c)
    and3 = AND(b,c)
    or1 = OR(and1,and2)
    return OR(or1,and3)

print(MAJ(0,1,1)) # 1

Fig. 4.2 presents the “sugar free” NAND-CIRC program (and the corresponding circuit) that is obtained by “expanding out” this program, replacing the calls to procedures with their definitions.

 Big Idea 4 Once we show that a computational model $X$ is equivalent to a model that has feature $Y$, we can assume we have $Y$ when showing that a function $f$ is computable by $X$.

Figure 4.2: A standard (i.e., "sugar free") NAND-CIRC program that is obtained by expanding out the procedure definitions in the program for Majority of Example 4.3. The corresponding circuit is on the right. Note that this is not the most efficient NAND circuit/program for majority: we can save on some gates by "short cutting" steps where a gate $u$ computes NAND$(v,v)$ and then a gate $w$ computes NAND$(u,u)$ (as indicated by the dashed green arrows in the above figure).

R Remark 4.4 — Counting lines. While we can use syntactic sugar to present NAND-CIRC programs in more readable ways, we did not change the definition of the language itself. Therefore, whenever we say that some function $f$ has an $s$-line NAND-CIRC program we mean a standard "sugar free" NAND-CIRC program, where all syntactic sugar has been expanded out. For example, the program of Example 4.3 is a $12$-line program for computing the MAJ function, even though it can be written in fewer lines using NAND-CIRC-PROC.

4.1.2 Proof by Python (optional)
We can write a Python program that implements the proof of Theorem 4.1. This is a Python program that takes a NAND-CIRC-PROC program $P$ that includes procedure definitions and uses simple "search and replace" to transform $P$ into a standard (i.e., "sugar free") NAND-CIRC program $P'$ that computes the same function as $P$ without using any procedures. The idea is simple: if the program $P$ contains a definition of a procedure Proc of two arguments x and y, then whenever we see a line of the form foo = Proc(bar,blah), we can replace this line by:

1. The body of the procedure Proc (replacing all occurrences of x and y with bar and blah respectively).

2. A line foo = exp, where exp is the expression following the return statement in the definition of the procedure Proc.

To make this more robust we add a prefix to the internal variables used by Proc to ensure they don't conflict with the variables of $P$; for simplicity we ignore this issue in the code below though it can be easily added. The code of the Python function desugar below achieves such a transformation.

Fig. 4.2 shows the result of applying desugar to the program of Example 4.3 that uses syntactic sugar to compute the Majority function. Specifically, we first apply desugar to remove usage of the OR function, then apply it to remove usage of the AND function, and finally apply it a third time to remove usage of the NOT function.

R Remark 4.5 — Parsing function definitions (optional). The function desugar in Fig. 4.3 assumes that it is given the procedure already split up into its name, argu- ments, and body. It is not crucial for our purposes to describe precisely how to scan a definition and split it up into these components, but in case you are curious, it can be achieved in Python via the following code:

import re

def parse_func(code):
    """Parse a function definition into name, arguments and body"""
    lines = [l.strip() for l in code.split('\n')]
    regexp = r'def\s+([a-zA-Z\_0-9]+)\(([\sa-zA-Z0-9\_,]+)\)\s*:\s*'
    m = re.match(regexp, lines[0])
    return m.group(1), m.group(2).split(','), '\n'.join(lines[1:])

Figure 4.3: Python code for transforming NAND-CIRC-PROC programs into standard sugar free NAND-CIRC programs.

def desugar(code, func_name, func_args, func_body):
    """
    Replaces all occurrences of
        foo = func_name(func_args)
    with
        func_body[x->a,y->b]
        foo = [result returned in func_body]
    """
    # Uses Python regular expressions to simplify the search and replace,
    # see https://docs.python.org/3/library/re.html and Chapter 9 of the book

    # for capturing a list of variable names separated by commas
    arglist = ",".join([r"([a-zA-Z0-9\_\[\]]+)" for i in range(len(func_args))])
    # regular expression for capturing a statement of the form
    # "variable = func_name(arguments)"
    regexp = fr'([a-zA-Z0-9\_\[\]]+)\s*=\s*{func_name}\({arglist}\)\s*$'
    while True:
        m = re.search(regexp, code, re.MULTILINE)
        if not m:
            break
        newcode = func_body
        # replace function arguments by the variables from the function invocation
        for i in range(len(func_args)):
            newcode = newcode.replace(func_args[i], m.group(i+2))
        # Splice the new code inside
        newcode = newcode.replace('return', m.group(1) + " = ")
        code = code[:m.start()] + newcode + code[m.end()+1:]
    return code


4.1.3 Conditional statements
Another sorely missing feature in NAND-CIRC is a conditional statement such as the if/then constructs that are found in many programming languages. However, using procedures, we can obtain an ersatz if/then construct. First we can compute the function $\mathrm{IF} : \{0,1\}^3 \to \{0,1\}$ such that $\mathrm{IF}(a,b,c)$ equals $b$ if $a = 1$ and $c$ if $a = 0$.

The IF function can be implemented from NANDs as follows (see Exercise 4.2):

def IF(cond,a,b):
    notcond = NAND(cond,cond)
    temp = NAND(b,notcond)
    temp1 = NAND(a,cond)
    return NAND(temp,temp1)

The IF function is also known as a multiplexing function, since $cond$ can be thought of as a switch that controls whether the output is connected to $a$ or $b$. Once we have a procedure for computing the IF function, we can implement conditionals in NAND. The idea is that we replace code of the form

if (condition):
    assign blah to variable foo

with code of the form

foo = IF(condition, blah, foo)

that assigns to foo its old value when condition equals $0$, and assigns to foo the value of blah otherwise. More generally we can replace code of the form

if (cond):
    a = ...
    b = ...
    c = ...

with code of the form

temp_a = ...
temp_b = ...
temp_c = ...
a = IF(cond,temp_a,a)
b = IF(cond,temp_b,b)
c = IF(cond,temp_c,c)

Using such transformations, we can prove the following theorem. Once again we omit the (not too insightful) full formal proof, though see Section 4.1.2 for some hints on how to obtain it.

Theorem 4.6 — Conditional statements syntactic sugar. Let NAND-CIRC-IF be the programming language NAND-CIRC augmented with if/then/else statements for allowing code to be conditionally executed based on whether a variable is equal to $0$ or $1$. Then for every NAND-CIRC-IF program $P$, there exists a standard (i.e., "sugar free") NAND-CIRC program $P'$ that computes the same function as $P$.

4.2 EXTENDED EXAMPLE: ADDITION AND MULTIPLICATION (OPTIONAL)

Using “syntactic sugar”, we can write the integer addition function as follows:

# Add two n-bit integers
# Use LSB first notation for simplicity
def ADD(A,B):
    Result = [0]*(n+1)
    Carry = [0]*(n+1)
    Carry[0] = zero(A[0])
    for i in range(n):
        Result[i] = XOR(Carry[i],XOR(A[i],B[i]))
        Carry[i+1] = MAJ(Carry[i],A[i],B[i])
    Result[n] = Carry[n]
    return Result

ADD([1,1,1,0,0],[1,0,0,0,0]) # [0, 0, 0, 1, 0, 0]

where zero is the constant zero function, and MAJ and XOR correspond to the majority and XOR functions respectively. While we use Python syntax for convenience, in this example $n$ is some fixed integer and so for every such $n$, $\mathrm{ADD}_n$ is a finite function that takes as input $2n$ bits and outputs $n+1$ bits. In particular for every $n$ we can remove the loop construct for i in range(n) by simply repeating the code $n$ times, replacing the value of i with $0, 1, 2, \ldots, n-1$. By expanding out all the features, for every value of $n$ we can translate the above program into a standard ("sugar free") NAND-CIRC program. Fig. 4.4 depicts what we get for $n = 2$.

Figure 4.4: The NAND-CIRC program and corresponding NAND circuit for adding two-digit binary numbers that are obtained by "expanding out" all the syntactic sugar. The program/circuit has 43 lines/gates which is by no means necessary. It is possible to add $n$-bit numbers using $9n$ NAND gates, see Exercise 4.5.

By going through the above program carefully and accounting for the number of gates, we can see that it yields a proof of the following theorem (see also Fig. 4.5):

Theorem 4.7 — Addition using NAND-CIRC programs. For every $n \in \mathbb{N}$, let $\mathrm{ADD}_n : \{0,1\}^{2n} \to \{0,1\}^{n+1}$ be the function that, given $x, x' \in \{0,1\}^n$ computes the representation of the sum of the numbers that $x$ and $x'$ represent. Then there is a constant $c \leq 30$ such that for every $n$ there is a NAND-CIRC program of at most $cn$ lines computing $\mathrm{ADD}_n$. (The value of $c$ can be improved to $9$, see Exercise 4.5.)

Once we have addition, we can use the grade-school algorithm to obtain multiplication as well, thus obtaining the following theorem:

Theorem 4.8 — Multiplication using NAND-CIRC programs. For every $n$, let $\mathrm{MULT}_n : \{0,1\}^{2n} \to \{0,1\}^{2n}$ be the function that, given $x, x' \in \{0,1\}^n$ computes the representation of the product of the numbers that $x$ and $x'$ represent. Then there is a constant $c$ such that for every $n$, there is a NAND-CIRC program of at most $cn^2$ lines that computes the function $\mathrm{MULT}_n$.

Figure 4.5: The number of lines in our NAND-CIRC program to add two $n$-bit numbers, as a function of $n$, for $n$'s between $1$ and $100$. This is not the most efficient program for this task, but the important point is that it has the form $O(n)$.

We omit the proof, though in Exercise 4.7 we ask you to supply a "constructive proof" in the form of a program (in your favorite programming language) that on input a number $n$, outputs the code of a NAND-CIRC program of at most $1000 n^2$ lines that computes the $\mathrm{MULT}_n$ function. In fact, we can use Karatsuba's algorithm to show that there is a NAND-CIRC program of $O(n^{\log_2 3})$ lines to compute $\mathrm{MULT}_n$ (and can get even further asymptotic improvements using better algorithms).

4.3 THE LOOKUP FUNCTION

The LOOKUP function will play an important role in this chapter and later. It is defined as follows:

Definition 4.9 — Lookup function. For every $k$, the lookup function of order $k$, $LOOKUP_k : \{0,1\}^{2^k + k} \rightarrow \{0,1\}$, is defined as follows: For every $x \in \{0,1\}^{2^k}$ and $i \in \{0,1\}^k$,

$$LOOKUP_k(x, i) = x_i \qquad (4.1)$$

where $x_i$ denotes the $i^{th}$ entry of $x$, using the binary representation to identify $i$ with a number in $\{0, \ldots, 2^k - 1\}$.

Figure 4.6: The $LOOKUP_k$ function takes an input in $\{0,1\}^{2^k + k}$, which we denote by $(x, i)$ (with $x \in \{0,1\}^{2^k}$ and $i \in \{0,1\}^k$). The output is $x_i$: the $i$-th coordinate of $x$, where we identify $i$ as a number in $[2^k]$ using the binary representation. In the above example $x \in \{0,1\}^{16}$ and $i \in \{0,1\}^4$. Since $i = 0110$ is the binary representation of the number $6$, the output of $LOOKUP_4(x, i)$ in this case is $x_6 = 1$.

See Fig. 4.6 for an illustration of the LOOKUP function. It turns out that for every $k$, we can compute $LOOKUP_k$ using a NAND-CIRC program:

Theorem 4.10 — Lookup function. For every $k > 0$, there is a NAND-CIRC program that computes the function $LOOKUP_k : \{0,1\}^{2^k + k} \rightarrow \{0,1\}$. Moreover, the number of lines in this program is at most $4 \cdot 2^k$.

An immediate corollary of Theorem 4.10 is that for every $k > 0$, $LOOKUP_k$ can be computed by a Boolean circuit (with AND, OR and NOT gates) of at most $8 \cdot 2^k$ gates.

4.3.1 Constructing a NAND-CIRC program for LOOKUP

We prove Theorem 4.10 by induction. For the case $k = 1$, $LOOKUP_1$ maps $(x_0, x_1, i) \in \{0,1\}^3$ to $x_i$. In other words, if $i = 0$ then it outputs $x_0$ and otherwise it outputs $x_1$, which (up to reordering variables) is the same as the IF function presented in Section 4.1.3, which can be computed by a 4-line NAND-CIRC program.

As a warm-up for the case of general $k$, let us consider the case of $k = 2$. Given input $x = (x_0, x_1, x_2, x_3)$ for $LOOKUP_2$ and an index $i = (i_0, i_1)$, if the most significant bit $i_0$ of the index is $0$ then $LOOKUP_2(x, i)$ will equal $x_0$ if $i_1 = 0$ and equal $x_1$ if $i_1 = 1$. Similarly, if the most significant bit $i_0$ is $1$ then $LOOKUP_2(x, i)$ will equal $x_2$ if $i_1 = 0$ and will equal $x_3$ if $i_1 = 1$. Another way to say this is that we can write $LOOKUP_2$ as follows:

def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    if i[0]==1:
        return LOOKUP1(X[2],X[3],i[1])
    else:
        return LOOKUP1(X[0],X[1],i[1])

or in other words,

def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
    a = LOOKUP1(X[2],X[3],i[1])
    b = LOOKUP1(X[0],X[1],i[1])
    return IF(i[0],a,b)

More generally, as shown in the following lemma, we can compute $LOOKUP_k$ using two invocations of $LOOKUP_{k-1}$ and one invocation of IF:

Lemma 4.11 — Lookup recursion. For every $k \geq 2$, $LOOKUP_k(x_0, \ldots, x_{2^k - 1}, i_0, \ldots, i_{k-1})$ is equal to

$$IF\left(i_0,\; LOOKUP_{k-1}(x_{2^{k-1}}, \ldots, x_{2^k - 1}, i_1, \ldots, i_{k-1}),\; LOOKUP_{k-1}(x_0, \ldots, x_{2^{k-1} - 1}, i_1, \ldots, i_{k-1})\right) \qquad (4.2)$$

Proof. If the most significant bit $i_0$ of $i$ is zero, then the index $i$ is in $\{0, \ldots, 2^{k-1} - 1\}$ and hence we can perform the lookup on the "first half" of $x$, and the result of $LOOKUP_k(x, i)$ will be the same as $a = LOOKUP_{k-1}(x_0, \ldots, x_{2^{k-1} - 1}, i_1, \ldots, i_{k-1})$. On the other hand, if this most significant bit $i_0$ is equal to $1$, then the index is in $\{2^{k-1}, \ldots, 2^k - 1\}$, in which case the result of $LOOKUP_k(x, i)$ is the same as $b = LOOKUP_{k-1}(x_{2^{k-1}}, \ldots, x_{2^k - 1}, i_1, \ldots, i_{k-1})$. Thus we can compute $LOOKUP_k(x, i)$ by first computing $a$ and $b$ and then outputting $IF(i_0, b, a)$. ∎

Proof of Theorem 4.10 from Lemma 4.11. Now that we have Lemma 4.11, we can complete the proof of Theorem 4.10. We will prove by induction on $k$ that there is a NAND-CIRC program of at most $4 \cdot (2^k - 1)$

lines for $LOOKUP_k$. For $k = 1$ this follows by the four line program for IF we've seen before. For $k > 1$, we use the following pseudocode:

a = LOOKUP_(k-1)(X[0],...,X[2^(k-1)-1],i[1],...,i[k-1])
b = LOOKUP_(k-1)(X[2^(k-1)],...,X[2^k-1],i[1],...,i[k-1])
return IF(i[0],b,a)

If we let $L(k)$ be the number of lines required for $LOOKUP_k$, then the above pseudocode shows that

$$L(k) \leq 2L(k-1) + 4 \;. \qquad (4.3)$$

Since under our induction hypothesis $L(k-1) \leq 4(2^{k-1} - 1)$, we get that $L(k) \leq 2 \cdot 4(2^{k-1} - 1) + 4 = 4(2^k - 1)$, which is what we wanted to prove. See Fig. 4.7 for a plot of the actual number of lines in our implementation of $LOOKUP_k$.
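To make the recursion of Lemma 4.11 concrete, here is a minimal Python sketch (ours, not the book's code) that evaluates $LOOKUP_k$ on lists of bits. IF follows the convention that IF(cond, yes, no) returns yes when cond is 1, and index bits are given most significant first.

def IF(cond, yes, no):
    # IF(cond, yes, no): returns yes if cond == 1, and no otherwise
    return yes if cond == 1 else no

def LOOKUP(X, i):
    # X is a list of 2**k bits; i is a list of k index bits (MSB first)
    if len(i) == 1:
        return IF(i[0], X[1], X[0])
    half = len(X) // 2
    a = LOOKUP(X[:half], i[1:])  # lookup in the first half
    b = LOOKUP(X[half:], i[1:])  # lookup in the second half
    return IF(i[0], b, a)

# Example (with an arbitrary x): i = 0110 is the number 6, so we get x_6
x = [0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,1]
print(LOOKUP(x, [0,1,1,0]))  # x[6] == 1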

4.4 COMPUTING EVERY FUNCTION

At this point we know the following facts about NAND-CIRC programs (and so equivalently about Boolean circuits and our other equivalent models):

1. They can compute at least some non-trivial functions.

2. Coming up with NAND-CIRC programs for various functions is a very tedious task.

Figure 4.7: The number of lines in our implementation of the LOOKUP_k function as a function of $k$ (i.e., the length of the index). The number of lines in our implementation is roughly $3 \cdot 2^k$.

Thus I would not blame the reader if they were not particularly looking forward to a long sequence of examples of functions that can be computed by NAND-CIRC programs. However, it turns out we are not going to need this, as we can show in one fell swoop that NAND-CIRC programs can compute every finite function:

Theorem 4.12 — Universality of NAND. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a NAND-CIRC program with at most $c \cdot m 2^n$ lines that computes the function $f$.

By Theorem 3.19, the models of NAND circuits, NAND-CIRC programs, AON-CIRC programs, and Boolean circuits are all equivalent to one another, and hence Theorem 4.12 holds for all these models. In particular, the following theorem is equivalent to Theorem 4.12:

Theorem 4.13 — Universality of Boolean circuits. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m 2^n$ gates that computes the function $f$.

Big Idea 5: Every finite function can be computed by a large enough Boolean circuit.

Improved bounds. Though it will not be of great importance to us, it is possible to improve on the proof of Theorem 4.12 and shave an extra factor of $n$, as well as optimize the constant $c$, and so prove that for every $\epsilon > 0$, $m \in \mathbb{N}$ and sufficiently large $n$, if $f : \{0,1\}^n \rightarrow \{0,1\}^m$ then $f$ can be computed by a NAND circuit of at most $(1 + \epsilon) \frac{m \cdot 2^n}{n}$ gates. The proof of this result is beyond the scope of this book, but we do discuss how to obtain a bound of the form $O(\frac{m \cdot 2^n}{n})$ in Section 4.4.2; see also the biographical notes.

4.4.1 Proof of NAND's Universality

To prove Theorem 4.12, we need to give a NAND circuit, or equivalently a NAND-CIRC program, for every possible function. We will restrict our attention to the case of Boolean functions (i.e., $m = 1$). Exercise 4.9 asks you to extend the proof for all values of $m$. A function $F : \{0,1\}^n \rightarrow \{0,1\}$ can be specified by a table of its values for each one of the $2^n$ inputs. For example, the table below describes one particular function $G : \{0,1\}^4 \rightarrow \{0,1\}$. (In case you are curious, this is the function that on input $i \in \{0,1\}^4$, which we interpret as a number in $[16]$, outputs the $i$-th digit of $\pi$ in the binary basis.)

Table 4.1: An example of a function $G : \{0,1\}^4 \rightarrow \{0,1\}$.

Input ($x$)   Output ($G(x)$)
0000          1
0001          1
0010          0
0011          0
0100          1
0101          0
0110          0
0111          1
1000          0
1001          0
1010          0
1011          0
1100          1
1101          1
1110          1
1111          1

For every $x \in \{0,1\}^4$, $G(x) = LOOKUP_4(1100100100001111, x)$, and so the following is NAND-CIRC "pseudocode" to compute $G$ using syntactic sugar for the LOOKUP_4 procedure.

G0000 = 1
G1000 = 1
G0100 = 0
...
G0111 = 1
G1111 = 1
Y[0] = LOOKUP_4(G0000,G1000,...,G1111, X[0],X[1],X[2],X[3])

We can translate this pseudocode into an actual NAND-CIRC program by adding three lines to define variables zero and one that are initialized to $0$ and $1$ respectively, and then replacing a statement such as Gxxx = 0 with Gxxx = NAND(one,one) and a statement such as Gxxx = 1 with Gxxx = NAND(zero,zero). The call to LOOKUP_4 will be replaced by the NAND-CIRC program that computes $LOOKUP_4$, plugging in the appropriate inputs.

There was nothing about the above reasoning that was particular to the function $G$ above. Given every function $F : \{0,1\}^n \rightarrow \{0,1\}$, we can write a NAND-CIRC program that does the following:

1. Initialize $2^n$ variables of the form F00...0 till F11...1 so that for every $z \in \{0,1\}^n$, the variable corresponding to $z$ is assigned the value $F(z)$.

2. Compute $LOOKUP_n$ on the $2^n$ variables initialized in the previous step, with the index variable being the input variables X[$0$],…,X[$n-1$]. That is, just like in the pseudocode for G above, we use Y[0] = LOOKUP(F00..00,...,F11..1,X[0],..,X[n-1]).

The total number of lines in the resulting program is $3 + 2^n$ lines for initializing the variables plus the $4 \cdot 2^n$ lines that we pay for computing $LOOKUP_n$. This completes the proof of Theorem 4.12.

Remark 4.14 — Result in perspective. While Theorem 4.12 seems striking at first, in retrospect, it is perhaps not that surprising that every finite function can be computed with a NAND-CIRC program. After all, a finite function $F : \{0,1\}^n \rightarrow \{0,1\}^m$ can be represented by simply the list of its outputs for each one of the $2^n$ input values. So it makes sense that we could write a NAND-CIRC program of similar size to compute it. What is more interesting is that some functions, such as addition and multiplication, have a much more efficient representation: one that only requires $O(n^2)$ or even fewer lines.
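To see the two steps of this construction in action, here is a small Python illustration of ours (not the book's code): the "variables" of step 1 are simply the entries of the function's truth table, and step 2 selects among them with the recursive LOOKUP of Section 4.3.1.

def IF(cond, yes, no):
    return yes if cond == 1 else no

def LOOKUP(T, i):
    # Recursive LOOKUP following Lemma 4.11 (index bits MSB first)
    if len(i) == 1:
        return IF(i[0], T[1], T[0])
    half = len(T) // 2
    return IF(i[0], LOOKUP(T[half:], i[1:]), LOOKUP(T[:half], i[1:]))

def eval_by_table(table, x):
    # Step 1: "initialize" one value per input: the truth table itself.
    # Step 2: select the entry at position x via LOOKUP.
    return LOOKUP(table, x)

# The function G of Table 4.1: the first 16 binary digits of pi
G_table = [1,1,0,0,1,0,0,1,0,0,0,0,1,1,1,1]
print(eval_by_table(G_table, [0,1,1,0]))  # G(0110) = 0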

4.4.2 Improving by a factor of $n$ (optional)

By being a little more careful, we can improve the bound of Theorem 4.12 and show that every function $F : \{0,1\}^n \rightarrow \{0,1\}^m$ can be computed by a NAND-CIRC program of at most $O(m 2^n / n)$ lines. In other words, we can prove the following improved version:

Theorem 4.15 — Universality of NAND circuits, improved bound. There exists a constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a NAND-CIRC program with at most $c \cdot m 2^n / n$ lines that computes the function $f$. (The constant $c$ in this theorem is at most $10$ and in fact can be arbitrarily close to $1$, see Section 4.8.)

Proof. As before, it is enough to prove the case that $m = 1$. Hence we let $f : \{0,1\}^n \rightarrow \{0,1\}$, and our goal is to prove that there exists a NAND-CIRC program of $O(2^n / n)$ lines (or equivalently a Boolean circuit of $O(2^n / n)$ gates) that computes $f$. We let $k = \log(n - 2\log n)$ (the reasoning behind this choice will become clear later on). We define the function $g : \{0,1\}^k \rightarrow \{0,1\}^{2^{n-k}}$ as follows:

$$g(a) = f(a 0^{n-k}) f(a 0^{n-k-1} 1) \cdots f(a 1^{n-k}). \qquad (4.4)$$

In other words, if we use the usual binary representation to identify the numbers $\{0, \ldots, 2^{n-k} - 1\}$ with the strings $\{0,1\}^{n-k}$, then for every $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$,

$$g(a)_b = f(ab). \qquad (4.5)$$

(4.5) means that for every $x \in \{0,1\}^n$, if we write $x = ab$ with $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$, then we can compute $f(x)$ by first computing the string $T = g(a)$ of length $2^{n-k}$, and then computing $LOOKUP_{n-k}(T, b)$ to retrieve the element of $T$ at the position corresponding to $b$ (see Fig. 4.8). The cost to compute the $LOOKUP_{n-k}$ is $O(2^{n-k})$ lines/gates, and so the cost in NAND-CIRC lines (or Boolean gates) to compute $f$ is at most

$$cost(g) + O(2^{n-k}), \qquad (4.6)$$

where $cost(g)$ is the number of operations (i.e., lines of NAND-CIRC programs or gates in a circuit) needed to compute $g$.

Figure 4.8: We can compute $f : \{0,1\}^n \rightarrow \{0,1\}$ on input $x = ab$ where $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$ by first computing the $2^{n-k}$ long string $g(a)$ that corresponds to all $f$'s values on inputs that begin with $a$, and then outputting the $b$-th coordinate of this string.

To complete the proof we need to give a bound on $cost(g)$. Since $g$ is a function mapping $\{0,1\}^k$ to $\{0,1\}^{2^{n-k}}$, we can also think of it as a collection of $2^{n-k}$ functions $g_0, \ldots, g_{2^{n-k}-1} : \{0,1\}^k \rightarrow \{0,1\}$, where $g_i(a) = g(a)_i$ for every $a \in \{0,1\}^k$ and $i \in [2^{n-k}]$. (That is, $g_i(a)$ is the $i$-th bit of $g(a)$.) Naively, we could use Theorem 4.12 to compute each $g_i$ in $O(2^k)$ lines, but then the total cost is $O(2^{n-k} \cdot 2^k) = O(2^n)$, which does not save us anything. However, the crucial observation is that there are only $2^{2^k}$ distinct functions mapping $\{0,1\}^k$ to $\{0,1\}$. For example, if $g_{17}$ is an identical function to $g_{67}$, that means that if we already computed $g_{17}(a)$ then we can compute $g_{67}(a)$ using only a constant number of operations: simply copy the same value! In general, if you have a collection of $N$ functions $g_0, \ldots, g_{N-1}$ mapping $\{0,1\}^k$ to $\{0,1\}$, of which at most $S$ are distinct, then for every value $a \in \{0,1\}^k$ we can compute the $N$ values $g_0(a), \ldots, g_{N-1}(a)$ using at most $O(S \cdot 2^k + N)$ operations (see Fig. 4.9).

Figure 4.9: If $g_0, \ldots, g_{N-1}$ is a collection of functions each mapping $\{0,1\}^k$ to $\{0,1\}$ such that at most $S$ of them are distinct, then for every $a \in \{0,1\}^k$, we can compute all the values $g_0(a), \ldots, g_{N-1}(a)$ using at most $O(S \cdot 2^k + N)$ operations by first computing the distinct functions and then copying the resulting values.

In our case, because there are at most $2^{2^k}$ distinct functions mapping $\{0,1\}^k$ to $\{0,1\}$, we can compute the function $g$ (and hence by (4.5) also $f$) using at most

$$O(2^{2^k} \cdot 2^k + 2^{n-k}) \qquad (4.7)$$

operations. Now all that is left is to plug into (4.7) our choice of $k = \log(n - 2\log n)$. By definition, $2^k = n - 2\log n$, which means that (4.7) can be bounded by

$$O\left(2^{n - 2\log n} \cdot (n - 2\log n) + 2^{n - \log(n - 2\log n)}\right) \leq \qquad (4.8)$$

$$O\left(\frac{2^n}{n^2} \cdot n + \frac{2^n}{n - 2\log n}\right) \leq O\left(\frac{2^n}{n} + \frac{2^n}{0.5 n}\right) = O\left(\frac{2^n}{n}\right), \qquad (4.9)$$

which is what we wanted to prove. (We used above the fact that $n - 2\log n \geq 0.5 n$ for sufficiently large $n$.) ∎

Using the connection between NAND-CIRC programs and Boolean circuits, an immediate corollary of Theorem 4.15 is the following improvement to Theorem 4.13:

Theorem 4.16 — Universality of Boolean circuits, improved bound. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m 2^n / n$ gates that computes the function $f$.

4.5 COMPUTING EVERY FUNCTION: AN ALTERNATIVE PROOF

Theorem 4.13 is a fundamental result in the theory (and practice!) of computation. In this section we present an alternative proof of this basic fact that Boolean circuits can compute every finite function. This alternative proof gives a somewhat worse quantitative bound on the number of gates but it has the advantage of being simpler, working 176 introduction to theoretical computer science

directly with circuits and avoiding the usage of all the syntactic sugar machinery. (However, that machinery is useful in its own right, and will find other applications later on.)

Theorem 4.17 — Universality of Boolean circuits (alternative phrasing). There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m \cdot n 2^n$ gates that computes the function $f$.

Figure 4.10: Given a function $f : \{0,1\}^n \rightarrow \{0,1\}$, we let $\{x_0, x_1, \ldots, x_{N-1}\} \subseteq \{0,1\}^n$ be the set of inputs such that $f(x_i) = 1$, and note that $N \leq 2^n$. We can express $f$ as the OR of $\delta_{x_i}$ for $i \in [N]$, where the function $\delta_\alpha : \{0,1\}^n \rightarrow \{0,1\}$ (for $\alpha \in \{0,1\}^n$) is defined as follows: $\delta_\alpha(x) = 1$ iff $x = \alpha$. We can compute the OR of $N$ values using $N$ two-input OR gates. Therefore if we have a circuit of size $O(n)$ to compute $\delta_\alpha$ for every $\alpha \in \{0,1\}^n$, we can compute $f$ using a circuit of size $O(n \cdot N) = O(n \cdot 2^n)$.

Proof Idea: The idea of the proof is illustrated in Fig. 4.10. As before, it is enough to focus on the case that $m = 1$ (the function $f$ has a single output), since we can always extend this to the case of $m > 1$ by looking at the composition of $m$ circuits each computing a different output bit of the function $f$. We start by showing that for every $\alpha \in \{0,1\}^n$, there is an $O(n)$ sized circuit that computes the function $\delta_\alpha : \{0,1\}^n \rightarrow \{0,1\}$ defined as follows: $\delta_\alpha(x) = 1$ iff $x = \alpha$ (that is, $\delta_\alpha$ outputs $0$ on all inputs except the input $\alpha$). We can then write any function $f : \{0,1\}^n \rightarrow \{0,1\}$ as the OR of at most $2^n$ functions $\delta_\alpha$ for the $\alpha$'s on which $f(\alpha) = 1$.

Proof of Theorem 4.17. We prove the theorem for the case $m = 1$. The result can be extended for $m > 1$ as before (see also Exercise 4.9). Let $f : \{0,1\}^n \rightarrow \{0,1\}$. We will prove that there is an $O(n \cdot 2^n)$-sized Boolean circuit to compute $f$ in the following steps:

1. We show that for every $\alpha \in \{0,1\}^n$, there is an $O(n)$ sized circuit that computes the function $\delta_\alpha : \{0,1\}^n \rightarrow \{0,1\}$, where $\delta_\alpha(x) = 1$ iff $x = \alpha$.

2. We then show that this implies the existence of an $O(n \cdot 2^n)$-sized circuit that computes $f$, by writing $f(x)$ as the OR of $\delta_\alpha(x)$ for all $\alpha \in \{0,1\}^n$ such that $f(\alpha) = 1$.

We start with Step 1:

CLAIM: For $\alpha \in \{0,1\}^n$, define $\delta_\alpha : \{0,1\}^n \rightarrow \{0,1\}$ as follows:

$$\delta_\alpha(x) = \begin{cases} 1 & x = \alpha \\ 0 & \text{otherwise} \end{cases} \qquad (4.10)$$

Then there is a Boolean circuit using at most $2n$ gates that computes $\delta_\alpha$.

PROOF OF CLAIM: The proof is illustrated in Fig. 4.11. As an example, consider the function $\delta_{011} : \{0,1\}^3 \rightarrow \{0,1\}$. This function outputs $1$ on $x$ if and only if $x_0 = 0$, $x_1 = 1$ and $x_2 = 1$, and so we can write $\delta_{011}(x) = \overline{x_0} \wedge x_1 \wedge x_2$, which translates into a Boolean circuit with one NOT gate and two AND gates. More generally, for every $\alpha \in \{0,1\}^n$, we can express $\delta_\alpha(x)$ as $(x_0 = \alpha_0) \wedge (x_1 = \alpha_1) \wedge \cdots \wedge (x_{n-1} = \alpha_{n-1})$, where if $\alpha_i = 0$ we replace $x_i = \alpha_i$ with $\overline{x_i}$ and if $\alpha_i = 1$ we replace $x_i = \alpha_i$ by simply $x_i$. This yields a circuit that computes $\delta_\alpha$ using $n$ AND gates and at most $n$ NOT gates, so a total of at most $2n$ gates.

Now for every function $f : \{0,1\}^n \rightarrow \{0,1\}$, we can write

$$f(x) = \delta_{x_0}(x) \vee \delta_{x_1}(x) \vee \cdots \vee \delta_{x_{N-1}}(x) \qquad (4.11)$$

where $S = \{x_0, \ldots, x_{N-1}\}$ is the set of inputs on which $f$ outputs $1$. (To see this, you can verify that the right-hand side of (4.11) evaluates to $1$ on $x \in \{0,1\}^n$ if and only if $x$ is in the set $S$.) Therefore we can compute $f$ using a Boolean circuit of at most $2n$ gates for each of the $N$ functions $\delta_{x_i}$ and combine that with at most $N$ OR gates, thus obtaining a circuit of at most $2n \cdot N + N$ gates. Since $S \subseteq \{0,1\}^n$, its size $N$ is at most $2^n$ and hence the total number of gates in this circuit is $O(n \cdot 2^n)$. ∎
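The claim and (4.11) translate directly into code. The following Python sketch (ours; it evaluates the circuit as a function rather than building actual gates) constructs $f$ as the OR of the functions $\delta_\alpha$ over the set $S$:

from itertools import product

def delta(alpha):
    # delta_alpha: the AND of n literals, x_i if alpha_i == 1 and
    # NOT(x_i) if alpha_i == 0 (at most n AND gates and n NOT gates)
    def d(x):
        return int(all(xi == ai for xi, ai in zip(x, alpha)))
    return d

def circuit_for(f, n):
    # Express f as the OR of delta_alpha over all alpha with f(alpha) = 1
    S = [alpha for alpha in product([0, 1], repeat=n) if f(alpha) == 1]
    deltas = [delta(alpha) for alpha in S]
    def F(x):
        return int(any(d(x) for d in deltas))
    return F

# Sanity check on the XOR of two bits
xor2 = lambda x: x[0] ^ x[1]
F = circuit_for(xor2, 2)
assert all(F(x) == xor2(x) for x in product([0, 1], repeat=2))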

4.6 THE CLASS SIZE(T)

Figure 4.11: For every string $\alpha \in \{0,1\}^n$, there is a Boolean circuit of $O(n)$ gates to compute the function $\delta_\alpha : \{0,1\}^n \rightarrow \{0,1\}$ such that $\delta_\alpha(x) = 1$ if and only if $x = \alpha$. The circuit is very simple. Given input $x_0, \ldots, x_{n-1}$ we compute the AND of $z_0, \ldots, z_{n-1}$ where $z_i = x_i$ if $\alpha_i = 1$ and $z_i = NOT(x_i)$ if $\alpha_i = 0$. While formally Boolean circuits only have a gate for computing the AND of two inputs, we can implement an AND of $n$ inputs by composing $n$ two-input ANDs.

We have seen that every function $f : \{0,1\}^n \rightarrow \{0,1\}^m$ can be computed by a circuit of size $O(m \cdot 2^n)$, and some functions (such as addition and multiplication) can be computed by much smaller circuits. We define SIZE$(s)$ to be the set of functions that can be computed by NAND circuits of at most $s$ gates (or equivalently, by NAND-CIRC programs of at most $s$ lines). Formally, the definition is as follows:

Definition 4.18 — Size class of functions. For every $n, m, s \in \mathbb{N}$ with $n, m \leq 2s$, we let $SIZE_{n,m}(s)$ denote the set of all functions $f : \{0,1\}^n \rightarrow \{0,1\}^m$ for which there exists a NAND circuit of at most $s$ gates that computes $f$. We denote by $SIZE_n(s)$ the set $SIZE_{n,1}(s)$. For every integer $s \geq 1$, we let $SIZE(s) = \cup_{n,m \leq 2s} SIZE_{n,m}(s)$. (The restriction that $m, n \leq 2s$ makes no difference; see Exercise 3.11.)

Fig. 4.12 depicts the set $SIZE_{n,1}(s)$. Note that $SIZE_{n,m}(s)$ is a set of functions, not of programs! (Asking if a program or a circuit is a member of $SIZE_{n,m}(s)$ is a category error, in the sense of Fig. 4.13.) As we discussed in Section 3.6.2 (and Section 2.6.1), the distinction between programs and functions is absolutely crucial. You should always remember that while a program computes a function, it is not equal to a function. In particular, as we've seen, there can be more than one program to compute the same function.

Figure 4.12: There are $2^{2^n}$ functions mapping $\{0,1\}^n$ to $\{0,1\}$, and an infinite number of circuits with $n$-bit inputs and a single bit of output. Every circuit computes one function, but every function can be computed by many circuits. We say that $f \in SIZE_{n,1}(s)$ if the smallest circuit that computes $f$ has $s$ or fewer gates. For example $XOR_n \in SIZE_{n,1}(4n)$. Theorem 4.12 shows that every function $g$ is computable by some circuit of at most $c \cdot 2^n / n$ gates, and hence $SIZE_{n,1}(c \cdot 2^n / n)$ corresponds to the set of all functions from $\{0,1\}^n$ to $\{0,1\}$.

While we defined SIZE$(s)$ with respect to NAND gates, we would get essentially the same class if we defined it with respect to AND/OR/NOT gates:

Lemma 4.19 Let $SIZE^{AON}_{n,m}(s)$ denote the set of all functions $f : \{0,1\}^n \rightarrow \{0,1\}^m$ that can be computed by an AND/OR/NOT Boolean circuit of at most $s$ gates. Then,

$$SIZE_{n,m}(s/2) \subseteq SIZE^{AON}_{n,m}(s) \subseteq SIZE_{n,m}(3s) \qquad (4.12)$$

Proof. If $f$ can be computed by a NAND circuit of at most $s/2$ gates, then by replacing each NAND with the two gates NOT and AND, we can obtain an AND/OR/NOT Boolean circuit of at most $s$ gates that computes $f$. On the other hand, if $f$ can be computed by a Boolean

AND/OR/NOT circuit of at most $s$ gates, then by Theorem 3.12 it can be computed by a NAND circuit of at most $3s$ gates. ∎

The results we have seen in this chapter can be phrased as showing that $ADD_n \in SIZE_{2n,n+1}(100n)$ and $MULT_n \in SIZE_{2n,2n}(10000 n^{\log_2 3})$. Theorem 4.12 shows that for some constant $c$, $SIZE_{n,m}(c m 2^n)$ is equal to the set of all functions from $\{0,1\}^n$ to $\{0,1\}^m$.

Figure 4.13: A "category error" is a question such as "is a cucumber even or odd?" which does not even make sense. In this book one type of category error you should watch out for is confusing functions and programs (i.e., confusing specifications and implementations). If $C$ is a circuit or program, then asking if $C \in SIZE_{n,1}(s)$ is a category error, since $SIZE_{n,1}(s)$ is a set of functions and not programs or circuits.

Remark 4.20 — Finite vs infinite functions. A NAND-CIRC program $P$ can only compute a function with a certain number $n$ of inputs and a certain number $m$ of outputs. Hence, for example, there is no single NAND-CIRC program that can compute the increment function $INC : \{0,1\}^* \rightarrow \{0,1\}^*$ that maps a string $x$ (which we identify with a number via the binary representation) to the string that represents $x + 1$. Rather for every $n > 0$, there is a NAND-CIRC program $P_n$ that computes the restriction $INC_n$ of the function $INC$ to inputs of length $n$. Since it can be shown that for every $n > 0$ such a program $P_n$ exists of length at most $10n$, $INC_n \in SIZE(10n)$ for every $n > 0$.

If $T : \mathbb{N} \rightarrow \mathbb{N}$ and $F : \{0,1\}^* \rightarrow \{0,1\}^*$, we will write $F \in SIZE(T(n))$ (or sometimes slightly abuse notation and write simply $F \in SIZE(T)$) to indicate that for every $n$ the restriction $F\!\restriction_n$ of $F$ to inputs in $\{0,1\}^n$ is in $SIZE(T(n))$. Hence we can write $INC \in SIZE(10n)$. We will come back to this issue of finite vs infinite functions later in this course.

Solved Exercise 4.1 — SIZE closed under complement. In this exercise we prove a certain "closure property" of the class $SIZE_n(s)$. That is, we show that if $f$ is in this class then (up to some small additive term) so is the complement of $f$, which is the function $g(x) = 1 - f(x)$. Prove that there is a constant $c$ such that for every $f : \{0,1\}^n \rightarrow \{0,1\}$ and $s \in \mathbb{N}$, if $f \in SIZE_n(s)$ then $1 - f \in SIZE_n(s + c)$.

Solution: If $f \in SIZE_n(s)$ then there is an $s$-line NAND-CIRC program $P$ that computes $f$. We can rename the variable Y[0] in $P$ to a variable temp and add the line

Y[0] = NAND(temp,temp)

at the very end to obtain a program $P'$ that computes $1 - f$. ∎

Chapter Recap

• We can define the notion of computing a function $F$ via a simplified "programming language", where computing a function $F$ in $T$ steps would correspond to having a $T$-line NAND-CIRC program that computes $F$.
• While the NAND-CIRC programming language only has one operation, other operations such as functions and conditional execution can be implemented using it.
• Every function $f : \{0,1\}^n \rightarrow \{0,1\}^m$ can be computed by a circuit of at most $O(m 2^n)$ gates (and in fact at most $O(m 2^n / n)$ gates).
• Sometimes (or maybe always?) we can translate an efficient algorithm to compute $f$ into a circuit that computes $f$ with a number of gates comparable to the number of steps in this algorithm.

4.7 EXERCISES

Exercise 4.1 — Pairing. This exercise asks you to give a one-to-one map from $\mathbb{N}^2$ to $\mathbb{N}$. This can be useful to implement two-dimensional arrays as "syntactic sugar" in programming languages that only have one-dimensional arrays.

1. Prove that the map $F(x, y) = 2^x 3^y$ is a one-to-one map from $\mathbb{N}^2$ to $\mathbb{N}$.

2. Show that there is a one-to-one map $F : \mathbb{N}^2 \rightarrow \mathbb{N}$ such that for every $x, y$, $F(x, y) \leq 100 \cdot \max\{x, y\}^2 + 100$.

3. For every $k$, show that there is a one-to-one map $F : \mathbb{N}^k \rightarrow \mathbb{N}$ such that for every $x_0, \ldots, x_{k-1} \in \mathbb{N}$, $F(x_0, \ldots, x_{k-1}) \leq 100 \cdot (x_0 + x_1 + \ldots + x_{k-1} + 100k)^k$.

Exercise 4.2 — Computing MUX. Prove that the NAND-CIRC program below computes the function MUX (or $LOOKUP_1$), where $MUX(a, b, c)$ equals $a$ if $c = 0$ and equals $b$ if $c = 1$:

t = NAND(X[2],X[2])
u = NAND(X[0],t)
v = NAND(X[1],X[2])
Y[0] = NAND(u,v)

Exercise 4.3 — At least two / Majority. Give a NAND-CIRC program of at most 6 lines to compute the function $MAJ : \{0,1\}^3 \rightarrow \{0,1\}$ where $MAJ(a, b, c) = 1$ iff $a + b + c \geq 2$.

Exercise 4.4 — Conditional statements. In this exercise we will explore Theorem 4.6: transforming NAND-CIRC-IF programs that use code such as if .. then .. else .. to standard NAND-CIRC programs.

1. Give a "proof by code" of Theorem 4.6: a program in a programming language of your choice that transforms a NAND-CIRC-IF program $P$ into a "sugar free" NAND-CIRC program $P'$ that computes the same function. See footnote for hint.

Hint: You can start by transforming $P$ into a NAND-CIRC-PROC program that uses procedure statements, and then use the code of Fig. 4.3 to transform the latter into a "sugar free" NAND-CIRC program.

2. Prove the following statement, which is the heart of Theorem 4.6: suppose that there exists an $s$-line NAND-CIRC program to compute $f : \{0,1\}^n \rightarrow \{0,1\}$ and an $s'$-line NAND-CIRC program to compute $g : \{0,1\}^n \rightarrow \{0,1\}$. Prove that there exists a NAND-CIRC program of at most $s + s' + 10$ lines to compute the function $h : \{0,1\}^{n+1} \rightarrow \{0,1\}$ where $h(x_0, \ldots, x_{n-1}, x_n)$ equals $f(x_0, \ldots, x_{n-1})$ if $x_n = 0$ and equals $g(x_0, \ldots, x_{n-1})$ otherwise. (All programs in this item are standard "sugar-free" NAND-CIRC programs.)

Exercise 4.5 — Half and full adders. 1. A half adder is the function $HA : \{0,1\}^2 \rightarrow \{0,1\}^2$ that corresponds to adding two binary bits. That is, for every $a, b \in \{0,1\}$, $HA(a, b) = (e, f)$ where $2e + f = a + b$. Prove that there is a NAND circuit of at most five NAND gates that computes HA.

2. A full adder is the function $FA : \{0,1\}^3 \rightarrow \{0,1\}^2$ that takes in two bits and a "carry" bit and outputs their sum. That is, for every $a, b, c \in \{0,1\}$, $FA(a, b, c) = (e, f)$ such that $2e + f = a + b + c$. Prove that there is a NAND circuit of at most nine NAND gates that computes FA.

3. Prove that if there is a NAND circuit of $c$ gates that computes FA, then there is a circuit of $cn$ gates that computes $ADD_n$, where (as in Theorem 4.7) $ADD_n : \{0,1\}^{2n} \rightarrow \{0,1\}^{n+1}$ is the function that outputs the addition of two input $n$-bit numbers. See footnote for hint.

Hint: Use a "cascade" of adding the bits one after the other, starting with the least significant digit, just like in the elementary-school algorithm.

4. Show that for every $n$ there is a NAND-CIRC program to compute $ADD_n$ with at most $9n$ lines.

Exercise 4.6 — Addition. Write a program using your favorite programming language that on input of an integer $n$, outputs a NAND-CIRC program that computes $ADD_n$. Can you ensure that the program it outputs for $ADD_n$ has fewer than $10n$ lines?

Exercise 4.7 — Multiplication. Write a program using your favorite programming language that on input of an integer $n$, outputs a NAND-CIRC program that computes $MULT_n$. Can you ensure that the program it outputs for $MULT_n$ has fewer than $1000 \cdot n^2$ lines?

Exercise 4.8 — Efficient multiplication (challenge). Write a program using your favorite programming language that on input of an integer $n$, outputs a NAND-CIRC program that computes $MULT_n$ and has at most $10000 n^{1.9}$ lines. (Hint: Use Karatsuba's algorithm.) What is the smallest number of lines you can use to multiply two 2048 bit numbers?

Exercise 4.9 — Multibit function. In the text Theorem 4.12 is only proven for the case $m = 1$. In this exercise you will extend the proof for every $m$. Prove that

1. If there is an $s$-line NAND-CIRC program to compute $f : \{0,1\}^n \rightarrow \{0,1\}$ and an $s'$-line NAND-CIRC program to compute $f' : \{0,1\}^n \rightarrow \{0,1\}$, then there is an $(s + s')$-line program to compute the function $g : \{0,1\}^n \rightarrow \{0,1\}^2$ such that $g(x) = (f(x), f'(x))$.

2. For every function $f : \{0,1\}^n \rightarrow \{0,1\}^m$, there is a NAND-CIRC program of at most $10m \cdot 2^n$ lines that computes $f$. (You can use the $m = 1$ case of Theorem 4.12, as well as Item 1.)

Exercise 4.10 — Simplifying using syntactic sugar. Let $P$ be the following NAND-CIRC program:

Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[1],X[1])
Temp[2] = NAND(Temp[0],Temp[1])
Temp[3] = NAND(X[2],X[2])
Temp[4] = NAND(X[3],X[3])
Temp[5] = NAND(Temp[3],Temp[4])
Temp[6] = NAND(Temp[2],Temp[2])
Temp[7] = NAND(Temp[5],Temp[5])
Y[0] = NAND(Temp[6],Temp[7])

1. Write a program $P'$ with at most three lines of code that uses both NAND as well as the syntactic sugar OR and that computes the same function as $P$.


2. Draw a circuit that computes the same function as $P$ and uses only AND and NOT gates.

In the following exercises you are asked to compare the power of pairs of programming languages. By "comparing the power" of two programming languages $X$ and $Y$ we mean determining the relation between the set of functions that are computable using programs in $X$ and $Y$ respectively. That is, to answer such a question you need to do both of the following:

1. Either prove that for every program $P$ in $X$ there is a program $P'$ in $Y$ that computes the same function as $P$, or give an example of a function that is computable by an $X$-program but not computable by a $Y$-program.

and

2. Either prove that for every program $P$ in $Y$ there is a program $P'$ in $X$ that computes the same function as $P$, or give an example of a function that is computable by a $Y$-program but not computable by an $X$-program.

When you give an example as above of a function that is computable in one programming language but not the other, you need to prove that the function you showed is (1) computable in the first programming language and (2) not computable in the second programming language.

Exercise 4.11 — Compare IF and NAND. Let IF-CIRC be the programming language where we have the following operations: foo = 0, foo = 1, foo = IF(cond,yes,no) (that is, we can use the constants $0$ and $1$, and the $IF : \{0,1\}^3 \rightarrow \{0,1\}$ function such that $IF(a, b, c)$ equals $b$ if $a = 1$ and equals $c$ if $a = 0$). Compare the power of the NAND-CIRC programming language and the IF-CIRC programming language.

Exercise 4.12 — Compare XOR and NAND. Let XOR-CIRC be the programming language where we have the following operations: foo = XOR(bar,blah), foo = 1 and bar = 0 (that is, we can use the constants $0$, $1$ and the XOR function that maps $a, b \in \{0,1\}$ to $a + b \mod 2$). Compare the power of the NAND-CIRC programming language and the XOR-CIRC programming language. See footnote for hint.

Hint: You can use the fact that $(a + b) + c \mod 2 = a + b + c \mod 2$. In particular it means that if you have the lines d = XOR(a,b) and e = XOR(d,c) then e gets the sum modulo $2$ of the variables a, b and c.

Exercise 4.13 — Circuits for majority. Prove that there is some constant $c$ such that for every $n > 1$, $MAJ_n \in SIZE_n(cn)$, where $MAJ_n : \{0,1\}^n \rightarrow \{0,1\}$

is the majority function on $n$ input bits. That is, $MAJ_n(x) = 1$ iff $\sum_{i=0}^{n-1} x_i > n/2$. See footnote for hint.

Hint: One approach to solve this is using recursion and the so-called Master Theorem.

Exercise 4.14 — Circuits for threshold. Prove that there is some constant $c$ such that for every $n > 1$, and integers $a_0, \ldots, a_{n-1}, b \in \{-2^n, -2^n + 1, \ldots, -1, 0, +1, \ldots, 2^n\}$, there is a NAND circuit with at most $n^c$ gates that computes the threshold function $f_{a_0, \ldots, a_{n-1}, b} : \{0,1\}^n \rightarrow \{0,1\}$ that on input $x \in \{0,1\}^n$ outputs $1$ if and only if $\sum_{i=0}^{n-1} a_i x_i > b$.

4.8 BIBLIOGRAPHICAL NOTES

See Jukna's and Wegener's books [Juk12; Weg87] for much more extensive discussion on circuits. Shannon showed that every Boolean function can be computed by a circuit of exponential size [Sha38]. The improved bound of $c \cdot 2^n / n$ (with the optimal value of $c$ for many bases) is due to Lupanov [Lup58]. An exposition of this for the case of NAND (where $c = 1$) is given in Chapter 4 of his book [Lup84]. (Thanks to Sasha Golovnev for tracking down this reference!)

The concept of "syntactic sugar" is also known as "macros" or "meta-programming" and is sometimes implemented via a preprocessor or macro language in a programming language or a text editor. One modern example is the Babel JavaScript syntax transformer, that converts JavaScript programs written using the latest features into a format that older browsers can accept. It even has a plug-in architecture, that allows users to add their own syntactic sugar to the language.

5 Code as data, data as code

Learning Objectives:
• See one of the most important concepts in computing: duality between code and data.
• Build up comfort in moving between different representations of programs.
• Follow the construction of a "universal circuit evaluator" that can evaluate other circuits given their representation.
• See a major result that complements the result of the last chapter: some functions require an exponential number of gates to compute.
• Discussion of the physical extended Church-Turing thesis stating that Boolean circuits capture all feasible computation in the physical world, and its physical and philosophical implications.

"The term code script is, of course, too narrow. The chromosomal structures are at the same time instrumental in bringing about the development they foreshadow. They are law-code and executive power - or, to use another simile, they are architect's plan and builder's craft - in one.", Erwin Schrödinger, 1944.

"A mathematician would hardly call a correspondence between the set of 64 triples of four units and a set of twenty other units, 'universal', while such correspondence is, probably, the most fundamental general feature of life on Earth", Misha Gromov, 2013

A program is simply a sequence of symbols, each of which can be encoded as a string of $0$'s and $1$'s using (for example) the ASCII standard. Therefore we can represent every NAND-CIRC program (and hence also every Boolean circuit) as a binary string. This statement seems obvious but it is actually quite profound. It means that we can treat circuits or NAND-CIRC programs both as instructions for carrying out computation and also as data that could potentially be used as inputs to other computations.

Big Idea 6: A program is a piece of text, and so it can be fed as input to other programs.

This correspondence between code and data is one of the most fundamental aspects of computing. It underlies the notion of general purpose computers, that are not pre-wired to compute only one task, and also forms the basis of our hope for obtaining general artificial intelligence. This concept finds immense use in all areas of computing, from scripting languages to machine learning, but it is fair to say that we haven't yet fully mastered it. Many security exploits involve cases such as "buffer overflows" when attackers manage to inject code where the system expected only "passive" data (see Fig. 5.1). The relation between code and data reaches beyond the realm of electronic


computers. For example, DNA can be thought of as both a program and data (in the words of Schrödinger, who wrote before the discovery of DNA's structure a book that inspired Watson and Crick, DNA is both "architect's plan and builder's craft").

This chapter: A non-mathy overview

In this chapter, we will begin to explore some of the many applications of the correspondence between code and data.

We start by using the representation of programs/circuits as strings to count the number of programs/circuits up to a certain size, and use that to obtain a counterpart to the result we proved in Chapter 4. There we proved that every function can be computed by a circuit, but that circuit could be exponentially large (see Theorem 4.16 for the precise bound). In this chapter we will prove that there are some functions for which we cannot do better: the smallest circuit that computes them is exponentially large. We will also use the notion of representing programs/circuits as strings to show the existence of a "universal circuit" - a circuit that can evaluate other circuits. In programming languages, this is known as a "meta circular evaluator" - a program in a certain programming language that can evaluate other programs in the same language. These results do have an important restriction: the universal circuit will have to be of bigger size than the circuits it evaluates. We will show how to get around this restriction in Chapter 7 where we introduce loops and Turing Machines. See Fig. 5.2 for an overview of the results of this chapter.

Figure 5.1: As illustrated in this xkcd cartoon, many exploits, including buffer overflows, SQL injections, and more, utilize the blurry line between "active programs" and "static strings".

Figure 5.2: Overview of the results in this chapter. We use the representation of programs/circuits as strings to derive two main results. First we show the existence of a universal program/circuit, and in fact (with more work) the existence of such a program/circuit whose size is at most polynomial in the size of the program/circuit it evaluates. We then use the string representation to count the number of programs/circuits of a given size, and use that to establish that some functions require an exponential number of lines/gates to compute.

5.1 REPRESENTING PROGRAMS AS STRINGS

We can represent programs or circuits as strings in a myriad of ways. For example, since Boolean circuits are labeled directed acyclic graphs, we can use the adjacency matrix or adjacency list representations for them. However, since the code of a program is ultimately just a sequence of letters and symbols, arguably the conceptually simplest representation of a program is as such a sequence. For example, the following NAND-CIRC program $P$

temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)

is simply a string of 107 symbols which include lower and upper case letters, digits, the underscore character _ and equality sign =, punctuation marks such as "(", ")", ",", spaces, and "new line" markers (often denoted as "\n" or "↵"). Each such symbol can be encoded as a string of $7$ bits using the ASCII encoding, and hence the program $P$ can be encoded as a string of length $7 \cdot 107 = 749$ bits.

Nothing in the above discussion was specific to the program $P$, and hence we can use the same reasoning to prove that every NAND-CIRC program can be represented as a string in $\{0,1\}^*$. In fact, we can do a bit better. Since the names of the working variables of a NAND-CIRC program do not affect its functionality, we can always transform a program to have the form of $P'$ where all variables apart from the inputs and outputs have the form temp_0, temp_1, temp_2, etc.. Moreover, if the program has $s$ lines, then we will never need to use an index larger than $3s$ (since each line involves at most three variables), and similarly the indices of the input and output variables will all be at most $3s$. Since a number between $0$ and $3s$ can be expressed using at most $\lceil \log_{10}(3s + 1) \rceil = O(\log s)$ digits, each line in the program (which has the form foo = NAND(bar,blah)) can be represented using $O(1) + O(\log s) = O(\log s)$ symbols, each of which can be represented by $7$ bits. Hence an $s$ line program can be represented as a string of $O(s \log s)$ bits, resulting in the following theorem:

Theorem 5.1 — Representing programs as strings. There is a constant $c$ such that for $f \in SIZE(s)$, there exists a program $P$ computing $f$ whose string representation has length at most $cs \log s$.


We omit the formal proof of Theorem 5.1 but please make sure that you understand why it follows from the reasoning above.

5.2 COUNTING PROGRAMS, AND LOWER BOUNDS ON THE SIZE OF NAND-CIRC PROGRAMS

One consequence of the representation of programs as strings is that the number of programs of a certain length is bounded by the number of strings that represent them. This has consequences for the sets $SIZE(s)$ that we defined in Section 4.6.

Theorem 5.2 — Counting programs. For every $s \in \mathbb{N}$,

$$|SIZE(s)| \leq 2^{O(s \log s)}. \qquad (5.1)$$

That is, there are at most $2^{O(s \log s)}$ functions computed by NAND-CIRC programs of at most $s$ lines. (The implicit constant in the $O(\cdot)$ notation is smaller than $10$: for all sufficiently large $s$, $|SIZE(s)| < 2^{10 s \log s}$, see Remark 5.4. As discussed in Section 1.7, we use the bound $10$ simply because it is a round number.)

Proof. We will show a one-to-one map $E$ from $SIZE(s)$ to the set of strings of length $\ell = cs \log s$ for some constant $c$. This will conclude the proof, since it implies that $|SIZE(s)|$ is smaller than the size of the set of all strings of length at most $\ell$, which equals $1 + 2 + 4 + \cdots + 2^\ell = 2^{\ell + 1} - 1$ by the formula for sums of geometric progressions.

The map $E$ will simply map $f$ to the representation of the program computing $f$. Specifically, we let $E(f)$ be the representation of the program $P$ computing $f$ given by Theorem 5.1. This representation has size at most $cs \log s$, and moreover the map $E$ is one to one, since if $f \neq f'$ then every two programs computing $f$ and $f'$ respectively must have different representations. ∎

A function mapping $\{0,1\}^2$ to $\{0,1\}$ can be identified with the table of its four values on the inputs $00, 01, 10, 11$. A function mapping $\{0,1\}^3$ to $\{0,1\}$ can be identified with the table of its eight values on the inputs $000, 001, 010, 011, 100, 101, 110, 111$. More generally, every function $F : \{0,1\}^n \rightarrow \{0,1\}$ can be identified with the table of its $2^n$ values on the inputs $\{0,1\}^n$. Hence the number of functions mapping $\{0,1\}^n$ to $\{0,1\}$ is equal to the number of such tables, which (since we can choose either $0$ or $1$ for every row) is exactly $2^{2^n}$. Note that this is double exponential in $n$, and hence even for small values of $n$ (e.g., $n = 10$) the number of functions from $\{0,1\}^n$ to $\{0,1\}$ is truly astronomical. ("Astronomical" here is an understatement: there are much fewer than $2^{2^{10}}$ stars, or even particles, in the observable universe.) This has the following important corollary:

Theorem 5.3 — Counting argument lower bound. There is a constant $\delta > 0$, such that for every sufficiently large $n$, there is a function $f : \{0,1\}^n \rightarrow \{0,1\}$ such that $f \notin SIZE(\frac{\delta 2^n}{n})$. That is, the shortest NAND-CIRC program to compute $f$ requires more than $\delta \cdot 2^n / n$ lines. (The constant $\delta$ is at least $0.1$ and in fact can be improved to be arbitrarily close to $1/2$, see Exercise 5.7.)

Proof. The proof is simple. If we let $c$ be the constant such that $|SIZE(s)| \leq 2^{cs \log s}$ and $\delta = 1/c$, then setting $s = \delta 2^n / n$ we see that

$$\left|SIZE\left(\frac{\delta 2^n}{n}\right)\right| \leq 2^{c \frac{\delta 2^n}{n} \log s} < 2^{c \delta 2^n} = 2^{2^n} \qquad (5.2)$$

using the fact that $\log s < n$ since $s < 2^n$, and $\delta = 1/c$. But since $|SIZE(s)|$ is smaller than the total number $2^{2^n}$ of functions mapping $n$ bits to $1$ bit, there must be at least one such function not in $SIZE(s)$, which is what we needed to prove. ∎
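The gap between these two counts is easy to see numerically. The following Python computation (our illustration; the constant 10 is the explicit bound noted after Theorem 5.2, and the budget below corresponds to $\delta = 1/100$, an arbitrary choice) compares the base-2 logarithms of the two quantities:

import math

def log2_num_programs(s):
    # log2 of the upper bound 2**(10 * s * log2(s)) on |SIZE(s)|
    return 10 * s * math.log2(s)

def log2_num_functions(n):
    # log2 of the number 2**(2**n) of functions from {0,1}^n to {0,1}
    return 2 ** n

n = 20
s = 2 ** n // (100 * n)  # a budget of delta * 2^n / n lines, delta = 1/100
print(log2_num_programs(s))   # roughly 47,000
print(log2_num_functions(n))  # 1,048,576: far more functions than programs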

We have seen before that every function mapping $\{0,1\}^n$ to $\{0,1\}$ can be computed by an $O(2^n / n)$ line program. Theorem 5.3 shows that this is tight in the sense that some functions do require such an astronomical number of lines to compute.

Big Idea 7: Some functions $f : \{0,1\}^n \rightarrow \{0,1\}$ cannot be computed by a Boolean circuit using fewer than an exponential (in $n$) number of gates.

In fact, as we explore in the exercises, this is the case for most functions. Hence functions that can be computed in a small number of lines (such as addition, multiplication, finding short paths in graphs, or even the EVAL function) are the exception, rather than the rule.

Remark 5.4 — More efficient representation (advanced, optional). The ASCII representation is not the shortest representation for NAND-CIRC programs. NAND-CIRC programs are equivalent to circuits with NAND gates, which means that a NAND-CIRC program of $s$ lines, $n$ inputs, and $m$ outputs can be represented by a labeled directed graph of $s + n$ vertices, of which $n$ have in-degree zero, and the $s$ others have in-degree at most two. Using the adjacency matrix representation for such graphs, we can reduce the implicit constant in Theorem 5.2 to be arbitrarily close to $5$, see Exercise 5.6.

5.2.1 Size hierarchy theorem (optional)

By Theorem 4.15 the class $SIZE_n(10 \cdot 2^n / n)$ contains all functions from $\{0,1\}^n$ to $\{0,1\}$, while by Theorem 5.3, there is some function $f : \{0,1\}^n \rightarrow \{0,1\}$ that is not contained in $SIZE_n(0.1 \cdot 2^n / n)$. In other words, for every sufficiently large $n$,

$$SIZE_n\left(0.1 \cdot \frac{2^n}{n}\right) \subsetneq SIZE_n\left(10 \cdot \frac{2^n}{n}\right). \qquad (5.3)$$

It turns out that we can use Theorem 5.3 to show a more general result: whenever we increase our "budget" of gates we can compute new functions.

Theorem 5.5 — Size Hierarchy Theorem. For every sufficiently large $n$ and $10n < s < 0.1 \cdot 2^n / n$,

$$SIZE_n(s) \subsetneq SIZE_n(s + 10n). \qquad (5.4)$$

Proof Idea: To prove the theorem we need to find a function $f : \{0,1\}^n \rightarrow \{0,1\}$ such that $f$ can be computed by a circuit of $s + 10n$ gates but it cannot be computed by a circuit of $s$ gates. We will do so by coming up with a sequence of functions $f_0, f_1, f_2, \ldots, f_N$ with the following properties: (1) $f_0$ can be computed by a circuit of at most $10n$ gates, (2) $f_N$ cannot be computed by a circuit of $0.1 \cdot 2^n / n$ gates, and (3) for every $i \in \{0, \ldots, N\}$, if $f_i$ can be computed by a circuit of size $s$, then $f_{i+1}$ can be computed by a circuit of size at most $s + 10n$. Together these properties imply that if we let $i$ be the smallest number such that $f_i \notin SIZE_n(s)$, then since $f_{i-1} \in SIZE_n(s)$ it must hold that $f_i \in SIZE_n(s + 10n)$, which is what we need to prove. See Fig. 5.4 for an illustration.

Figure 5.4: We prove Theorem 5.5 by coming up with a list of functions $f_0, \ldots, f_{2^n}$ such that $f_0$ is the all zero function, $f_{2^n}$ is a function (obtained from Theorem 5.3) outside of $SIZE_n(0.1 \cdot 2^n / n)$, and such that $f_{i-1}$ and $f_i$ differ from one another on at most one input. We can show that for every $i$, the number of gates to compute $f_i$ is at most $10n$ larger than the number of gates to compute $f_{i-1}$, and so if we let $i$ be the smallest number such that $f_i \notin SIZE_n(s)$, then $f_i \in SIZE_n(s + 10n)$.

Proof of Theorem 5.5. Let $f^* : \{0,1\}^n \rightarrow \{0,1\}$ be the function (whose existence we are guaranteed by Theorem 5.3) such that $f^* \notin SIZE_n(0.1 \cdot 2^n / n)$. We define the functions $f_0, f_1, \ldots, f_{2^n}$ mapping $\{0,1\}^n$ to $\{0,1\}$ as follows. For every $x \in \{0,1\}^n$, if $lex(x) \in \{0, 1, \ldots, 2^n - 1\}$ is $x$'s order in the lexicographical order, then

$$f_i(x) = \begin{cases} f^*(x) & lex(x) < i \\ 0 & \text{otherwise} \end{cases} \qquad (5.5)$$

The function $f_0$ is simply the constant zero function, while the function $f_{2^n}$ is equal to $f^*$. Moreover, for every $i \in [2^n]$, the functions $f_i$ and $f_{i+1}$ differ on at most one input (i.e., the input $x \in \{0,1\}^n$ such that $lex(x) = i$). Let $10n < s < 0.1 \cdot 2^n / n$, and let $i$ be the first index such that $f_i \notin SIZE_n(s)$. Since $f_{2^n} = f^* \notin SIZE_n(0.1 \cdot 2^n / n)$ there

must exist such an index $i$, and moreover $i > 0$ since the constant zero function is a member of $SIZE_n(10n)$. By our choice of $i$, $f_{i-1}$ is a member of $SIZE_n(s)$. To complete the proof, we need to show that $f_i \in SIZE_n(s + 10n)$. Let $x^*$ be the string such that $lex(x^*) = i - 1$, and let $b \in \{0,1\}$ be the value of $f^*(x^*)$. Then we can also define $f_i$ as follows:

$$f_i(x) = \begin{cases} b & x = x^* \\ f_{i-1}(x) & x \neq x^* \end{cases} \qquad (5.6)$$

or in other words

$$f_i(x) = \left(f_{i-1}(x) \wedge \neg EQUAL(x^*, x)\right) \vee \left(b \wedge EQUAL(x^*, x)\right) \qquad (5.7)$$

where $EQUAL : \{0,1\}^{2n} \rightarrow \{0,1\}$ is the function that maps $x, x' \in \{0,1\}^n$ to $1$ if they are equal and to $0$ otherwise. Since (by our choice of $i$) $f_{i-1}$ can be computed using at most $s$ gates and (as can be easily verified) $EQUAL \in SIZE(9n)$, we can compute $f_i$ using at most $s + 9n + O(1) \leq s + 10n$ gates, which is what we wanted to prove. ∎

Figure 5.5: An illustration of some of what we know about the size complexity classes (not to scale!). This figure depicts classes of the form $SIZE_{n,n}(s)$ but the state of affairs for other size complexity classes such as $SIZE_{n,1}(s)$ is similar. We know by Theorem 4.12 (with the improvement of Section 4.4.2) that all functions mapping $n$ bits to $n$ bits can be computed by a circuit of size $c \cdot 2^n$ for $c \leq 10$, while on the other hand the counting lower bound (Theorem 5.3, see also Exercise 5.4) shows that some such functions will require $0.1 \cdot 2^n$, and the size hierarchy theorem (Theorem 5.5) shows the existence of functions in $SIZE(S) \setminus SIZE(s)$ whenever $s = o(S)$, see also Exercise 5.5. We also consider some specific examples: addition of two $n/2$ bit numbers can be done in $O(n)$ lines, while we don't know of such a program for multiplying two $n/2$ bit numbers, though we do know it can be done in $O(n^2)$ and in fact even better size. In the above, $FACTOR_n$ corresponds to the inverse problem of multiplying - finding the prime factorization of a given number. At the moment we do not know of any circuit with a polynomial (or even sub-exponential) number of lines that can compute $FACTOR_n$.

Remark 5.6 — Explicit functions. While the size hierarchy theorem guarantees that there exists some function that can be computed using, for example, $n^2$ gates, but not using $100n$ gates, we do not know of any explicit example of such a function. While we suspect that integer multiplication is such an example, we do not have any proof that this is the case.

5.3 THE TUPLES REPRESENTATION

ASCII is a fine representation of programs, but for some applications it is useful to have a more concrete representation of NAND-CIRC programs. In this section we describe a particular choice, that will be convenient for us later on. A NAND-CIRC program is simply a sequence of lines of the form

blah = NAND(baz,boo)

There is of course nothing special about the particular names we use for variables. Although they would be harder to read, we could write all our programs using only working variables such as temp_0, temp_1 etc. Therefore, our representation for NAND-CIRC programs ignores the actual names of the variables, and just associates a number with each variable. We encode a line of the program as a triple of numbers. If the line has the form foo = NAND(bar,blah) then we encode it with the triple $(i, j, k)$, where $i$ is the number corresponding to the variable foo and $j$ and $k$ are the numbers corresponding to bar and blah respectively.

More concretely, we will associate every variable with a number in the set $[t] = \{0, 1, \ldots, t - 1\}$. The first $n$ numbers $\{0, \ldots, n-1\}$ correspond to the input variables, the last $m$ numbers $\{t - m, \ldots, t - 1\}$ correspond to the output variables, and the intermediate numbers $\{n, \ldots, t - m - 1\}$ correspond to the remaining "workspace" variables. Formally, we define our representation as follows:

Definition 5.7 — List of tuples representation. Let $P$ be a NAND-CIRC program of $n$ inputs, $m$ outputs, and $s$ lines, and let $t$ be the number of distinct variables used by $P$. The list of tuples representation of $P$ is the triple $(n, m, L)$ where $L$ is a list of triples of the form $(i, j, k)$ for $i, j, k \in [t]$.

We assign a number for every variable of $P$ as follows:

• For every $i \in [n]$, the variable X[$i$] is assigned the number $i$.
• For every $j \in [m]$, the variable Y[$j$] is assigned the number $t - m + j$.
• Every other variable is assigned a number in $\{n, n+1, \ldots, t - m - 1\}$ in the order in which the variable appears in the program $P$.

The list of tuples representation is our default choice for representing NAND-CIRC programs. Since "list of tuples representation" is a bit of a mouthful, we will often call it simply "the representation" for a program $P$. Sometimes, when the number $n$ of inputs and number


$m$ of outputs are known from the context, we will simply represent a program as the list $L$ instead of the triple $(n, m, L)$.

Example 5.8 — Representing the XOR program. Our favorite NAND-CIRC program, the program

u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)

computing the XOR function is represented as the tuple $(2, 1, L)$ where $L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4))$. That is, the variables X[0] and X[1] are given the indices $0$ and $1$ respectively, the variables u, v, w are given the indices $2, 3, 4$ respectively, and the variable Y[0] is given the index $5$.

Transforming a NAND-CIRC program from its representation as code to the representation as a list of tuples is a fairly straightforward programming exercise, and in particular can be done in a few lines of Python. (If you're curious what these few lines are, see our GitHub repository.) The list-of-tuples representation loses information such as the particular names we used for the variables, but this is OK since these names do not make a difference to the functionality of the program.
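For concreteness, here is one way those few lines of Python could look (a sketch of ours; the book's own version is in its GitHub repository). It parses the code of a program into the triple $(n, m, L)$ of Definition 5.7:

import re

def nandcirc_to_tuples(code):
    # Parse lines of the form "foo = NAND(bar,blah)" into triples (i, j, k),
    # numbering the variables as in Definition 5.7.
    lines = [l.strip() for l in code.strip().splitlines() if l.strip()]
    n = 1 + max(int(d) for d in re.findall(r'X\[(\d+)\]', code))
    m = 1 + max(int(d) for d in re.findall(r'Y\[(\d+)\]', code))
    workspace = {}  # variable name -> index, in order of appearance
    def idx(var):
        if var.startswith('X['):
            return int(var[2:-1])
        if var.startswith('Y['):
            return ('Y', int(var[2:-1]))  # resolved once t is known
        if var not in workspace:
            workspace[var] = n + len(workspace)
        return workspace[var]
    triples = []
    for line in lines:
        target, a, b = re.match(r'(\S+) = NAND\((\S+),(\S+)\)', line).groups()
        triples.append((idx(target), idx(a), idx(b)))
    t = n + len(workspace) + m
    fix = lambda v: t - m + v[1] if isinstance(v, tuple) else v
    L = [tuple(fix(v) for v in tr) for tr in triples]
    return n, m, L

xor_code = """u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)"""
print(nandcirc_to_tuples(xor_code))
# (2, 1, [(2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)])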

5.3.1 From tuples to strings

If $P$ is a program of size $s$, then the number of variables $t$ is at most $3s$ (as every line touches at most three variables). Hence we can encode every variable index in $[t]$ as a string of length $\ell = \lceil \log(3s) \rceil$, by adding leading zeroes as needed. Since this is a fixed-length encoding, it is prefix free, and so we can encode the list $L$ of $s$ triples (corresponding to the encoding of the lines of the program) as simply the string of length $3\ell s$ obtained by concatenating all of these encodings. We define $S(s)$ to be the length of the string representing the list $L$ corresponding to a size $s$ program. By the above we see that

$$S(s) = 3s \lceil \log(3s) \rceil. \qquad (5.8)$$

We can represent $P = (n, m, L)$ as a string by prepending a prefix free representation of $n$ and $m$ to the list $L$. Since $n, m \leq 3s$ (a program must touch at least once all its input and output variables), those prefix free representations can be encoded using strings of length $O(\log s)$. In particular, every program $P$ of at most $s$ lines can be represented by a string of length $O(s \log s)$. Similarly, every circuit $C$ of at most $s$ gates can be represented by a string of length $O(s \log s)$ (for example by translating $C$ to the equivalent program $P$).
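The fixed-length encoding behind (5.8) is equally short to implement. This Python sketch (ours) encodes a list of $s$ triples as a binary string of length $3s\lceil \log(3s) \rceil$:

import math

def tuples_to_string(L):
    # Encode each index as ell = ceil(log2(3s)) bits and concatenate
    s = len(L)
    ell = math.ceil(math.log2(3 * s))
    bits = ''.join(format(v, '0{}b'.format(ell)) for triple in L for v in triple)
    assert len(bits) == 3 * ell * s  # the length S(s) of equation (5.8)
    return bits

L = [(2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)]
print(tuples_to_string(L))  # a 48-bit string (here ell = 4 and s = 4)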

5.4 A NAND-CIRC INTERPRETER IN NAND-CIRC

Since we can represent programs as strings, we can also think of a program as an input to a function. In particular, for all natural numbers $s, n, m > 0$ we define the function $EVAL_{s,n,m} : \{0,1\}^{S(s)+n} \rightarrow \{0,1\}^m$ as follows:

$$EVAL_{s,n,m}(px) = \begin{cases} P(x) & p \in \{0,1\}^{S(s)} \text{ represents a size-} s \text{ program } P \text{ with } n \text{ inputs and } m \text{ outputs} \\ 0^m & \text{otherwise} \end{cases} \qquad (5.9)$$

where $S(s)$ is defined as in (5.8) and we use the concrete representation scheme described in Section 5.1. That is, $EVAL_{s,n,m}$ takes as input the concatenation of two strings: a string $p \in \{0,1\}^{S(s)}$ and a string $x \in \{0,1\}^n$. If $p$ is a string that represents a list of triples $L$ such that $(n, m, L)$ is a list-of-tuples representation of a size-$s$ NAND-CIRC program $P$, then $EVAL_{s,n,m}(px)$ is equal to the evaluation $P(x)$ of the program $P$ on the input $x$. Otherwise, $EVAL_{s,n,m}(px)$ equals $0^m$ (this case is not very important: you can simply think of $0^m$ as some "junk value" that indicates an error).

Take-away points. The fine details of $EVAL_{s,n,m}$'s definition are not very crucial. Rather, what you need to remember about $EVAL_{s,n,m}$ is that:

• $EVAL_{s,n,m}$ is a finite function taking a string of fixed length as input and outputting a string of fixed length as output.

• $EVAL_{s,n,m}$ is a single function, such that computing $EVAL_{s,n,m}$ allows to evaluate arbitrary NAND-CIRC programs of a certain length on arbitrary inputs of the appropriate length.

• EVAL$_{s,n,m}$ is a function, not a program (recall the discussion in Section 3.6.2). That is, EVAL$_{s,n,m}$ is a specification of what output is associated with what input. The existence of a program that computes EVAL$_{s,n,m}$ (i.e., an implementation for EVAL$_{s,n,m}$) is a separate fact, which needs to be established (and which we will do in Theorem 5.9, with a more efficient program shown in Theorem 5.10). (A Python rendering of this specification is sketched below.)
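As a sanity check on the specification, here is a Python rendering of (5.9). This is only our illustration of what EVAL computes, not the NAND-CIRC implementation that Theorem 5.9 asserts; it assumes the encoding convention above and the NANDEVAL function presented later in Section 5.4.3.

from math import ceil, log2

def EVAL(s, n, m, px):
    # Split px into the program part p (the first S(s) bits) and the
    # input x (the last n bits), decode p into a list of triples, and
    # evaluate the program on x; return the "junk value" 0^m if the
    # decoding fails.
    ell = ceil(log2(3 * s))
    S = 3 * s * ell
    p, x = px[:S], [int(c) for c in px[S:]]
    L = [tuple(int(p[u:u + ell], 2) for u in (w, w + ell, w + 2 * ell))
         for w in range(0, S, 3 * ell)]
    try:
        return NANDEVAL(n, m, L, x)  # NANDEVAL is defined in Section 5.4.3
    except (IndexError, ValueError):
        return [0] * m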

One of the first examples of self-referential circularity we will see in this book is the following theorem, which we can think of as showing a "NAND-CIRC interpreter in NAND-CIRC":

Theorem 5.9 — Bounded Universality of NAND-CIRC programs. For every $s, n, m \in \mathbb{N}$ with $s \ge m$ there is a NAND-CIRC program $U_{s,n,m}$ that computes the function EVAL$_{s,n,m}$.

That is, the NAND-CIRC program $U_{s,n,m}$ takes the description of any other NAND-CIRC program $P$ (of the right length and inputs/outputs) and any input $x$, and computes the result of evaluating the program $P$ on the input $x$. Given the equivalence between NAND-CIRC programs and Boolean circuits, we can also think of $U_{s,n,m}$ as a circuit that takes as input the description of other circuits and their inputs, and returns their evaluation, see Fig. 5.6. We call this NAND-CIRC program $U_{s,n,m}$ that computes EVAL$_{s,n,m}$ a bounded universal program (or a universal circuit, see Fig. 5.6). "Universal" stands for the fact that this is a single program that can evaluate arbitrary code, where "bounded" stands for the fact that $U_{s,n,m}$ only evaluates programs of bounded size. Of course this limitation is inherent for the NAND-CIRC programming language, since a program of $s$ lines (or, equivalently, a circuit of $s$ gates) can take at most $2s$ inputs. Later, in Chapter 7, we will introduce the concept of loops (and the model of Turing Machines), that allow to escape this limitation.

Proof. Theorem 5.9 is an important result, but it is actually not hard to prove. Specifically, since EVAL$_{s,n,m}$ is a finite function, Theorem 5.9 is an immediate corollary of Theorem 4.12, which states that every finite function can be computed by some NAND-CIRC program. ■

P Theorem 5.9 is simple but important. Make sure you understand what this theorem means, and why it is a corollary of Theorem 4.12.

5.4.1 Efficient universal programs
Theorem 5.9 establishes the existence of a NAND-CIRC program for computing EVAL$_{s,n,m}$, but it provides no explicit bound on the size of this program. Theorem 4.12, which we used to prove Theorem 5.9, guarantees the existence of a NAND-CIRC program whose size can be as large as exponential in the length of its input. This would mean that even for moderately small values of $s, n, m$ (for example $n = 100$, $s = 300$, $m = 1$), computing EVAL$_{s,n,m}$ might require a NAND program with more lines than there are atoms in the observable universe! Fortunately, we can do much better than that. In fact, for every $s, n, m$ there exists a NAND-CIRC program for computing EVAL$_{s,n,m}$ with size that is polynomial in its input length. This is shown in the following theorem.

[Figure 5.6: A universal circuit $U$ is a circuit that gets as input the description of an arbitrary (smaller) circuit $P$ as a binary string, and an input $x$, and outputs the string $P(x)$ which is the evaluation of $P$ on $x$. We can also think of $U$ as a straight-line program that gets as input the code of a straight-line program $P$ and an input $x$, and outputs $P(x)$.]

Theorem 5.10 — Efficient bounded universality of NAND-CIRC programs. For every $s, n, m \in \mathbb{N}$ there is a NAND-CIRC program of at most $O(s^2 \log s)$ lines that computes the function EVAL$_{s,n,m} : \{0,1\}^{S+n} \to \{0,1\}^m$ defined above (where $S$ is the number of bits needed to represent programs of $s$ lines).

P If you haven't done so already, now might be a good time to review $O$ notation in Section 1.4.8. In particular, an equivalent way to state Theorem 5.10 is that it says that there exists some number $c > 0$ such that for every $s, n, m \in \mathbb{N}$, there exists a NAND-CIRC program $P$ of at most $c s^2 \log s$ lines that computes the function EVAL$_{s,n,m}$.

Unlike Theorem 5.9, Theorem 5.10 is not a trivial corollary of the fact that every finite function can be computed by some circuit. Proving Theorem 5.10 requires us to present a concrete NAND-CIRC program for computing the function EVAL$_{s,n,m}$. We will do so in several stages.

1. First, we will describe the algorithm to evaluate EVAL$_{s,n,m}$ in "pseudo code".

2. Then, we will show how we can write a program to compute EVAL$_{s,n,m}$ in Python. We will not use much about Python, and a reader that has familiarity with programming in any language should be able to follow along.

3. Finally, we will show how we can transform this Python program into a NAND-CIRC program.

This approach yields much more than just proving Theorem 5.10: we will see that it is in fact always possible to transform (loop free) code in high level languages such as Python to NAND-CIRC pro- grams (and hence to Boolean circuits as well).

5.4.2 A NAND-CIRC interpreter in "pseudocode"
To prove Theorem 5.10 it suffices to give a NAND-CIRC program of $O(s^2 \log s)$ lines that can evaluate NAND-CIRC programs of $s$ lines. Let us start by thinking how we would evaluate such programs if we weren't restricted to only performing NAND operations. That is, let us describe informally an algorithm that on input $n, m, s$, a list of triples $L$, and a string $x \in \{0,1\}^n$, evaluates the program represented by $(n, m, L)$ on the string $x$.

P It would be highly worthwhile for you to stop here and try to solve this problem yourself. For example, you can try thinking how you would write a program NANDEVAL(n,m,s,L,x) that computes this function in the programming language of your choice.

We will now describe such an algorithm. We assume that we have access to a bit array data structure that can store for every $i \in [t]$ a bit $T_i \in \{0,1\}$. Specifically, if Table is a variable holding this data structure, then we assume we can perform the operations:

• GET(Table,i) which retrieves the bit corresponding to i in Table. The value of i is assumed to be an integer in $[t]$.

• Table = UPDATE(Table,i,b) which updates Table so that the bit corresponding to i is now set to b. The value of i is assumed to be an integer in $[t]$ and b is a bit in $\{0,1\}$.

Algorithm 5.11 — Eval NAND-CIRC programs.

Input: Numbers $n, m$ and $t \le 3s$, as well as a list $L$ of $s$ triples of numbers in $[t]$, and a string $x \in \{0,1\}^n$.
Output: Evaluation of the program represented by $(n, m, L)$ on the input $x \in \{0,1\}^n$.

1: Let Vartable be a table of size $t$
2: for $i$ in $[n]$ do
3:    Vartable = UPDATE(Vartable, $i$, $x_i$)
4: end for
5: for $(i, j, k)$ in $L$ do
6:    $a \leftarrow$ GET(Vartable, $j$)
7:    $b \leftarrow$ GET(Vartable, $k$)
8:    Vartable = UPDATE(Vartable, $i$, NAND($a$, $b$))
9: end for
10: for $j$ in $[m]$ do
11:    $y_j \leftarrow$ GET(Vartable, $t - m + j$)
12: end for
13: return $y_0, \ldots, y_{m-1}$

Algorithm 5.11 evaluates the program given to it as input one line at a time, updating the Vartable table to contain the value of each variable. At the end of the execution it outputs the variables at positions $t-m, t-m+1, \ldots, t-1$, which correspond to the output variables.

5.4.3 A NAND interpreter in Python
To make things more concrete, let us see how we implement Algorithm 5.11 in the Python programming language. (There is nothing special about Python. We could have easily presented a corresponding function in JavaScript, C, OCaml, or any other programming language.) We will construct a function NANDEVAL that on input $n, m, L, x$ will output the result of evaluating the program represented by $(n, m, L)$ on $x$. To keep things simple, we will not worry about the case that $L$ does not represent a valid program of $n$ inputs and $m$ outputs. The code is presented in Fig. 5.7.

Accessing an element of the array Vartable at a given index takes a constant number of basic operations. [Footnote 5: Python does not distinguish between lists and arrays, but allows constant time random access to an indexed element in both of them. One could argue that if we allowed programs of truly unbounded length (e.g., larger than $2^{64}$) then the price would not be constant but logarithmic in the length of the array/lists, but the difference between $O(s)$ and $O(s \log s)$ will not be important for our discussions.] Hence (since $n, m \le s$ and $t \le 3s$), the program above will use $O(s)$ basic operations.

5.4.4 Constructing the NAND-CIRC interpreter in NAND-CIRC
We now turn to describing the proof of Theorem 5.10. To prove the theorem it is not enough to give a Python program. Rather, we need to show how we compute the function EVAL$_{s,n,m}$ using a NAND-CIRC program. In other words, our job is to transform, for every $s, n, m$, the Python code of Section 5.4.3 to a NAND-CIRC program $U_{s,n,m}$ that computes the function EVAL$_{s,n,m}$.

P Before reading further, try to think how you could give a "constructive proof" of Theorem 5.10. That is, think of how you would write, in the programming language of your choice, a function universal(s,n,m) that on input $s, n, m$ outputs the code for the NAND-CIRC program $U_{s,n,m}$ such that $U_{s,n,m}$ computes EVAL$_{s,n,m}$. There is a subtle but crucial difference between this function and the Python NANDEVAL program described above. Rather than actually evaluating a given program $P$ on some input $w$, the function universal should output the code of a NAND-CIRC program that computes the map $(P, x) \mapsto P(x)$.

Our construction will follow very closely the Python implementation of EVAL above. We will use variables Vartable[$0$], …, Vartable[$2^\ell - 1$], where $\ell = \lceil \log 3s \rceil$, to store our variables. However, NAND doesn't have integer-valued variables, so we cannot write code such as Vartable[i] for some variable i. However, we can implement the function GET(Vartable,i) that outputs the i-th bit of the array Vartable. Indeed, this is nothing but the function LOOKUP$_\ell$ that we have seen in Theorem 4.10!

P Please make sure that you understand why GET and LOOKUP$_\ell$ are the same function.

We saw that we can compute LOOKUP$_\ell$ in time $O(2^\ell) = O(s)$ for our choice of $\ell$.
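For intuition, here is how the recursive construction behind LOOKUP$_\ell$ looks when written as plain Python. This is our own sketch; the actual NAND-CIRC code "unrolls" this recursion into $O(2^\ell)$ straight lines.

def LOOKUP(V, i):
    # Return V[i], where i is a list of ell bits (most significant bit
    # first) and V has 2^ell entries. LOOKUP on 2^ell values reduces to
    # two LOOKUPs on 2^(ell-1) values plus one IF/multiplexer step,
    # which gives the O(2^ell) size bound.
    if len(i) == 0:
        return V[0]
    half = len(V) // 2
    low = LOOKUP(V[:half], i[1:])    # the case i[0] == 0
    high = LOOKUP(V[half:], i[1:])   # the case i[0] == 1
    return high if i[0] == 1 else low

For example, LOOKUP([0, 1, 1, 0], [1, 0]) returns 1, the entry at index 2.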

Figure 5.7: Code for evaluating a NAND-CIRC program given in the list-of-tuples representation

def NANDEVAL(n,m,L,X):
    # Evaluate a NAND-CIRC program from list of tuples representation.
    s = len(L)  # num of lines
    t = max(max(a,b,c) for (a,b,c) in L)+1  # max index in L + 1
    Vartable = [0] * t  # initialize array

    # helper functions
    def NAND(a,b): return 1-a*b  # the NAND gate itself, so the figure runs standalone
    def GET(V,i): return V[i]
    def UPDATE(V,i,b):
        V[i]=b
        return V

    # load input values to Vartable:
    for i in range(n):
        Vartable = UPDATE(Vartable,i,X[i])

    # Run the program
    for (i,j,k) in L:
        a = GET(Vartable,j)
        b = GET(Vartable,k)
        c = NAND(a,b)
        Vartable = UPDATE(Vartable,i,c)

    # Return outputs Vartable[t-m], Vartable[t-m+1],....,Vartable[t-1]
    return [GET(Vartable,t-m+j) for j in range(m)]

# Test on XOR (2 inputs, 1 output)
L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4))
print(NANDEVAL(2,1,L,(0,1)))  # XOR(0,1)
# [1]
print(NANDEVAL(2,1,L,(1,1)))  # XOR(1,1)
# [0]

For every $\ell$, let UPDATE$_\ell : \{0,1\}^{2^\ell+\ell+1} \to \{0,1\}^{2^\ell}$ correspond to the UPDATE function for arrays of length $2^\ell$. That is, on input $V \in \{0,1\}^{2^\ell}$, $i \in \{0,1\}^\ell$, $b \in \{0,1\}$, UPDATE$_\ell(V, i, b)$ is equal to $V' \in \{0,1\}^{2^\ell}$ such that

$$V'_j = \begin{cases} V_j & j \neq i \\ b & j = i \end{cases} \qquad (5.10)$$

where we identify the string $i \in \{0,1\}^\ell$ with a number in $\{0, \ldots, 2^\ell - 1\}$ using the binary representation. We can compute UPDATE$_\ell$ using an $O(2^\ell \ell) = O(s \log s)$ line NAND-CIRC program as follows:

1. For every $j \in [2^\ell]$, there is an $O(\ell)$ line NAND-CIRC program to compute the function EQUALS$_j : \{0,1\}^\ell \to \{0,1\}$ that on input $i$ outputs $1$ if and only if $i$ is equal to (the binary representation of) $j$. (We leave verifying this as Exercise 5.2 and Exercise 5.3.)

2. We have seen that we can compute the function IF $: \{0,1\}^3 \to \{0,1\}$ such that IF$(a, b, c)$ equals $b$ if $a = 1$ and $c$ if $a = 0$.

Together, this means that we can compute UPDATE (using some "syntactic sugar" for bounded length loops) as follows:

def UPDATE_ell(V,i,b):
    # Get V[0]...V[2^ell-1], i in {0,1}^ell, b in {0,1}
    # Return NewV[0],...,NewV[2^ell-1]
    # updated array with NewV[i]=b and all
    # else same as V
    for j in range(2**ell):  # j = 0,1,2,....,2^ell - 1
        a = EQUALS_j(i)
        NewV[j] = IF(a,b,V[j])
    return NewV

Since the loop over j in UPDATE$_\ell$ is run $2^\ell$ times, and computing EQUALS_j takes $O(\ell)$ lines, the total number of lines to compute UPDATE$_\ell$ is $O(2^\ell \cdot \ell) = O(s \log s)$. Once we can compute GET and UPDATE, the rest of the implementation amounts to "book keeping" that needs to be done carefully, but is not too insightful, and hence we omit the full details. Since we run GET and UPDATE $s$ times, the total number of lines for computing EVAL$_{s,n,m}$ is $O(s^2) + O(s^2 \log s) = O(s^2 \log s)$. This completes (up to the omitted details) the proof of Theorem 5.10.
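The IF function used above is itself just a handful of NAND lines. Here is a Python sketch of one standard way to build it (our rendering, using the identity IF(a,b,c) = (a AND b) OR (NOT(a) AND c)):

def NAND(a, b): return 1 - a * b

def IF(a, b, c):
    # Multiplexer from NAND alone: returns b when a == 1 and c when
    # a == 0, since NAND(NAND(a,b), NAND(NOT(a),c)) equals
    # (a AND b) OR ((NOT a) AND c).
    u = NAND(a, b)
    v = NAND(NAND(a, a), c)  # NAND(a,a) computes NOT(a)
    return NAND(u, v)

# quick check over all 8 inputs
assert all(IF(a, b, c) == (b if a else c)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))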

R Remark 5.12 — Improving to quasilinear overhead (advanced optional note). The NAND-CIRC program above is less efficient than its Python counterpart, since NAND does not offer arrays with efficient random access. Hence for example the LOOKUP operation on an array of $s$ bits takes $\Omega(s)$ lines in NAND even though it takes $O(1)$ steps (or maybe $O(\log s)$ steps, depending on how we count) in Python. It turns out that it is possible to improve the bound of Theorem 5.10, and evaluate $s$ line NAND-CIRC programs using a NAND-CIRC program of $O(s \log s)$ lines. The key is to consider the description of NAND-CIRC programs as circuits, and in particular as directed acyclic graphs (DAGs) of bounded in-degree. A universal NAND-CIRC program $U_s$ for $s$ line programs will correspond to a universal graph $H_s$ for such $s$ vertex DAGs. We can think of such a graph $H_s$ as fixed "wiring" for a communication network, that should be able to accommodate any arbitrary pattern of communication between $s$ vertices (where this pattern corresponds to an $s$ line NAND-CIRC program). It turns out that such efficient routing networks exist that allow embedding any $s$ vertex circuit inside a universal graph of size $O(s \log s)$, see the bibliographical notes Section 5.9 for more on this issue.

5.5 A PYTHON INTERPRETER IN NAND-CIRC (DISCUSSION)

To prove Theorem 5.10 we essentially translated every line of the Python program for EVAL into an equivalent NAND-CIRC snippet. However, none of our reasoning was specific to the particular function EVAL. It is possible to translate every Python program into an equivalent NAND-CIRC program of comparable efficiency. (More concretely, if the Python program takes $T(n)$ operations on inputs of length at most $n$ then there exists a NAND-CIRC program of $O(T(n) \log T(n))$ lines that agrees with the Python program on inputs of length $n$.) Actually doing so requires taking care of many details and is beyond the scope of this book, but let me try to convince you why you should believe it is possible in principle. For starters, one can use CPython (the reference implementation for Python), to evaluate every Python program using a C program. We can combine this with a C compiler to transform a Python program to various flavors of "machine language". So, to transform a Python program into an equivalent NAND-CIRC program, it is enough to show how to transform a machine language program into an equivalent NAND-CIRC program. One minimalistic (and hence convenient) family of machine languages is known as the ARM architecture which powers many mobile devices including essentially all Android devices. [Footnote 6: ARM stands for "Advanced RISC Machine" where RISC in turn stands for "Reduced instruction set computer".] There are even simpler machine languages, such as the LEG architecture for which a backend for the LLVM compiler was implemented (and hence can be the target of compiling any of the large and growing list of languages that this compiler supports). Other examples include the TinyRAM architecture (motivated by interactive proof systems that we will discuss in Chapter 22) and the teaching-oriented Ridiculously Simple Computer architecture. Going one by one over the instruction sets of such computers and translating them to NAND snippets is no fun, but it is a feasible thing to do. In fact, ultimately this is very similar to the transformation that takes place in converting our high level code to actual silicon gates that are not so different from the operations of a NAND-CIRC program. Indeed, tools such as MyHDL that transform "Python to Silicon" can be used to convert a Python program to a NAND-CIRC program.

The NAND-CIRC programming language is just a teaching tool, and by no means do I suggest that writing NAND-CIRC programs, or compilers to NAND-CIRC, is a practical, useful, or enjoyable activity. What I do want is to make sure you understand why it can be done, and to have the confidence that if your life (or at least your grade) depended on it, then you would be able to do this. Understanding how programs in high level languages such as Python are eventually transformed into concrete low-level representation such as NAND is fundamental to computer science.

The astute reader might notice that the above paragraphs only outlined why it should be possible to find for every particular Python-computable function $f$, a particular comparably efficient NAND-CIRC program $P$ that computes $f$. But this still seems to fall short of our goal of writing a "Python interpreter in NAND" which would mean that for every parameter $s$, we come up with a single NAND-CIRC program UNIV$_s$ such that given a description of a Python program $P$, a particular input $x$, and a bound $T$ on the number of operations (where the lengths of $P$ and $x$ and the value of $T$ are all at most $s$), UNIV$_s$ returns the result of executing $P$ on $x$ for at most $T$ steps. After all, the transformation above takes every Python program into a different NAND-CIRC program, and so does not yield "one NAND-CIRC program to rule them all" that can evaluate every Python program up to some given complexity. However, we can in fact obtain one NAND-CIRC program to evaluate arbitrary Python programs. The reason is that there exists a Python interpreter in Python: a Python program $U$ that takes a bit string, interprets it as Python code, and then runs that code. Hence, we only need to show a NAND-CIRC program $U^*$ that computes the same function as the particular Python program $U$, and this will give us a way to evaluate all Python programs.

What we are seeing time and again is the notion of universality or self reference of computation, which is the sense that all reasonably rich models of computation are expressive enough that they can "simulate themselves". The importance of this phenomenon to both the theory and practice of computing, as well as far beyond it, including the

foundations of mathematics and basic questions in science, cannot be overstated.

5.6 THE PHYSICAL EXTENDED CHURCH-TURING THESIS (DISCUSSION)

We've seen that NAND gates (and other Boolean operations) can be implemented using very different systems in the physical world. What about the reverse direction? Can NAND-CIRC programs simulate any physical computer? We can take a leap of faith and stipulate that Boolean circuits (or equivalently NAND-CIRC programs) do actually encapsulate every computation that we can think of. Such a statement (in the realm of infinite functions, which we'll encounter in Chapter 7) is typically attributed to Alonzo Church and Alan Turing, and in that context is known as the Church-Turing Thesis. As we will discuss in future lectures, the Church-Turing Thesis is not a mathematical theorem or conjecture. Rather, like theories in physics, the Church-Turing Thesis is about mathematically modeling the real world. In the context of finite functions, we can make the following informal hypothesis or prediction:

"Physical Extended Church-Turing Thesis" (PECTT): If a function $F : \{0,1\}^n \to \{0,1\}^m$ can be computed in the physical world using $s$ amount of "physical resources" then it can be computed by a Boolean circuit of roughly $s$ gates.

A priori it might seem rather extreme to hypothesize that our meager model of NAND-CIRC programs or Boolean circuits captures all possible physical computation. But yet, in more than a century of computing technologies, no one has yet built any scalable computing device that challenges this hypothesis. We now discuss the "fine print" of the PECTT in more detail, as well as the (so far unsuccessful) challenges that have been raised against it. There is no single universally-agreed-upon formalization of "roughly $s$ physical resources", but we can approximate this notion by considering the size of any physical computing device and the time it takes to compute the output, and ask that any such device can be simulated by a Boolean circuit with a number of gates that is a polynomial (with not too large exponent) in the size of the system and the time it takes it to operate. In other words, we can phrase the PECTT as stipulating that any function that can be computed by a device that takes a certain volume $V$ of space and requires $t$ time to complete the computation, must be computable by a Boolean circuit with a number of gates $p(V, t)$ that is polynomial in $V$ and $t$.

The exact form of the function $p(V, t)$ is not universally agreed upon, but it is generally accepted that if $f : \{0,1\}^n \to \{0,1\}$ is an exponentially hard function, in the sense that it has no NAND-CIRC program of fewer than, say, $2^{n/2}$ lines, then a demonstration of a physical device that can compute $f$ in the real world for moderate input lengths (e.g., $n = 500$) would be a violation of the PECTT.

R Remark 5.13 — Advanced note: making PECTT concrete (advanced, optional). We can attempt a more exact phrasing of the PECTT as follows. Suppose that $Z$ is a physical system that accepts $n$ binary stimuli and has a binary output, and can be enclosed in a sphere of volume $V$. We say that the system $Z$ computes a function $f : \{0,1\}^n \to \{0,1\}$ within $t$ seconds if whenever we set the stimuli to some value $x \in \{0,1\}^n$, if we measure the output after $t$ seconds then we obtain $f(x)$.

One can then phrase the PECTT as stipulating that if there exists such a system $Z$ that computes $F$ within $t$ seconds, then there exists a NAND-CIRC program that computes $F$ and has at most $\alpha(Vt)^2$ lines, where $\alpha$ is some normalization constant. (We can also consider variants where we use surface area instead of volume, or take $(Vt)$ to a different power than $2$. However, none of these choices makes a qualitative difference to the discussion below.) In particular, suppose that $f : \{0,1\}^n \to \{0,1\}$ is a function that requires $2^n/(100n) > 2^{0.8n}$ lines for any NAND-CIRC program (such a function exists by Theorem 5.3). Then the PECTT would imply that either the volume or the time of a system that computes $F$ will have to be at least $2^{0.2n}/\sqrt{\alpha}$. Since this quantity grows exponentially in $n$, it is not hard to set parameters so that even for moderately large values of $n$, such a system could not fit in our universe.

To fully make the PECTT concrete, we need to decide on the units for measuring time and volume, and the normalization constant $\alpha$. One conservative choice is to assume that we could squeeze computation to the absolute physical limits (which are many orders of magnitude beyond current technology). This corresponds to setting $\alpha = 1$ and using the Planck units for volume and time. The Planck length $\ell_P$ (which is, roughly speaking, the shortest distance that can theoretically be measured) is roughly $2^{-120}$ meters. The Planck time $t_P$ (which is the time it takes for light to travel one Planck length) is about $2^{-150}$ seconds. In the above setting, if a function $F$ takes, say, 1KB of input (e.g., roughly $10^4$ bits, which can encode a $100$ by $100$ bitmap image), and requires at least $2^{0.8n} = 2^{0.8 \cdot 10^4}$ NAND lines to compute, then any physical system that computes it would require either volume of $2^{0.2 \cdot 10^4}$ Planck length cubed, which is more than $2^{1500}$ meters cubed, or take at least $2^{0.2 \cdot 10^4}$ Planck time units, which is larger than $2^{1500}$ seconds. To get a sense of how big that number is, note that the universe is only about $2^{60}$ seconds old, and its observable radius is only roughly $2^{90}$ meters. The above discussion suggests that it is possible to empirically falsify the PECTT by presenting a smaller-than-universe-size system that computes such a function.

There are of course several hurdles to refuting the PECTT in this way, one of which is that we can't actually test the system on all possible inputs. However, it turns out that we can get around this issue using notions such as interactive proofs and program checking that we might encounter later in this book. Another, perhaps more salient problem, is that while we know many hard functions exist, at the moment there is no single explicit function $F : \{0,1\}^n \to \{0,1\}$ for which we can prove an $\omega(n)$ (let alone $\Omega(2^n/n)$) lower bound on the number of lines that a NAND-CIRC program needs to compute it.
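The orders of magnitude in the remark are easy to verify with a few lines of arithmetic. The constants below (age and radius of the observable universe) are rough public estimates that we supply for illustration:

from math import log2

n = 10**4                   # ~1KB input, as in the remark
required_log2 = 0.2 * n     # required volume or time is ~2^(0.2n) Planck units
age_log2 = log2(4.3e17)     # universe age in seconds: about 2^59, i.e. roughly 2^60
radius_log2 = log2(4.4e26)  # observable radius in meters: about 2^89, roughly 2^90
print(required_log2, round(age_log2), round(radius_log2))
# 2000.0 59 89 -- the required resources overshoot the whole universe
# by a factor of more than 2^1900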

5.6.1 Attempts at refuting the PECTT One of the admirable traits of mankind is the refusal to accept limita- tions. In the best case this is manifested by people achieving long- standing “impossible” challenges such as heavier-than-air flight, putting a person on the moon, circumnavigating the globe, or even resolving Fermat’s Last Theorem. In the worst case it is manifested by people continually following the footsteps of previous failures to try to do proven-impossible tasks such as build a perpetual motion machine, trisect an angle with a compass and straightedge, or refute Bell’s in- equality. The Physical Extended Church Turing thesis (in its various forms) has attracted both types of people. Here are some physical devices that have been speculated to achieve computational tasks that cannot be done by not-too-large NAND-CIRC programs:

• Spaghetti sort: One of the first lower bounds that Computer Science students encounter is that sorting $n$ numbers requires making $\Omega(n \log n)$ comparisons. The "spaghetti sort" is a description of a proposed "mechanical computer" that would do this faster. The idea is that to sort $n$ numbers $x_1, \ldots, x_n$, we could cut $n$ spaghetti noodles into lengths $x_1, \ldots, x_n$, and then if we simply hold them together in our hand and bring them down to a flat surface, they will emerge in sorted order. There are a great many reasons why this is not truly a challenge to the PECTT hypothesis, and I will not ruin the reader's fun in finding them out by her or himself.

• Soap bubbles: One function $F : \{0,1\}^n \to \{0,1\}$ that is conjectured to require a large number of NAND lines to solve is the Euclidean Steiner Tree problem. This is the problem where one is given $m$ points in the plane $(x_1, y_1), \ldots, (x_m, y_m)$ (say with integer coordinates ranging from $1$ till $m$, and hence the list can be represented as a string of $n = O(m \log m)$ size) and some number $K$. The goal is to figure out whether it is possible to connect all the points by line segments of total length at most $K$. This function is conjectured to be hard because it is NP complete - a concept that we'll encounter later in this course - and it is in fact reasonable to conjecture that as $m$ grows, the number of NAND lines required to compute this function grows exponentially in $m$, meaning that the PECTT would predict that if $m$ is sufficiently large (such as few hundreds or so) then no physical device could compute $F$. Yet, some people claimed that there is in fact a very simple physical device that could solve this problem, and that can be constructed using some wooden pegs and soap. The idea is that if we take two glass plates, and put $m$ wooden pegs between them in the locations $(x_1, y_1), \ldots, (x_m, y_m)$, then bubbles will form whose edges touch those pegs in a way that will minimize the total energy, which turns out to be a function of the total length of the line segments. The problem with this device is that nature, just like people, often gets stuck in "local optima". That is, the resulting configuration will not be one that achieves the absolute minimum of the total energy but rather one that can't be improved with local changes. Aaronson has carried out actual experiments (see Fig. 5.8), and saw that while this device often is successful for three or four pegs, it starts yielding suboptimal results once the number of pegs grows beyond that.

• DNA computing. People have suggested using the properties of DNA to do hard computational problems. The main advantage of DNA is the ability to potentially encode a lot of information in a relatively small physical space, as well as compute on this information in a highly parallel manner. At the time of this writing, it was demonstrated that one can use DNA to store about $10^{16}$ bits of information in a region of radius about a millimeter, as opposed to about $10^{10}$ bits with the best known hard disk technology. This does not posit a real challenge to the PECTT but does suggest that one should be conservative about the choice of constant and not assume that current hard disk + silicon technologies are the absolute best possible. [Footnote 7: We were extremely conservative in the suggested parameters for the PECTT, having assumed that as many as $\ell_P^{-2} \cdot 10^{-6} \sim 10^{61}$ bits could potentially be stored in a millimeter radius region.]

[Figure 5.8: Scott Aaronson tests a candidate device for computing Steiner trees using soap bubbles.]

• Continuous/real computers. The physical world is often described using continuous quantities such as time and space, and people

have suggested that analog devices might have direct access to computing with real-valued quantities and would be inherently more powerful than discrete models such as NAND machines. Whether the "true" physical world is continuous or discrete is an open question. In fact, we do not even know how to precisely phrase this question, let alone answer it. Yet, regardless of the answer, it seems clear that the effort to measure a continuous quantity grows with the level of accuracy desired, and so there is no "free lunch" or way to bypass the PECTT using such machines (see also this paper). Related to that are proposals known as "hypercomputing" or "Zeno's computers" which attempt to use the continuity of time by doing the first operation in one second, the second one in half a second, the third operation in a quarter second and so on. These fail for a similar reason to the one guaranteeing that Achilles will eventually catch the tortoise despite the original Zeno's paradox.

• Relativity computer and time travel. The formulation above assumed the notion of time, but under the theory of relativity time is in the eye of the observer. One approach to solve hard problems is to leave the computer to run for a lot of time from his perspective, but to ensure that this is actually a short while from our perspective. One approach to do so is for the user to start the computer and then go for a quick jog at close to the speed of light before checking on its status. Depending on how fast one goes, few seconds from the point of view of the user might correspond to centuries in computer time (it might even finish updating its Windows operating system!). Of course the catch here is that the energy required from the user is proportional to how close one needs to get to the speed of light. A more interesting proposal is to use time travel via closed timelike curves (CTCs). In this case we could run an arbitrarily long computation by doing some calculations, remembering the current state, and then travelling back in time to continue where we left off. Indeed, if CTCs exist then we'd probably have to revise the PECTT (though in this case I will simply travel back in time and edit these notes, so I can claim I never conjectured it in the first place…)

• Humans. Another computing system that has been proposed as a counterexample to the PECTT is a 3 pound computer of about 0.1m radius, namely the human brain. Humans can walk around, talk, feel, and do other things that are not commonly done by NAND-CIRC programs, but can they compute partial functions that NAND-CIRC programs cannot? There are certainly computational tasks that at the moment humans do better than computers (e.g., play some video games, at the moment), but based on our current understanding of the brain, humans (or other animals) have no inherent computational advantage over computers. The brain has about $10^{11}$ neurons, each operating at a speed of about $1000$ operations per second. Hence a rough first approximation is that a Boolean circuit of about $10^{14}$ gates could simulate one second of a brain's activity. [Footnote 8: This is a very rough approximation that could be wrong to a few orders of magnitude in either direction. For one, there are other structures in the brain apart from neurons that one might need to simulate, hence requiring higher overhead. On the other hand, it is by no means clear that we need to fully simulate the brain in order to achieve the same computational tasks that it does.] Note that the fact that such a circuit (likely) exists does not mean it is easy to find it. After all, constructing this circuit took evolution billions of years. Much of the recent efforts in artificial intelligence research is focused on finding programs that replicate some of the brain's capabilities; while they take massive computational effort to discover, these programs often turn out to be much smaller than the pessimistic estimates above. For example, at the time of this writing, Google's neural network for machine translation has about $10^4$ nodes (and can be simulated by a NAND-CIRC program of comparable size). Philosophers, priests and many others have since time immemorial argued that there is something about humans that cannot be captured by mechanical devices such as computers; whether or not that is the case, the evidence is thin that humans can perform computational tasks that are inherently impossible to achieve by computers of similar complexity. [Footnote 9: There are some well known scientists that have advocated that humans have inherent computational advantages over computers. See also this.]

• Quantum computation. The most compelling attack on the Physical Extended Church-Turing Thesis comes from the notion of quantum computing. The idea was initiated by the observation that systems with strong quantum effects are very hard to simulate on a computer. Turning this observation on its head, people have proposed using such systems to perform computations that we do not know how to do otherwise. At the time of this writing, scalable quantum computers have not yet been built, but it is a fascinating possibility, and one that does not seem to contradict any known law of nature. We will discuss quantum computing in much more detail in Chapter 23. Modeling quantum computation involves extending the model of Boolean circuits into quantum circuits that have one more (very special) gate. However, the main takeaway is that while quantum computing does suggest we need to amend the PECTT, it does not require a complete revision of our worldview. Indeed, almost all of the content of this book remains the same regardless of whether the underlying computational model is Boolean circuits or quantum circuits.

R Remark 5.14 — Physical Extended Church-Turing Thesis and Cryptography. While even the precise phrasing of the PECTT, let alone understanding its correctness, is still a subject of active research, some variants of it are already implicitly assumed in practice. Governments, companies, and individuals currently rely on cryptography to protect some of their most precious assets, including state secrets, control of weapon systems and critical infrastructure, securing commerce, and protecting the confidentiality of personal information. In applied cryptography, one often encounters statements such as "cryptosystem $X$ provides 128 bits of security". What such a statement really means is that (a) it is conjectured that there is no Boolean circuit (or, equivalently, a NAND-CIRC program) of size much smaller than $2^{128}$ that can break $X$, and (b) we assume that no other physical mechanism can do better, and hence it would take roughly a $2^{128}$ amount of "resources" to break $X$. We say "conjectured" and not "proved" because, while we can phrase the statement that breaking the system cannot be done by an $s$-gate circuit as a precise mathematical conjecture, at the moment we are unable to prove such a statement for any non-trivial cryptosystem. This is related to the P vs NP question we will discuss in future chapters. We will explore Cryptography in Chapter 21.

Chapter Recap

• We can think of programs both as describing a process, as well as simply a list of symbols that can be considered as data that can be fed as input to other programs.

• We can write a NAND-CIRC program that evaluates arbitrary NAND-CIRC programs (or equivalently a circuit that evaluates other circuits). Moreover, the efficiency loss in doing so is not too large.

• We can even write a NAND-CIRC program that evaluates programs in other programming languages such as Python, C, Lisp, Java, Go, etc.

• By a leap of faith, we could hypothesize that the number of gates in the smallest circuit that computes a function $f$ captures roughly the amount of physical resources required to compute $f$. This statement is known as the Physical Extended Church-Turing Thesis (PECTT).

• Boolean circuits (or equivalently AON-CIRC or NAND-CIRC programs) capture a surprisingly wide array of computational models. The strongest currently known challenge to the PECTT comes from the potential for using quantum mechanical effects to speed-up computation, a model known as quantum computers.

[Figure 5.9: A finite computational task is specified by a function $f : \{0,1\}^n \to \{0,1\}^m$. We can model a computational process using Boolean circuits (of varying gate sets) or straight-line programs. Every function can be computed by many programs. We say that $f \in SIZE_{n,m}(s)$ if there exists a NAND circuit of at most $s$ gates (equivalently a NAND-CIRC program of at most $s$ lines) that computes $f$. Every function $f : \{0,1\}^n \to \{0,1\}^m$ can be computed by a circuit of $O(m \cdot 2^n/n)$ gates. Many functions such as multiplication, addition, solving linear equations, computing the shortest path in a graph, and others, can be computed by circuits of much fewer gates. In particular there is an $O(s^2 \log s)$-size circuit that computes the map $C, x \mapsto C(x)$ where $C$ is a string describing a circuit of $s$ gates. However, the counting argument shows there do exist some functions $f : \{0,1\}^n \to \{0,1\}^m$ that require $\Omega(m \cdot 2^n/n)$ gates to compute.]

5.7 RECAP OF PART I: FINITE COMPUTATION

This chapter concludes the first part of this book that deals with finite computation (computing functions that map a fixed number of Boolean inputs to a fixed number of Boolean outputs). The main take-aways from Chapter 3, Chapter 4, and Chapter 5 are as follows (see also Fig. 5.9):

• We can formally define the notion of a function $f : \{0,1\}^n \to \{0,1\}^m$ being computable using $s$ basic operations. Whether these operations are AND/OR/NOT, NAND, or some other universal basis does not make much difference. We can describe such a computation either using a circuit or using a straight-line program.

• We define $SIZE_{n,m}(s)$ to be the set of functions that are computable by NAND circuits of at most $s$ gates. This set is equal to the set of functions computable by a NAND-CIRC program of at most $s$ lines; up to a constant factor in $s$ (which we will not care about), this is also the same as the set of functions that are computable by a Boolean circuit of at most $s$ AND/OR/NOT gates. The class $SIZE_{n,m}(s)$ is a set of functions, not of programs/circuits.

• Every function $f : \{0,1\}^n \to \{0,1\}^m$ can be computed using a circuit of at most $O(m \cdot 2^n/n)$ gates. Some functions require at least $\Omega(m \cdot 2^n/n)$ gates. We define $SIZE_{n,m}(s)$ to be the set of functions from $\{0,1\}^n$ to $\{0,1\}^m$ that can be computed using at most $s$ gates.

• We can describe a circuit/program as a string. For every $s$, there is a universal circuit/program $U_s$ that can evaluate programs of length $s$ given their description as strings. We can use this representation also to count the number of circuits of at most $s$ gates and hence prove that some functions cannot be computed by circuits of smaller-than-exponential size.

• If there is a circuit of $s$ gates that computes a function $f$, then we can build a physical device to compute $f$ using $s$ basic components (such as transistors). The "Physical Extended Church-Turing Thesis" postulates that the reverse direction is true as well: if $f$ is a function for which every circuit requires at least $s$ gates then every physical device to compute $f$ will require about $s$ "physical resources". The main challenge to the PECTT is quantum computing, which we will discuss in Chapter 23.

Sneak preview: In the next part we will discuss how to model computational tasks on unbounded inputs, which are specified using functions $F : \{0,1\}^* \to \{0,1\}^*$ (or $F : \{0,1\}^* \to \{0,1\}$) that can take an unbounded number of Boolean inputs.

5.8 EXERCISES

Exercise 5.1 Which one of the following statements is false:
a. There is an $O(s^3)$ line NAND-CIRC program that given as input a program $P$ of $s$ lines in the list-of-tuples representation computes the output of $P$ when all its inputs are equal to $1$.
b. There is an $O(s^3)$ line NAND-CIRC program that given as input a program $P$ of $s$ characters encoded as a string of $7s$ bits using the ASCII encoding, computes the output of $P$ when all its inputs are equal to $1$.
c. There is an $O(\sqrt{s})$ line NAND-CIRC program that given as input a program $P$ of $s$ lines in the list-of-tuples representation computes the output of $P$ when all its inputs are equal to $1$.
■

Exercise 5.2 — Equals function. For every $k \in \mathbb{N}$, show that there is an $O(k)$ line NAND-CIRC program that computes the function EQUALS$_k : \{0,1\}^{2k} \to \{0,1\}$ where EQUALS$_k(x, x') = 1$ if and only if $x = x'$.
■

Exercise 5.3 — Equal to constant function. For every $k \in \mathbb{N}$ and $x' \in \{0,1\}^k$, show that there is an $O(k)$ line NAND-CIRC program that computes the function EQUALS$_{x'} : \{0,1\}^k \to \{0,1\}$ that on input $x \in \{0,1\}^k$ outputs $1$ if and only if $x = x'$.
■

Exercise 5.4 — Counting lower bound for multibit functions. Prove that there exists a number $\delta > 0$ such that for every $n, m$ there exists a function $f : \{0,1\}^n \to \{0,1\}^m$ that requires at least $\delta m \cdot 2^n/n$ NAND gates to compute. See footnote for hint. [Footnote 10: How many functions from $\{0,1\}^n$ to $\{0,1\}^m$ exist?]
■

Exercise 5.5 — Size hierarchy theorem for multibit functions. Prove that there exists a number $C$ such that for every $n, m$ and $n + m < s < m \cdot 2^n/(Cn)$ there exists a function $f \in SIZE_{n,m}(C \cdot s) \setminus SIZE_{n,m}(s)$. See footnote for hint. [Footnote 11: Follow the proof of Theorem 5.5, replacing the use of the counting argument with Exercise 5.4.]
■

Exercise 5.6 — Efficient representation of circuits and a tighter counting upper bound. Use the ideas of Remark 5.4 to show that for every $\epsilon > 0$ and sufficiently large $s, n, m$,

$$|SIZE_{n,m}(s)| < 2^{(2+\epsilon)s\log(s+n) + \log(n+m) + \log s}. \qquad (5.11)$$

Conclude that the implicit constant in Theorem 5.2 can be made arbitrarily close to $2$. See footnote for hint. [Footnote 12: Using the adjacency list representation, a graph with $n$ in-degree zero vertices and $s$ in-degree two vertices can be represented using roughly $2s\log(s+n) \le 2s(\log s + O(1))$ bits. The labeling of the input and output vertices can be specified by a list of labels in $[n]$ and labels in $[m]$.]
■

Exercise 5.7 — Tighter counting lower bound. Prove that for every $\delta < 1/2$, if $n$ is sufficiently large then there exists a function $f : \{0,1\}^n \to \{0,1\}$ such that $f \notin SIZE_{n,1}\left(\frac{\delta 2^n}{n}\right)$. See footnote for hint. [Footnote 13: Use the results of Exercise 5.6 and the fact that in this regime $m = 1$ and $n \ll s$.]
■

Exercise 5.8 — Random functions are hard. Suppose $n > 1000$ and that we choose a function $F : \{0,1\}^n \to \{0,1\}$ at random, choosing for every $x \in \{0,1\}^n$ the value $F(x)$ to be the result of tossing an independent unbiased coin. Prove that the probability that there is a $2^n/(1000n)$ line program that computes $F$ is at most $2^{-100}$. See footnote for hint. [Footnote 14: An equivalent way to say this is that you need to prove that the set of functions that can be computed using at most $2^n/(1000n)$ lines has fewer than $2^{-100} \cdot 2^{2^n}$ elements. Can you see why?]
■

Exercise 5.9 The following is a tuple representing a NAND program: $(3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6), (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10)))$.

1. Write a table with the eight values $P(000)$, $P(001)$, $P(010)$, $P(011)$, $P(100)$, $P(101)$, $P(110)$, $P(111)$ of the program $P$ in this order.

2. Describe what the program $P$ does in words.

Exercise 5.10 — EVAL with XOR. For every sufficiently large $n$, let $E_n : \{0,1\}^{n^2} \to \{0,1\}$ be the function that takes an $n^2$-length string that encodes a pair $(P, x)$ where $x \in \{0,1\}^n$ and $P$ is a NAND program of $n$ inputs, a single output, and at most $n^{1.1}$ lines, and returns the output of $P$ on $x$. That is, $E_n(P, x) = P(x)$. [Footnote 15: Note that if $n$ is big enough, then it is easy to represent such a pair using $n^2$ bits, since we can represent the program using $O(n^{1.1} \log n)$ bits, and we can always pad our representation to have exactly $n^2$ length.]

Prove that for every sufficiently large $n$, there does not exist an XOR circuit $C$ that computes the function $E_n$, where a XOR circuit has the XOR gate as well as the constants $0$ and $1$ (see Exercise 3.5). That is, prove that there is some constant $n_0$ such that for every $n > n_0$ and XOR circuit $C$ of $n^2$ inputs and a single output, there exists a pair $(P, x)$ such that $C(P, x) \ne E_n(P, x)$.
■

Exercise 5.11 — Learning circuits (challenge, optional, assumes more background). (This exercise assumes background in probability theory and/or machine learning that you might not have at this point. Feel free to come back to it at a later point and in particular after going over Chapter 18.) In this exercise we will use our bound on the number of circuits of size $s$ to show that (if we ignore the cost of computation) every such circuit can be learned from not too many training samples. Specifically, if we find a size-$s$ circuit that classifies correctly a training set of $O(s \log s)$ samples from some distribution $D$, then it is guaranteed to do well on the whole distribution $D$. Since Boolean circuits model very many physical processes (maybe even all of them, if the (controversial) physical extended Church-Turing thesis is true), this shows that all such processes could be learned as well (again, ignoring the computation cost of finding a classifier that does well on the training data).

Let $D$ be any probability distribution over $\{0,1\}^n$ and let $C$ be a NAND circuit with $n$ inputs, one output, and size $s \ge n$. Prove that there is some constant $c$ such that with probability at least $0.999$ the following holds: if $m = cs\log s$ and $x^0, \ldots, x^{m-1}$ are chosen independently from $D$, then for every circuit $C'$ such that $C'(x^i) = C(x^i)$ for every $i \in [m]$, $\Pr_{x \sim D}[C'(x) = C(x)] \ge 0.99$. In other words, if $C'$ is a so-called "empirical risk minimizer" that agrees with $C$ on all the training examples $x^0, \ldots, x^{m-1}$, then it will also agree with $C$ with high probability for samples drawn from the distribution $D$ (i.e., it "generalizes", to use Machine-Learning lingo). See footnote for hint. [Footnote 16: Use our bound on the number of programs/circuits of size $s$ (Theorem 5.2), as well as the Chernoff Bound (Theorem 18.12) and the union bound.]
■

5.9 BIBLIOGRAPHICAL NOTES

The EVAL function is usually known as a universal circuit. The implementation we describe is not the most efficient known. Valiant [Val76] first showed a universal circuit of $O(n \log n)$ size where $n$ is the size of the input. Universal circuits have seen in recent years new motivations due to their applications for cryptography, see [LMS16; GKS17].

While we've seen that "most" functions mapping $n$ bits to one bit require circuits of exponential size $\Omega(2^n/n)$, we actually do not know of any explicit function for which we can prove that it requires, say, at least $n^{100}$ or even $100n$ size. At the moment, the strongest such lower bound we know is that there are quite simple and explicit $n$-variable functions that require at least $(5 - o(1))n$ lines to compute, see this paper of Iwama et al as well as this more recent work of Kulikov et al. Proving lower bounds for restricted models of circuits is an extremely interesting research area, for which Jukna's book [Juk12] (see also Wegener [Weg87]) provides a very good introduction and overview.

I learned of the proof of the size hierarchy theorem (Theorem 5.5) from Sasha Golovnev. Scott Aaronson's blog post on how information is physical is a good discussion on issues related to the physical extended Church-Turing Thesis. Aaronson's survey on NP complete problems and physical reality [Aar05] discusses these issues as well, though it might be easier to read after we reach Chapter 15 on NP and NP-completeness.

II UNIFORM COMPUTATION

Learning Objectives:
• Define functions on unbounded length inputs, that cannot be described by a finite size table of inputs and outputs.
• Equivalence with the task of deciding membership in a language.
• Deterministic finite automatons (optional): A simple example for a model for unbounded computation.
• Equivalence with regular expressions.

6 Functions with Infinite domains, Automata, and Regular expressions

"An algorithm is a finite answer to an infinite number of questions.", Attributed to Stephen Kleene.

The model of Boolean circuits (or equivalently, the NAND-CIRC programming language) has one very significant drawback: a Boolean circuit can only compute a finite function $f$, and in particular since every gate has two inputs, a size $s$ circuit can compute on an input of length at most $2s$. This does not capture our intuitive notion of an algorithm as a single recipe to compute a potentially infinite function. For example, the standard elementary school multiplication algorithm is a single algorithm that multiplies numbers of all lengths, but yet we cannot express this algorithm as a single circuit, but rather need a different circuit (or equivalently, a NAND-CIRC program) for every input length (see Fig. 6.1).

In this chapter we extend our definition of computational tasks to consider functions with the unbounded domain of $\{0,1\}^*$. We focus on the question of defining what tasks to compute, mostly leaving the question of how to do so to later chapters, where we will see Turing Machines and other computational models for computing on unbounded inputs. In this chapter we will see however one example for a simple and restricted model of computation - deterministic finite automata (DFAs).

[Figure 6.1: Once you know how to multiply multi-digit numbers, you can do so for every number $n$ of digits, but if you had to describe multiplication using Boolean circuits or NAND-CIRC programs, you would need a different program/circuit for every length $n$ of the input.]

This chapter: A non-mathy overview
In this chapter we discuss functions that take as input strings of arbitrary length. Such functions could have outputs that are long strings as well, but the case of Boolean functions, where the output is a single bit, is of particular importance. The task of computing a Boolean function is equivalent to the task of deciding a language. We will also define the restriction of a function over unbounded length strings to a finite length, and how this restriction can be computed by Boolean circuits. In the second half of this chapter we discuss finite automata, which is a model for computing functions of unbounded length. This model is not as powerful as Python or other general-purpose programming languages, but can serve as an introduction to these more general models. We also show a beautiful result - the functions computable by finite automata are exactly the ones that correspond to regular expressions. However, the reader can also feel free to skip automata and go straight to our discussion of Turing Machines in Chapter 7.

6.1 FUNCTIONS WITH INPUTS OF UNBOUNDED LENGTH

Up until now, we considered the computational task of mapping some string of length $n$ into a string of length $m$. However, in general computational tasks can involve inputs of unbounded length. For example, the following Python function computes the function XOR $: \{0,1\}^* \to \{0,1\}$, where XOR$(x)$ equals $1$ iff the number of $1$'s in $x$ is odd. (In other words, XOR$(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x \in \{0,1\}^*$.) As simple as it is, the XOR function cannot be computed by a Boolean circuit. Rather, for every $n$, we can compute XOR$_n$ (the restriction of XOR to $\{0,1\}^n$) using a different circuit (e.g., see Fig. 6.2).

def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''

    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
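The point is that this single piece of code handles inputs of every length. For example (an illustrative run):

print(XOR([0, 1]))        # 1
print(XOR([1, 1]))        # 0
print(XOR([1, 0, 1, 1]))  # 1 -- the same code works for length 4, or any length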

Previously in this book we studied the computation of finite functions $f : \{0,1\}^n \to \{0,1\}^m$. Such a function $f$ can always be described by listing all the $2^n$ values it takes on inputs $x \in \{0,1\}^n$. In this chapter we consider functions such as XOR that take inputs of unbounded size. While we can describe XOR using a finite number of symbols (in fact we just did so above), it takes infinitely many possible inputs and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on and so forth. To contrast with the finite case, we will sometimes call a function $F : \{0,1\}^* \to \{0,1\}$ (or $F : \{0,1\}^* \to \{0,1\}^*$) infinite. However, this does not mean that $F$ takes as input strings of infinite length! It just means that $F$ can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of $F$ on different inputs.

[Figure 6.2: The NAND circuit and NAND-CIRC program for computing the XOR of $5$ bits. Note how the circuit for XOR$_5$ merely repeats four times the circuit to compute the XOR of $2$ bits.]

Big Idea 8: A function $F : \{0,1\}^* \to \{0,1\}^*$ specifies the computational task mapping an input $x \in \{0,1\}^*$ into the output $F(x)$.

As we've seen before, restricting attention to functions that use binary strings as inputs and outputs does not detract from our generality, since other objects, including numbers, lists, matrices, images, videos, and more, can be encoded as binary strings. As before, it is important to differentiate between specification and implementation. For example, consider the following function:

$$TWINP(x) = \begin{cases} 1 & \exists_{p \in \mathbb{N}} \text{ s.t. } p, p+2 \text{ are primes and } p > |x| \\ 0 & \text{otherwise} \end{cases} \qquad (6.1)$$

This is a mathematically well-defined function. For every $x$, TWINP$(x)$ has a unique value which is either $0$ or $1$. However, at the moment no one knows of a Python program that computes this function. The Twin prime conjecture posits that for every $n$ there exists $p > n$ such that both $p$ and $p+2$ are primes. If this conjecture is true, then TWINP is easy to compute indeed - the program def T(x): return 1 will do the trick. However, mathematicians have tried unsuccessfully to prove this conjecture since 1849. That said, whether or not we know how to implement the function TWINP, the definition above provides its specification.

6.1.1 Varying inputs and outputs
Many of the functions we are interested in take more than one input. For example the function

$$MULT(x, y) = x \cdot y \qquad (6.2)$$

that takes the binary representation of a pair of integers $x, y \in \mathbb{N}$ and outputs the binary representation of their product $x \cdot y$. However, since we can represent a pair of strings as a single string, we will consider functions such as MULT as mapping $\{0,1\}^*$ to $\{0,1\}^*$. We will typically not be concerned with low-level details such as the precise way to represent a pair of integers as a string, since essentially all choices will be equivalent for our purposes.

Another example for a function we want to compute is

$$PALINDROME(x) = \begin{cases} 1 & \forall_{i \in [|x|]}\; x_i = x_{|x|-i} \\ 0 & \text{otherwise} \end{cases} \qquad (6.3)$$

PALINDROME has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.

"Booleanizing" functions. Sometimes it might be convenient to obtain a Boolean variant for a non-Boolean function. For example, the following is a Boolean variant of MULT.

$\text{BMULT}(x, y, i) = \begin{cases} i^{th} \text{ bit of } x \cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases}$ (6.4)

If we can compute BMULT via any programming language such as Python, C, Java, etc., then we can compute MULT as well, and vice versa.

Solved Exercise 6.1 — Booleanizing general functions. Show that for every function $F : \{0,1\}^* \to \{0,1\}^*$ there exists a Boolean function $\text{BF} : \{0,1\}^* \to \{0,1\}$ such that a Python program to compute BF can be transformed into a program to compute $F$ and vice versa. ■

Solution: For every $F : \{0,1\}^* \to \{0,1\}^*$, we can define

$\text{BF}(x, i, b) = \begin{cases} F(x)_i & i < |F(x)|,\; b = 0 \\ 1 & i < |F(x)|,\; b = 1 \\ 0 & \text{otherwise} \end{cases}$ (6.5)

to be the function that on input $x \in \{0,1\}^*, i \in \mathbb{N}, b \in \{0,1\}$ outputs the $i^{th}$ bit of $F(x)$ if $b = 0$ and $i < |F(x)|$. If $b = 1$ then $\text{BF}(x, i, b)$ outputs $1$ iff $i < |F(x)|$, and hence this allows to compute the length of $F(x)$. Computing $F$ using BF can be done as follows:

def F(x):
    res = []
    i = 0
    while BF(x, i, 1):
        res.append(BF(x, i, 0))
        i += 1
    return res

6.1.2 Formal Languages

For every Boolean function $F : \{0,1\}^* \to \{0,1\}$, we can define the set $L = \{x \,|\, F(x) = 1\}$ of strings on which $F$ outputs $1$. Such sets are known as languages. This name is rooted in formal language theory as pursued by linguists such as Noam Chomsky. A formal language is a subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet $\Sigma$). The membership or decision problem for a language $L$ is the task of determining, given $x \in \{0,1\}^*$, whether or not $x \in L$. If we can compute the function $F$ then we can decide membership in the language $L$ and vice versa. Hence, many texts such as [Sip97] refer to the task of computing a Boolean function $F$ as "deciding a language" $L$. In this book we mostly describe computational tasks using the function notation, which is easier to generalize to computation with more than one bit of output. However, since the language terminology is so popular in the literature, we will sometimes mention it.

6.1.3 Restrictions of functions

If $F : \{0,1\}^* \to \{0,1\}$ is a Boolean function and $n \in \mathbb{N}$ then the restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function $f : \{0,1\}^n \to \{0,1\}$ such that $f(x) = F(x)$ for every $x \in \{0,1\}^n$. That is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be computed by a Boolean circuit, implying the following theorem:

Theorem 6.1 — Circuit collection for every infinite function. Let $F : \{0,1\}^* \to \{0,1\}$. Then there is a collection $\{C_n\}_{n \in \{1,2,\ldots\}}$ of circuits such that for every $n > 0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of length $n$.

Proof. This is an immediate corollary of the universality of Boolean circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of this circuit is at most $c \cdot 2^n / n$ gates for some constant $c \leq 10$. ■

In particular, Theorem 6.1 implies that there exists such a circuit collection $\{C_n\}$ even for the TWINP function we described before,

despite the fact that we don't know of any program to compute it. Indeed, this is not that surprising: for every particular $n \in \mathbb{N}$, $\text{TWINP}_n$ is either the constant zero function or the constant one function, both of which can be computed by very simple Boolean circuits. Hence a collection $\{C_n\}$ of circuits that computes TWINP certainly exists. The difficulty in computing TWINP using Python or any other programming language arises from the fact that we don't know for each particular $n$ what is the circuit $C_n$ in this collection.

6.2 DETERMINISTIC FINITE AUTOMATA (OPTIONAL)

All our computational models so far - Boolean circuits and straight-line programs - were only applicable to finite functions. In Chapter 7 we will present Turing Machines, which are the central models of computation for functions of unbounded input length. However, in this section we will present the more basic model of deterministic finite automata (DFA). Automata can serve as a good stepping-stone for Turing machines, though they will not be used much in later parts of this book, and so the reader can feel free to skip ahead to Chapter 7. DFAs turn out to be equivalent in power to regular expressions, which are a powerful mechanism for specifying patterns that is widely used in practice.

Our treatment of automata is quite brief. There are plenty of resources that help you get more comfortable with DFAs. In particular, Chapter 1 of Sipser's book [Sip97] contains an excellent exposition of this material. There are also many websites with online simulators for automata, as well as translators from regular expressions to automata and vice versa (see for example here and here).

At a high level, an algorithm is a recipe for computing an output from an input via a combination of the following steps:

1. Read a bit from the input
2. Update the state (working memory)
3. Stop and produce an output

For example, recall the Python program that computes the XOR function:

def XOR(X):
    '''Takes list X of 0's and 1's
    Outputs 1 if the number of 1's is odd
    and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result

In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is $1$ and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is of finite size. (Indeed, in this case result can either be $0$ or $1$.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is finite state machines).

We can think of such an algorithm as a "machine" that can be in one of $C$ states, for some constant $C$. The machine starts in some initial state, and then reads its input $x \in \{0,1\}^*$ one bit at a time. Whenever the machine reads a bit $\sigma \in \{0,1\}$, it transitions into a new state based on $\sigma$ and its prior state. The output of the machine is based on the final state. Every single-pass constant-memory algorithm corresponds to such a machine. If an algorithm uses $c$ bits of memory, then the contents of its memory form a string of length $c$. Since there are $2^c$ such strings, at any point in the execution, such an algorithm can be in one of $2^c$ states.

We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule will be of the form "If the DFA is in state $v$ and the bit read from the input is $\sigma$ then the new state is $v'$". At the end of the computation we will also have a rule of the form "If the final state is one of the following ... then output $1$, otherwise output $0$". For example, the Python program above can be represented by a two-state automaton for computing XOR of the following form:

• Initialize in state $0$.
• For every state $s \in \{0,1\}$ and input bit $\sigma$ read, if $\sigma = 1$ then change to state $1 - s$, otherwise stay in state $s$.
• At the end output $1$ iff $s = 1$.

We can also describe a $C$-state DFA as a labeled graph of $C$ vertices. For every state $s$ and bit $\sigma$, we add a directed edge labeled with $\sigma$ between $s$ and the state $s'$ such that if the DFA is at state $s$ and reads $\sigma$ then it transitions to $s'$. (If the state stays the same then this edge will be a self loop; similarly, if $s$ transitions to $s'$ in both the case $\sigma = 0$ and $\sigma = 1$ then the graph will contain two parallel edges.) We also label the set $\mathcal{S}$ of states on which the automaton will output $1$ at the end of the computation. This set is known as the set of accepting states. See Fig. 6.3 for the graphical representation of the XOR automaton.

Formally, a DFA is specified by (1) the table of the $C \cdot 2$ rules, which can be represented as a transition function $T$ that maps a state $s \in [C]$ and bit $\sigma \in \{0,1\}$ to the state $s' \in [C]$ which the DFA will transition to from state $s$ on input $\sigma$, and (2) the set $\mathcal{S}$ of accepting states. This leads to the following definition.

Figure 6.3: A deterministic finite automaton that computes the XOR function. It has two states $0$ and $1$, and when it observes $\sigma$ it transitions from $v$ to $v \oplus \sigma$.

Definition 6.2 — Deterministic Finite Automaton. A deterministic finite automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T, \mathcal{S})$ with $T : [C] \times \{0,1\} \to [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known as the transition function of the DFA, and the set $\mathcal{S}$ is known as the set of accepting states.

Let $F : \{0,1\}^* \to \{0,1\}$ be a Boolean function with the infinite domain $\{0,1\}^*$. We say that $(T, \mathcal{S})$ computes a function $F$ if for every $n \in \mathbb{N}$ and $x \in \{0,1\}^n$, if we define $s_0 = 0$ and $s_{i+1} = T(s_i, x_i)$ for every $i \in [n]$, then

$s_n \in \mathcal{S} \Leftrightarrow F(x) = 1$ (6.6)
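To make Definition 6.2 concrete, the following is a minimal Python sketch of executing a DFA on an input. The dictionary-based encoding of the transition function and the names run_dfa and T_xor are our own illustrative choices, not the book's notation:

def run_dfa(T, S, x):
    """Run the DFA (T, S) of Definition 6.2 on the input x.
    T maps (state, bit) to a state; S is the set of accepting states."""
    s = 0  # in this book's convention the initial state is always 0
    for bit in x:
        s = T[(s, bit)]  # s_{i+1} = T(s_i, x_i)
    return 1 if s in S else 0  # output 1 iff the final state is in S

# The two-state XOR automaton described above:
T_xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
assert run_dfa(T_xor, {1}, [1, 0, 1, 1]) == 1  # three 1's: XOR is 1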

P Make sure not to confuse the transition function of an automaton ($T$ in Definition 6.2), which is a finite function specifying the table of "rules" which it follows, with the function the automaton computes ($F$ in Definition 6.2), which is an infinite function.

R Remark 6.3 — Definitions in other texts. Deterministic finite automata can be defined in several equivalent ways. In particular Sipser [Sip97] defines a DFA as a five-tuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is the set of states, $\Sigma$ is the alphabet, $\delta$ is the transition function, $q_0$ is the initial state, and $F$ is the set of accepting states. In this book the set of states is always of the form $Q = \{0, \ldots, C-1\}$ and the initial state is always $q_0 = 0$, but this makes no difference to the computational power of these models. Also, we restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$.

Solved Exercise 6.2 — DFA for $(010)^*$. Prove that there is a DFA that computes the following function $F$:

$F(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } \forall_{i \in [|x|/3]}\; x_{3i} x_{3i+1} x_{3i+2} = 010 \\ 0 & \text{otherwise} \end{cases}$ (6.7)

■

Solution: When asked to construct a deterministic finite automaton, it helps to start by thinking of it as a single-pass constant-memory algorithm (for example, a Python program) and then translate this program into a DFA. Here is a simple Python program for computing $F$:

def F(X):
    '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
    if len(X) % 3 != 0:
        return False
    ultimate = 0
    penultimate = 1
    antepenultimate = 0
    for idx, b in enumerate(X):
        antepenultimate = penultimate
        penultimate = ultimate
        ultimate = b
        if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0, 1, 0)):
            return False
    return True

Since we keep three Boolean variables, the working memory can be in one of $2^3 = 8$ configurations, and so the program above can be directly translated into an $8$-state DFA. While this is not needed to solve the question, by examining the resulting DFA we can see that we can merge some states together and obtain a $4$-state automaton, described in Fig. 6.4. See also Fig. 6.5, which depicts the execution of this DFA on a particular input. ■
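Since the figure itself is not reproduced here, the following is a plausible transition table for a $4$-state automaton consistent with the caption of Fig. 6.4 (state $0$ is both the starting state and the only accepting state); the particular state numbering and the rejecting "sink" state $3$ are our own assumptions, and run_dfa is the helper from the earlier sketch:

# States: 0 = start of a block (accepting), 1 = read "0" of a block,
# 2 = read "01" of a block, 3 = rejecting sink state.
T_010 = {(0, 0): 1, (0, 1): 3,
         (1, 0): 3, (1, 1): 2,
         (2, 0): 0, (2, 1): 3,
         (3, 0): 3, (3, 1): 3}

assert run_dfa(T_010, {0}, [0, 1, 0, 0, 1, 0]) == 1  # two copies of 010
assert run_dfa(T_010, {0}, [0, 1]) == 0              # incomplete block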

6.2.1 Anatomy of an automaton (finite vs. unbounded)

Now that we consider computational tasks with unbounded input sizes, it is crucially important to distinguish between the components of our algorithm that are fixed in size, and the components that grow with the size of the input. For the case of DFAs these are the following:

Constant size components: Given a DFA $A$, the following quantities are fixed independent of the input size:

• The number $C$ of states in $A$.

• The transition function $T$ (which has $2C$ inputs, and so can be specified by a table of $2C$ rows, each entry in which is a number in $[C]$).

Figure 6.4: A DFA that outputs $1$ only on inputs $x \in \{0,1\}^*$ that are a concatenation of zero or more copies of $010$. The state $0$ is both the starting state and the only accepting state. The table denotes the transition function $T$, which maps the current state and symbol read to the new state.

• The set $\mathcal{S} \subseteq [C]$ of accepting states. There are at most $2^C$ such sets, each of which can be described by a string in $\{0,1\}^C$ specifying which state is in $\mathcal{S}$ and which isn't.

Together the above means that we can fully describe an automaton using finitely many symbols. This is a property we require out of any notion of "algorithm": we should be able to write down a complete specification of how it produces an output from an input.

Components of unbounded size: The following quantities relating to a DFA are not bounded by any constant. We stress that these are still finite for any given input.

• The length of the input $x \in \{0,1\}^*$ that the DFA is provided with. It is always finite, but not bounded.

• The number of steps that the DFA takes can grow with the length of the input. Indeed, a DFA makes a single pass on the input and so it takes exactly $|x|$ steps on an input $x \in \{0,1\}^*$.

Figure 6.5: Execution of the DFA of Fig. 6.4. The number of states and the size of the transition function are bounded, but the input can be arbitrarily long. If the DFA is at state $s$ and observes the value $\sigma$ then it moves to the state $T(s, \sigma)$. At the end of the execution the DFA accepts iff the final state is in $\mathcal{S}$.

6.2.2 DFA-computable functions

We say that a function $F : \{0,1\}^* \to \{0,1\}$ is DFA computable if there exists some DFA that computes $F$. In Chapter 4 we saw that every finite function is computable by some Boolean circuit. Thus, at this point you might expect that every infinite function is computable by some DFA. However, this is very much not the case. We will soon see some simple examples of infinite functions that are not computable by DFAs, but for starters, let's prove that such functions exist.

Theorem 6.4 — DFA-computable functions are countable. Let DFACOMP be the set of all Boolean functions $F : \{0,1\}^* \to \{0,1\}$ such that there exists a DFA computing $F$. Then DFACOMP is countable.

Proof Idea: Every DFA can be described by a finite length string, which yields an onto map from $\{0,1\}^*$ to DFACOMP: namely the function that maps a string describing an automaton $A$ to the function that it computes. ⋆

Proof of Theorem 6.4. Every DFA can be described by a finite string, representing the transition function $T$ and the set of accepting states, and every DFA $A$ computes some function $F : \{0,1\}^* \to \{0,1\}$. Thus we can define the following function $StDC : \{0,1\}^* \to \text{DFACOMP}$:

$StDC(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ \text{ONE} & \text{otherwise} \end{cases}$ (6.8)

where $\text{ONE} : \{0,1\}^* \to \{0,1\}$ is the constant function that outputs $1$ on all inputs (and is a member of DFACOMP). Since by definition every function in DFACOMP is computable by some automaton, $StDC$ is an onto function from $\{0,1\}^*$ to DFACOMP, which means that DFACOMP is countable (see Section 2.4.2). ■
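To illustrate the first sentence of the proof, here is a minimal Python sketch of one possible way to describe a DFA by a finite string. The comma-separated encoding is an arbitrary choice of ours; any injective encoding would serve the argument:

def encode_dfa(C, T, S):
    """Encode a C-state DFA (T, S) as a string.
    For each state s, list T(s, 0), T(s, 1), and an accept bit."""
    parts = []
    for s in range(C):
        parts += [str(T[(s, 0)]), str(T[(s, 1)]), "1" if s in S else "0"]
    return ",".join(parts)

# The two-state XOR automaton from before encodes to a short string:
T_xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(encode_dfa(2, T_xor, {1}))  # prints "0,1,0,1,0,1"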

Since the set of all Boolean functions is uncountable, we get the following corollary:

Theorem 6.5 — Existence of DFA-uncomputable functions. There exists a Boolean function $F : \{0,1\}^* \to \{0,1\}$ that is not computable by any DFA.

Proof. If every Boolean function $F$ is computable by some DFA then DFACOMP equals the set ALL of all Boolean functions, but by Theorem 2.12 the latter set is uncountable, contradicting Theorem 6.4. ■

6.3 REGULAR EXPRESSIONS

Searching for a piece of text is a common task in computing. At its heart, the search problem is quite simple. We have a collection $X = \{x_0, \ldots, x_k\}$ of strings (e.g., files on a hard-drive, or student records in a database), and the user wants to find out the subset of all the $x \in X$ that are matched by some pattern (e.g., all files whose names end with the string .txt). In full generality, we can allow the user to specify the pattern by specifying a (computable) function $F : \{0,1\}^* \to \{0,1\}$, where $F(x) = 1$ corresponds to the pattern matching $x$. That is, the user provides a program $P$ in some Turing-complete programming language such as Python, and the system will return all the $x \in X$ such that $P(x) = 1$. For example, one could search for all text files that contain the string important document or perhaps (letting $P$ correspond to a neural-network based classifier) all images that contain a cat. However, we don't want our system to get into an infinite loop just trying to evaluate the program $P$! For this reason, typical systems for searching files or databases do not allow users to specify the patterns using full-fledged programming languages. Rather, such systems use restricted computational models that on the one hand are rich enough to capture many of the queries needed in practice (e.g., all filenames ending with .txt, or all phone numbers of the form (617) xxx-xxxx), but on the other hand are restricted enough so that the queries can be evaluated very efficiently on huge files and in particular cannot result in an infinite loop.

One of the most popular such computational models is regular expressions. If you ever used an advanced text editor, a command line shell, or have done any kind of manipulation of text files, then you have probably come across regular expressions.

A regular expression over some alphabet $\Sigma$ is obtained by combining elements of $\Sigma$ with the operation of concatenation, as well as $|$ (corresponding to or) and $*$ (corresponding to repetition zero or more times). (Common implementations of regular expressions in programming languages and shells typically include some extra operations on top of $|$ and $*$, but these operations can be implemented as "syntactic sugar" using the operators $|$ and $*$.) For example, the following regular expression over the alphabet $\{0,1\}$ corresponds to the set of all strings $x \in \{0,1\}^*$ where every digit is repeated at least twice:

$(00(0^*)|11(1^*))^*$. (6.9)

The following regular expression over the alphabet $\{a, \ldots, z, 0, \ldots, 9\}$ corresponds to the set of all strings that consist of a sequence of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero):

$(a|b|c|d)(a|b|c|d)^* (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^*$. (6.10)

Formally, regular expressions are defined by the following recursive definition:

Definition 6.6 — Regular expression. A regular expression $e$ over an alphabet $\Sigma$ is a string over $\Sigma \cup \{(, ), |, *, \emptyset, ""\}$ that has one of the following forms:

1. $e = \sigma$ where $\sigma \in \Sigma$.

2. $e = (e'|e'')$ where $e', e''$ are regular expressions.

3. $e = (e')(e'')$ where $e', e''$ are regular expressions. (We often drop the parentheses when there is no danger of confusion and so write this as $e'\,e''$.)

4. $e = (e')^*$ where $e'$ is a regular expression.

Finally we also allow the following "edge cases": $e = \emptyset$ and $e = ""$. These are the regular expressions corresponding to accepting no strings, and accepting only the empty string respectively.

We will drop parentheses when they can be inferred from the context. We also use the convention that OR and concatenation are left-associative, and we give highest precedence to $*$, then concatenation, and then OR. Thus for example we write $00^*|11$ instead of $((0)(0^*))|((1)(1))$.

Every regular expression $e$ corresponds to a function $\Phi_e : \Sigma^* \to \{0,1\}$ where $\Phi_e(x) = 1$ if $x$ matches the regular expression. For example, if $e = (00|11)^*$ then $\Phi_e(110011) = 1$ but $\Phi_e(101) = 0$ (can you see why?).

P The formal definition of $\Phi_e$ is one of those definitions that is more cumbersome to write than to grasp. Thus it might be easier for you to first work it out on your own and then check that your definition matches what is written below.

Definition 6.7 — Matching a regular expression. Let $e$ be a regular expression over the alphabet $\Sigma$. The function $\Phi_e : \Sigma^* \to \{0,1\}$ is defined as follows:

1. If $e = \sigma$ then $\Phi_e(x) = 1$ iff $x = \sigma$.

2. If $e = (e'|e'')$ then $\Phi_e(x) = \Phi_{e'}(x) \vee \Phi_{e''}(x)$ where $\vee$ is the OR operator.

3. If $e = (e')(e'')$ then $\Phi_e(x) = 1$ iff there is some $x', x'' \in \Sigma^*$ such that $x$ is the concatenation of $x'$ and $x''$ and $\Phi_{e'}(x') = \Phi_{e''}(x'') = 1$.

4. If $e = (e')^*$ then $\Phi_e(x) = 1$ iff there is some $k \in \mathbb{N}$ and some $x_0, \ldots, x_{k-1} \in \Sigma^*$ such that $x$ is the concatenation $x_0 \cdots x_{k-1}$ and $\Phi_{e'}(x_i) = 1$ for every $i \in [k]$.

5. Finally, for the edge cases, $\Phi_\emptyset$ is the constant zero function, and $\Phi_{""}$ is the function that only outputs $1$ on the empty string "".

We say that a regular expression $e$ over $\Sigma$ matches a string $x \in \Sigma^*$ if $\Phi_e(x) = 1$.

P The definitions above are not inherently difficult, but are a bit cumbersome. So you should pause here and go over them again until you understand why they correspond to our intuitive notion of regular expressions. This is important not just for understanding regular expressions themselves (which are used time and again in a great many applications) but also for getting better at understanding recursive definitions in general.

A Boolean function is called "regular" if it outputs $1$ on precisely the set of strings that are matched by some regular expression. That is,

Definition 6.8 — Regular functions / languages. Let $\Sigma$ be a finite set and $F : \Sigma^* \to \{0,1\}$ be a Boolean function. We say that $F$ is regular if $F = \Phi_e$ for some regular expression $e$. Similarly, for every formal language $L \subseteq \Sigma^*$, we say that $L$ is regular if and only if there is a regular expression $e$ such that $x \in L$ iff $e$ matches $x$.

Example 6.9 — A regular function. Let $\Sigma = \{a, b, c, d, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ and $F : \Sigma^* \to \{0,1\}$ be the function such that $F(x)$ outputs $1$ iff $x$ consists of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero). Then $F$ is a regular function, since $F = \Phi_e$ where

$e = (a|b|c|d)(a|b|c|d)^* (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^*$ (6.11)

is the expression we saw in (6.10).

If we wanted to verify, for example, that $\Phi_e(abc12078) = 1$, we can do so by noticing that the expression $(a|b|c|d)$ matches the string $a$, $(a|b|c|d)^*$ matches $bc$, $(1|2|3|4|5|6|7|8|9)$ matches the string $1$, and the expression $(0|1|2|3|4|5|6|7|8|9)^*$ matches the string $2078$. Each one of those boils down to a simpler expression. For example, the expression $(a|b|c|d)^*$ matches the string $bc$ because both of the one-character strings $b$ and $c$ are matched by the expression $a|b|c|d$.

Regular expressions can be defined over any finite alphabet $\Sigma$, but as usual, we will mostly focus our attention on the binary case, where $\Sigma = \{0,1\}$. Most (if not all) of the theoretical and practical general insights about regular expressions can be gleaned from studying the binary case.
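As an aside, expression (6.11) can be tried out directly with Python's re module, whose syntax for $|$, $*$ and concatenation coincides with the notation above; this snippet is our own illustration, not part of the book's formal development:

import re

# Expression (6.11): one or more of a-d, then digits without a leading zero.
e = "(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"

print(re.fullmatch(e, "abc12078") is not None)  # True
print(re.fullmatch(e, "abc012") is not None)    # False: leading zero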

6.3.1 Algorithms for matching regular expressions

Regular expressions would not be very useful for search if we could not actually evaluate, given a regular expression $e$, whether a string $x$ is matched by $e$. Luckily, there is an algorithm to do so. Specifically, there is an algorithm (think "Python program", though later we will formalize the notion of algorithms using Turing machines) that on input a regular expression $e$ over the alphabet $\{0,1\}$ and a string $x \in \{0,1\}^*$, outputs $1$ iff $e$ matches $x$ (i.e., outputs $\Phi_e(x)$). Indeed, Definition 6.7 actually specifies a recursive algorithm for computing $\Phi_e$. Specifically, each one of our operations - concatenation, OR, and star - can be thought of as reducing the task of testing whether an expression $e$ matches a string $x$ to testing whether some sub-expressions of $e$ match substrings of $x$. Since these sub-expressions are always shorter than the original expression, this yields a recursive algorithm for checking if $e$ matches $x$ which will eventually terminate at the base cases of the expressions that correspond to a single symbol or the empty string.

Algorithm 6.10 — Regular expression matching.

Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^*$
Output: $\Phi_e(x)$

1: procedure Match($e$, $x$)
2:   if $e = \emptyset$ then return $0$ ;
3:   if $x = ""$ then return MatchEmpty($e$) ;
4:   if $e \in \Sigma$ then return $1$ iff $x = e$ ;
5:   if $e = (e'|e'')$ then return Match($e'$, $x$) or Match($e''$, $x$) ;
6:   if $e = (e')(e'')$ then
7:     for $i \in [|x| + 1]$ do
8:       if Match($e'$, $x_0 \cdots x_{i-1}$) and Match($e''$, $x_i \cdots x_{|x|-1}$) then return $1$ ;
9:     end for
10:  end if
11:  if $e = (e')^*$ then
12:    if $e' = ""$ then return Match($""$, $x$) ;
13:    # $("")^*$ is the same as $""$
14:    for $i \in [|x|]$ do
15:      # $x_0 \cdots x_{i-1}$ is shorter than $x$
16:      if Match($e$, $x_0 \cdots x_{i-1}$) and Match($e'$, $x_i \cdots x_{|x|-1}$) then return $1$ ;
17:    end for
18:  end if
19:  return $0$
20: end procedure

We assume above that we have a procedure MatchEmpty that on input a regular expression $e$ outputs $1$ if and only if $e$ matches the empty string "".

The key observation is that in our recursive definition of regular expressions, whenever $e$ is made up of one or two expressions $e', e''$ then these two regular expressions are smaller than $e$. Eventually (when they have size $1$) they must correspond to the non-recursive case of a single alphabet symbol. Correspondingly, the recursive calls made in Algorithm 6.10 always correspond to a shorter expression or (in the case of an expression of the form $(e')^*$) a shorter input string. Thus, we can prove the correctness of Algorithm 6.10 on inputs of the form $(e, x)$ by induction over $\min\{|e|, |x|\}$. The base case is when either $x = ""$ or $e$ is a single alphabet symbol, "", or $\emptyset$. In the case the expression is of the form $e = (e'|e'')$ or $e = (e')(e'')$ we make recursive calls with the shorter expressions $e', e''$. In the case the expression is of the form $e = (e')^*$ we make recursive calls with either a shorter string $x$ and the same expression, or with the shorter expression $e'$ and a string $x'$ that is equal in length or shorter than $x$.

Solved Exercise 6.3 — Match the empty string. Give an algorithm that on input a regular expression $e$, outputs $1$ if and only if $\Phi_e("") = 1$. ■

1. An expression of the form "" or always matches the empty string. ′ ∗ (푒 ) 2. An expression of the form , where is an alphabet sym- bol, never matches the empty string. 휎 휎 ∈ Σ 3. The regular expression does not match the empty string. ∅ 4. An expression of the form matches the empty string if and only if one of or matches′ it.″ ′ ″ 푒 |푒 5. An expression푒 of the푒 form matches the empty string if and only if both and match′ ″ it. ′ ″ (푒 )(푒 ) Given the above푒 observations,푒 we see that the following algo- rithm will check if matches the empty string: procedure{MatchEmpty}{ } lIf { } return lendif lIf { ""} return 푒lendif lIf { or ∅} return lendif lIf { } return 푒 푒∅ or = 0 lendif LIf푒 { = ′ ″ 1 } return 푒 = ′ 푒 ∈ Σor 0 ″ lendif푒 = lIf (푒 { |푒 ) ′ }′ return푀푎푡푐ℎ퐸푚푝푡푦(푒lendif endprocedure) 푀푎푡푐ℎ퐸푚푝푡푦(푒′ ) ″ 푒 = (푒 ′)(푟∗ ) 푀푎푡푐ℎ퐸푚푝푡푦(푒 ) 푀푎푡푐ℎ퐸푚푝푡푦(푒 ■) 푒 = (푒 ) 1

6.4 EFFICIENT MATCHING OF REGULAR EXPRESSIONS (OPTIONAL)

Algorithm 6.10 is not very efficient. For example, given an expression involving concatenation or the "star" operation and a string of length $n$, it can make $n$ recursive calls, and hence it can be shown that in the worst case Algorithm 6.10 can take time exponential in the length of the input string $x$. Fortunately, it turns out that there is a much more efficient algorithm that can match regular expressions in linear (i.e., $O(n)$) time. Since we have not yet covered the topics of time and space complexity, we describe this algorithm in high level terms, without making the computational model precise, using the colloquial notion of $O(n)$ running time as is used in introduction to programming courses and whiteboard coding interviews. We will see a formal definition of time complexity in Chapter 13.

Theorem 6.11 — Matching regular expressions in linear time. Let $e$ be a regular expression. Then there is an $O(n)$ time algorithm that computes $\Phi_e$.

The implicit constant in the $O(n)$ term of Theorem 6.11 depends on the expression $e$. Thus, another way to state Theorem 6.11 is that for every expression $e$, there is some constant $c$ and an algorithm $A$ that computes $\Phi_e$ on $n$-bit inputs using at most $c \cdot n$ steps. This makes sense, since in practice we often want to compute $\Phi_e(x)$ for a small regular expression $e$ and a large document $x$. Theorem 6.11 tells us that we can do so with running time that scales linearly with the size of the document, even if it has (potentially) worse dependence on the size of the regular expression.

We prove Theorem 6.11 by obtaining a more efficient recursive algorithm, that determines whether $e$ matches a string $x \in \{0,1\}^n$ by reducing this task to determining whether a related expression $e'$ matches $x_0, \ldots, x_{n-2}$. This will result in an expression for the running time of the form $T(n) = T(n-1) + O(1)$ which solves to $T(n) = O(n)$.

Restrictions of regular expressions. The central definition for the algorithm behind Theorem 6.11 is the notion of a restriction of a regular expression. The idea is that for every regular expression $e$ and symbol $\sigma$ in its alphabet, it is possible to define a regular expression $e[\sigma]$ such that $e[\sigma]$ matches a string $x$ if and only if $e$ matches the string $x\sigma$. For example, if $e$ is the regular expression $01|(01)^*(01)$ (i.e., one or more occurrences of $01$) then $e[1]$ is equal to $0|(01)^*0$ and $e[0]$ will be $\emptyset$. (Can you see why?)

Algorithm 6.12 computes the restriction $e[\sigma]$ given a regular expression $e$ and an alphabet symbol $\sigma$. It always terminates, since the recursive calls it makes are always on expressions smaller than the input expression. Its correctness can be proven by induction on the length of the regular expression $e$, with the base cases being when $e$ is "", $\emptyset$, or a single alphabet symbol $\tau$.

Algorithm 6.12 — Restricting regular expression.

Input: Regular expression $e$ over $\Sigma$, symbol $\sigma \in \Sigma$
Output: Regular expression $e' = e[\sigma]$ such that $\Phi_{e'}(x) = \Phi_e(x\sigma)$ for every $x \in \Sigma^*$

1: procedure Restrict($e$, $\sigma$)
2:   if $e = \emptyset$ or $e = ""$ then return $\emptyset$ ;
3:   if $e = \tau$ for $\tau \in \Sigma$ then return "" if $\tau = \sigma$ and return $\emptyset$ otherwise ;
4:   if $e = (e'|e'')$ then return (Restrict($e'$, $\sigma$) | Restrict($e''$, $\sigma$)) ;
5:   if $e = (e')^*$ then return $(e')^*$(Restrict($e'$, $\sigma$)) ;
6:   if $e = (e')(e'')$ and $\Phi_{e''}("") = 0$ then return $(e')$(Restrict($e''$, $\sigma$)) ;
7:   if $e = (e')(e'')$ and $\Phi_{e''}("") = 1$ then return $(e')$(Restrict($e''$, $\sigma$)) | Restrict($e'$, $\sigma$) ;
8: end procedure
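Algorithm 6.12 translates directly into Python. This sketch reuses the tuple encoding and the match_empty and match helpers from the earlier sketch (all our own conventions, not the book's notation):

def restrict(e, sigma):
    """Return e[sigma], which matches x iff e matches x + sigma."""
    if e in ('empty', 'eps'):
        return 'empty'
    if e[0] == 'sym':
        return 'eps' if e[1] == sigma else 'empty'
    if e[0] == 'or':
        return ('or', restrict(e[1], sigma), restrict(e[2], sigma))
    if e[0] == 'star':
        return ('cat', e, restrict(e[1], sigma))
    if e[0] == 'cat':
        rest = ('cat', e[1], restrict(e[2], sigma))
        # if e'' matches "", the final sigma may instead come from e'
        return ('or', rest, restrict(e[1], sigma)) if match_empty(e[2]) else rest

# e = 01|(01)*(01), the "one or more copies of 01" example from the text.
e01 = ('cat', ('sym', '0'), ('sym', '1'))
e = ('or', e01, ('cat', ('star', e01), e01))
# Sanity check of the defining property Φ_{e[σ]}(x) = Φ_e(xσ):
assert match(restrict(e, '1'), "0") and match(e, "01")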

Using this notion of restriction, we can define the following recursive algorithm for regular expression matching:

Algorithm 6.13 — Regular expression matching in linear time.

Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^n$ where $n \in \mathbb{N}$
Output: $\Phi_e(x)$

1: procedure FMatch($e$, $x$)
2:   if $x = ""$ then return MatchEmpty($e$) ;
3:   Let $e' \leftarrow$ Restrict($e$, $x_{n-1}$)
4:   return FMatch($e'$, $x_0 \cdots x_{n-2}$)
5: end procedure

By the definition of a restriction, for every $\sigma \in \Sigma$ and $x' \in \Sigma^*$, the expression $e$ matches $x'\sigma$ if and only if $e[\sigma]$ matches $x'$. Hence for every $e$ and $x \in \Sigma^n$, $\Phi_{e[x_{n-1}]}(x_0 \cdots x_{n-2}) = \Phi_e(x)$ and Algorithm 6.13 does return the correct answer. The only remaining task is to analyze its running time. Note that Algorithm 6.13 uses the MatchEmpty procedure of Solved Exercise 6.3 in the base case that $x = ""$. However, this is OK since the running time of this procedure depends only on $e$ and is independent of the length of the original input.

For simplicity, let us restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$. Define $C(\ell)$ to be the maximum number of operations that Algorithm 6.12 takes when given as input a regular expression $e$ over $\{0,1\}$ of at most $\ell$ symbols. The value $C(\ell)$ can be shown to be polynomial in $\ell$, though this is not important for this theorem, since we only care about the dependence of the time to compute $\Phi_e(x)$ on the length of $x$ and not about the dependence of this time on the length of $e$.

Algorithm 6.13 is a recursive algorithm that on input an expression $e$ and a string $x \in \{0,1\}^n$, does computation of at most $C(|e|)$ steps and then calls itself with input some expression $e'$ and a string $x'$ of length $n - 1$. It will terminate after $n$ steps when it reaches a string of length $0$. So, the running time $T(e, n)$ that it takes for Algorithm 6.13 to compute $\Phi_e$ for inputs of length $n$ satisfies the recursive equation:

$T(e, n) = \max\{T(e[0], n-1), T(e[1], n-1)\} + C(|e|)$ (6.12)

(In the base case $n = 0$, $T(e, 0)$ is equal to some constant depending only on $e$.) To get some intuition for the expression Eq. (6.12), let us open up the recursion for one level, writing $T(e, n)$ as

$T(e, n) = \max\{T(e[0][0], n-2) + C(|e[0]|),$
$\qquad T(e[0][1], n-2) + C(|e[0]|),$
$\qquad T(e[1][0], n-2) + C(|e[1]|),$
$\qquad T(e[1][1], n-2) + C(|e[1]|)\} + C(|e|)$. (6.13)

Continuing this way, we can see that $T(e, n) \leq n \cdot C(L) + O(1)$ where $L$ is the largest length of any expression $e'$ that we encounter along the way. Therefore, the following claim suffices to show that Algorithm 6.13 runs in $O(n)$ time:

Claim: Let $e$ be a regular expression over $\{0,1\}$, then there is a number $L(e) \in \mathbb{N}$, such that for every sequence of symbols $\alpha_0, \ldots, \alpha_{n-1}$, if we define $e' = e[\alpha_0][\alpha_1] \cdots [\alpha_{n-1}]$ (i.e., restricting $e$ to $\alpha_0$, and then $\alpha_1$ and so on and so forth), then $|e'| \leq L(e)$.

Proof of claim: For a regular expression $e$ over $\{0,1\}$ and $\alpha \in \{0,1\}^m$, we denote by $e[\alpha]$ the expression $e[\alpha_0][\alpha_1] \cdots [\alpha_{m-1}]$ obtained by restricting $e$ to $\alpha_0$ and then to $\alpha_1$ and so on. We let $S(e) = \{e[\alpha] \,|\, \alpha \in \{0,1\}^*\}$. We will prove the claim by showing that for every $e$, the set $S(e)$ is finite, and hence so is the number $L(e)$ which is the maximum length of $e'$ for $e' \in S(e)$.

We prove this by induction on the structure of $e$. If $e$ is a symbol, the empty string, or the empty set, then this is straightforward to show, as the most expressions $S(e)$ can contain are the expression itself, "", and $\emptyset$. Otherwise we split to the two cases (i) $e = (e')^*$ and (ii) $e = e'e''$, where $e', e''$ are smaller expressions (and hence by the induction hypothesis $S(e')$ and $S(e'')$ are finite). In the case (i), if $e = (e')^*$ then $e[\alpha]$ is either equal to $(e')^* e'[\alpha]$ or it is simply the empty set if $e'[\alpha] = \emptyset$. Since $e'[\alpha]$ is in the set $S(e')$, the number of distinct expressions in $S(e)$ is at most $|S(e')| + 1$. In the case (ii), if $e = e'e''$ then all the restrictions of $e$ to strings $\alpha$ will either have the form $e'e''[\alpha]$ or the form $e'e''[\alpha] | e'[\alpha']$ where $\alpha'$ is some string such that $\alpha = \alpha'\alpha''$ and $e''[\alpha'']$ matches the empty string. Since $e''[\alpha] \in S(e'')$ and $e'[\alpha'] \in S(e')$, the number of the possible distinct expressions of the form $e[\alpha]$ is at most $|S(e'')| + |S(e'')| \cdot |S(e')|$. This completes the proof of the claim.

The bottom line is that while running Algorithm 6.13 on a regular expression $e$, all the expressions we ever encounter are in the finite set $S(e)$, no matter how large the input $x$ is, and so the running time of Algorithm 6.13 satisfies the equation $T(n) = T(n-1) + C'$ for some constant $C'$ depending on $e$. This solves to $O(n)$ where the implicit constant in the O notation can (and will) depend on $e$ but crucially, not on the length of the input $x$.

6.4.1 Matching regular expressions using DFAs

Theorem 6.11 is already quite impressive, but we can do even better. Specifically, no matter how long the string $x$ is, we can compute $\Phi_e(x)$ by maintaining only a constant amount of memory and moreover making a single pass over $x$. That is, the algorithm will scan the input $x$ once from start to finish, and then determine whether or not $x$ is matched by the expression $e$. This is important in the common case of trying to match a short regular expression over a huge file or document that might not even fit in our computer's memory. Of course, as we have seen before, a single-pass constant-memory algorithm is simply a deterministic finite automaton. As we will see in Theorem 6.16, a function can be computed by a regular expression if and only if it can be computed by a DFA. We start with showing the "only if" direction:

Theorem 6.14 — DFA for regular expression matching. Let $e$ be a regular expression. Then there is an algorithm that on input $x \in \{0,1\}^*$ computes $\Phi_e(x)$ while making a single pass over $x$ and maintaining a constant amount of memory.

Proof Idea: The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.15. The idea is to replace the recursive algorithm of Algorithm 6.13 with a dynamic program, using the technique of memoization. If you haven't yet taken an algorithms course, you might not know these techniques. This is OK; while this more efficient algorithm is crucial for the many practical applications of regular expressions, it is not of great importance for this book. ⋆


Algorithm 6.15 — Regular expression matching by a DFA.

Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^n$ where $n \in \mathbb{N}$
Output: $\Phi_e(x)$

1: procedure DFAMatch($e$, $x$)
2:   Let $S \leftarrow S(e)$ be the set $\{e[\alpha] \,|\, \alpha \in \{0,1\}^*\}$ as defined in the proof of Theorem 6.11.
3:   for $e' \in S$ do
4:     Let $v_{e'} \leftarrow 1$ if $\Phi_{e'}("") = 1$ and $v_{e'} \leftarrow 0$ otherwise
5:   end for
6:   for $i \in [n]$ do
7:     Let $last_{e'} \leftarrow v_{e'}$ for all $e' \in S$
8:     Let $v_{e'} \leftarrow last_{e'[x_i]}$ for all $e' \in S$
9:   end for
10:  return $v_e$
11: end procedure

Proof of Theorem 6.14. Algorithm 6.15 checks if a given string $x \in \Sigma^*$ is matched by the regular expression $e$. For every regular expression $e$, this algorithm has a constant number $2|S(e)|$ of Boolean variables ($v_{e'}, last_{e'}$ for $e' \in S(e)$), and it makes a single pass over the input string. Hence it corresponds to a DFA. We prove its correctness by induction on the length $n$ of the input. Specifically, we will argue that before reading the $i$-th bit of $x$, the variable $v_{e'}$ is equal to $\Phi_{e'}(x_0 \cdots x_{i-1})$ for every $e' \in S(e)$. In the case $i = 0$ this holds since we initialize $v_{e'} = \Phi_{e'}("")$ for all $e' \in S(e)$. For $i > 0$ this holds by induction since the inductive hypothesis implies that $last_{e'} = \Phi_{e'}(x_0 \cdots x_{i-2})$ for all $e' \in S(e)$, and by the definition of the set $S(e)$, for every $e' \in S(e)$ and $x_{i-1} \in \Sigma$, $e'' = e'[x_{i-1}]$ is in $S(e)$ and $\Phi_{e''}(x_0 \cdots x_{i-2}) = \Phi_{e'}(x_0 \cdots x_{i-1})$. ■

6.4.2 Equivalence of regular expressions and automata

Recall that a Boolean function $F : \{0,1\}^* \to \{0,1\}$ is defined to be regular if it is equal to $\Phi_e$ for some regular expression $e$. (Equivalently, a language $L \subseteq \{0,1\}^*$ is defined to be regular if there is a regular expression $e$ such that $e$ matches $x$ iff $x \in L$.) The following theorem is the central result of automata theory:

Theorem 6.16 — DFA and regular expression equivalency. Let $F : \{0,1\}^* \to \{0,1\}$. Then $F$ is regular if and only if there exists a DFA $(T, \mathcal{S})$ that computes $F$.

Proof Idea:

One direction follows from Theorem 6.14, which shows that for every regular expression $e$, the function $\Phi_e$ can be computed by a DFA (see for example Fig. 6.6). For the other direction, we show that given a DFA $(T, \mathcal{S})$, for every $v, w \in [C]$ we can find a regular expression that would match $x \in \{0,1\}^*$ if and only if the DFA, starting in state $v$, will end up in state $w$ after reading $x$. ⋆

Figure 6.6: A deterministic finite automaton that computes the function $\Phi_{(01)^*}$.

Proof of Theorem 6.16. Since Theorem 6.14 proves the "only if" direction, we only need to show the "if" direction. Let $A = (T, \mathcal{S})$ be a DFA with $C$ states that computes the function $F$. We need to show that $F$ is regular.

For every $v, w \in [C]$, we let $F_{v,w} : \{0,1\}^* \to \{0,1\}$ be the function that maps $x \in \{0,1\}^*$ to $1$ if and only if the DFA $A$, starting at the state $v$, will reach the state $w$ if it reads the input $x$. We will prove that $F_{v,w}$ is regular for every $v, w$. This will prove the theorem, since by Definition 6.2, $F(x)$ is equal to the OR of $F_{0,w}(x)$ for every $w \in \mathcal{S}$. Hence if we have a regular expression for every function of the form $F_{v,w}$, then (using the $|$ operation) we can obtain a regular expression for $F$ as well.

To give regular expressions for the functions $F_{v,w}$, we start by defining the following functions $F^t_{v,w}$: for every $v, w \in [C]$ and $0 \leq t \leq C$, $F^t_{v,w}(x) = 1$ if and only if starting from $v$ and observing $x$, the automaton reaches $w$ with all intermediate states being in the set $[t] = \{0, \ldots, t-1\}$ (see Fig. 6.7). That is, while $v, w$ themselves might be outside $[t]$, $F^t_{v,w}(x) = 1$ if and only if throughout the execution of the automaton on the input $x$ (when initiated at $v$) it never enters any of the states outside $[t]$ and still ends up at $w$. If $t = 0$ then $[t]$ is the empty set, and hence $F^0_{v,w}(x) = 1$ if and only if the automaton reaches $w$ from $v$ directly on $x$, without any intermediate state. If $t = C$ then all states are in $[t]$, and hence $F^C_{v,w} = F_{v,w}$.

Figure 6.7: Given a DFA of $C$ states, for every $v, w \in [C]$ and number $t \in \{0, \ldots, C\}$ we define the function $F^t_{v,w} : \{0,1\}^* \to \{0,1\}$ to output one on input $x \in \{0,1\}^*$ if and only if, when the DFA is initialized in the state $v$ and is given the input $x$, it will reach the state $w$ while going only through the intermediate states $\{0, \ldots, t-1\}$.

We will prove the theorem by induction on $t$, showing that $F^t_{v,w}$ is regular for every $v, w$ and $t$. For the base case of $t = 0$, $F^0_{v,w}$ is regular for every $v, w$ since it can be described as one of the expressions "", $\emptyset$, $0$, $1$, or $0|1$. Specifically, if $v = w$ then $F^0_{v,w}(x) = 1$ if and only if $x$ is the empty string. If $v \neq w$ then $F^0_{v,w}(x) = 1$ if and only if $x$ consists of a single symbol $\sigma \in \{0,1\}$ and $T(v, \sigma) = w$. Therefore in this case $F^0_{v,w}$ corresponds to one of the four regular expressions $0|1$, $0$, $1$, or $\emptyset$, depending on whether $A$ transitions to $w$ from $v$ when it reads either $0$ or $1$, only one of these symbols, or neither.

Inductive step: Now that we've seen the base case, let's prove the general case by induction. Assume, via the induction hypothesis, that for every $v', w' \in [C]$, we have a regular expression $R^t_{v',w'}$ that computes $F^t_{v',w'}$. We need to prove that $F^{t+1}_{v,w}$ is regular for every $v, w$. If the automaton arrives from $v$ to $w$ using the intermediate states $[t+1]$, then it visits the $t$-th state zero or more times. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ without visiting the $t$-th state at all, then $x$ is matched by the regular expression $R^t_{v,w}$. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ while visiting the $t$-th state $k > 0$ times, then we can think of this path as:

• First travel from $v$ to $t$ using only intermediate states in $\{0, \ldots, t-1\}$.

• Then go from $t$ back to itself $k - 1$ times using only intermediate states in $\{0, \ldots, t-1\}$.

• Then go from $t$ to $w$ using only intermediate states in $\{0, \ldots, t-1\}$.

Therefore in this case the string $x$ is matched by the regular expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$. (See also Fig. 6.8.) Therefore we can compute $F^{t+1}_{v,w}$ using the regular expression

$R^t_{v,w} \,|\, R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$. (6.14)

This completes the proof of the inductive step and hence of the theorem. ■

Figure 6.8: If we have regular expressions $R^t_{v',w'}$ corresponding to $F^t_{v',w'}$ for every $v', w' \in [C]$, we can obtain a regular expression $R^{t+1}_{v,w}$ corresponding to $F^{t+1}_{v,w}$. The key observation is that a path from $v$ to $w$ using $\{0, \ldots, t\}$ either does not touch $t$ at all, in which case it is captured by the expression $R^t_{v,w}$, or it goes from $v$ to $t$, comes back to $t$ zero or more times, and then goes from $t$ to $w$, in which case it is captured by the expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$.

6.4.3 Closure properties of regular expressions

If $F$ and $G$ are regular functions computed by the expressions $e$ and $f$ respectively, then the expression $e|f$ computes the function $H$ defined as $H(x) = F(x) \vee G(x)$. Another way to say this is that the set of regular functions is closed under the OR operation. That is, if $F$ and $G$ are regular then so is $F \vee G$. An important corollary of Theorem 6.16 is that this set is also closed under the NOT operation:

Lemma 6.17 — Regular expressions closed under complement. If $F : \{0,1\}^* \to \{0,1\}$ is regular then so is the function $\overline{F}$, where $\overline{F}(x) = 1 - F(x)$ for every $x \in \{0,1\}^*$.

Proof. If $F$ is regular then by Theorem 6.14 it can be computed by a DFA $A = (T, \mathcal{A})$ with some $C$ states. But then the DFA $\overline{A} = (T, [C] \setminus \mathcal{A})$, which does the same computation but flips the set of accepted states, will compute $\overline{F}$. By Theorem 6.16 this implies that $\overline{F}$ is regular as well. ■

Since $a \wedge b = \overline{\overline{a} \vee \overline{b}}$, Lemma 6.17 implies that the set of regular functions is closed under the AND operation as well. Moreover, since OR, NOT and AND are a universal basis, this set is also closed under NAND, XOR, and any other finite function. That is, we have the following corollary:

Theorem 6.18 — Closure of regular expressions. Let $f : \{0,1\}^k \to \{0,1\}$ be any finite Boolean function, and let $F_0, \ldots, F_{k-1} : \{0,1\}^* \to \{0,1\}$ be regular functions. Then the function $G(x) = f(F_0(x), F_1(x), \ldots, F_{k-1}(x))$ is regular.

Proof. This is a direct consequence of the closure of regular functions under OR and NOT (and hence AND) and Theorem 4.13, which states that every $f$ can be computed by a Boolean circuit (which is simply a combination of the AND, OR, and NOT operations). ■

6.5 LIMITATIONS OF REGULAR EXPRESSIONS AND THE PUMPING LEMMA

The efficiency of regular expression matching makes them very useful. This is the reason why operating systems and text editors often restrict their search interface to regular expressions and don't allow searching by specifying an arbitrary function. But this efficiency comes at a cost. As we have seen, regular expressions cannot compute every function. In fact there are some very simple (and useful!) functions that they cannot compute. Here is one example:

Lemma 6.19 — Matching parenthesis. Let $\Sigma = \{\langle, \rangle\}$ and $\text{MATCHPAREN} : \Sigma^* \to \{0,1\}$ be the function that given a string of parentheses, outputs $1$ if and only if every opening parenthesis is matched by a corresponding closed one. Then there is no regular expression over $\Sigma$ that computes MATCHPAREN.

Lemma 6.19 is a consequence of the following result, which is known as the pumping lemma:

Theorem 6.20 — Pumping Lemma. Let $e$ be a regular expression over some alphabet $\Sigma$. Then there is some number $n_0$ such that for every $w \in \Sigma^*$ with $|w| > n_0$ and $\Phi_e(w) = 1$, we can write $w = xyz$ for strings $x, y, z \in \Sigma^*$ satisfying the following conditions:

1. $|y| \geq 1$.

2. $|xy| \leq n_0$.

3. $\Phi_e(xy^k z) = 1$ for every $k \in \mathbb{N}$.

Figure 6.9: To prove the "pumping lemma" we look at a word $w$ that is much larger than the regular expression $e$ that matches it. In such a case, part of $w$ must be matched by some sub-expression of the form $(e')^*$, since this is the only operator that allows matching words longer than the expression. If we look at the "leftmost" such sub-expression and define $y^k$ to be the string that is matched by it, we obtain the partition needed for the pumping lemma.

Proof Idea: The idea behind the proof is the following. Let $n_0$ be twice the number of symbols that are used in the expression $e$. The only way that there is some $w$ with $|w| > n_0$ and $\Phi_e(w) = 1$ is that $e$ contains the $*$ (i.e., star) operator and that there is a nonempty substring $y$ of $w$ that was matched by $(e')^*$ for some sub-expression $e'$ of $e$. We can now repeat $y$ any number of times and still get a matching string. See also Fig. 6.9. ⋆

P The pumping lemma is a bit cumbersome to state, but one way to remember it is that it simply says the following: "if a string matching a regular expression is long enough, one of its substrings must be matched using the $*$ operator".


Proof of Theorem 6.20. To prove the lemma formally, we use induction on the length of the expression. Like all induction proofs, this is going to be somewhat lengthy, but at the end of the day it directly follows the intuition above that somewhere we must have used the star operation. Reading this proof, and in particular understanding how the formal proof below corresponds to the intuitive idea above, is a very good way to get more comfortable with inductive proofs of this form.

Our inductive hypothesis is that for an $n$ length expression, $n_0 = 2n$ satisfies the conditions of the lemma. The base case is when the expression is a single symbol $\sigma \in \Sigma$ or that the expression is $\emptyset$ or "". In all these cases the conditions of the lemma are satisfied simply because there is no string $x$ of length larger than $n_0 = 2$ that is matched by the expression.

We now prove the inductive step. Let $e$ be a regular expression with $n > 1$ symbols. We set $n_0 = 2n$ and let $w \in \Sigma^*$ be a string satisfying $|w| > n_0$. Since $e$ has more than one symbol, it has one of the forms (a) $e'|e''$, (b) $(e')(e'')$, or (c) $(e')^*$, where in all these cases the subexpressions $e'$ and $e''$ have fewer symbols than $e$ and hence satisfy the induction hypothesis.

In the case (a), every string $w$ matched by $e$ must be matched by either $e'$ or $e''$. If $e'$ matches $w$ then, since $|w| > 2|e'|$, by the induction hypothesis there exist $x, y, z$ with $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such that $e'$ (and therefore also $e = e'|e''$) matches $xy^k z$ for every $k$. The same argument works in the case that $e''$ matches $w$.

In the case (b), if $w$ is matched by $(e')(e'')$ then we can write $w = w'w''$ where $w'$ matches $e'$ and $w''$ matches $e''$. We split to subcases. If $|w'| > 2|e'|$ then by the induction hypothesis there exist $x, y, z'$ with $|y| \geq 1$, $|xy| \leq 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^k z'$ for every $k \in \mathbb{N}$. This completes the proof since if we set $z = z'w''$ then we see that $w = w'w'' = xyz$ and $e$ matches $xy^k z$ for every $k \in \mathbb{N}$. Otherwise, if $|w'| \leq 2|e'|$ then since $|w| = |w'| + |w''| > n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction hypothesis there exist $x', y, z$ such that $|y| \geq 1$, $|x'y| \leq 2|e''|$ and $e''$ matches $x'y^k z$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that $|xy| \leq |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0$ and on the other hand the expression $e = (e')(e'')$ matches $xy^k z = w'x'y^k z$ for every $k \in \mathbb{N}$.

In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where each $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$ then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and $z = w_1 \cdots w_t$, then $|xy| \leq n_0$ and $xy^k z$ is matched by $(e')^*$ for every $k \in \mathbb{N}$. ■

R Remark 6.21 — Recursive definitions and inductive proofs. When an object is recursively defined (as in the case of regular expressions) then it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property $P$, then it is natural to use an inductive step that says that if $o', o'', o'''$ etc. have property $P$ then so does an object $o$ that is obtained by composing them.

Using the pumping lemma, we can easily prove Lemma 6.19 (i.e., the non-regularity of the "matching parenthesis" function):

Proof of Lemma 6.19. Suppose, towards the sake of contradiction, that there is an expression $e$ such that $\Phi_e = \text{MATCHPAREN}$. Let $n_0$ be the number obtained from Theorem 6.20 and let $w = \langle^{n_0} \rangle^{n_0}$ (i.e., $n_0$ left parentheses followed by $n_0$ right parentheses). Then we see that if we write $w = xyz$ as in Theorem 6.20, the condition $|xy| \leq n_0$ implies that $y$ consists solely of left parentheses. Hence the string $xy^2z$ will contain more left parentheses than right parentheses. Hence $\text{MATCHPAREN}(xy^2z) = 0$, but by the pumping lemma $\Phi_e(xy^2z) = 1$, contradicting our assumption that $\Phi_e = \text{MATCHPAREN}$. ■

The pumping lemma is a very useful tool to show that certain functions are not computable by a regular expression. However, it is not an "if and only if" condition for regularity: there are non-regular functions that still satisfy the conditions of the pumping lemma. To understand the pumping lemma, it is important to follow the order of quantifiers in Theorem 6.20. In particular, the number $n_0$ in the statement of Theorem 6.20 depends on the regular expression (in the proof we chose $n_0$ to be twice the number of symbols in the expression). So, if we want to use the pumping lemma to rule out the existence of a regular expression $e$ computing some function $F$, we need to be able to choose an appropriate input $w \in \{0,1\}^*$ that can be arbitrarily large and satisfies $F(w) = 1$. This makes sense if you think about the intuition behind the pumping lemma: we need $w$ to be large enough as to force the use of the star operator.

Solved Exercise 6.4 — Palindromes is not regular. Prove that the following function over the alphabet $\{0, 1, ;\}$ is not regular: $\text{PAL}(w) = 1$ if and only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ "reversed": the string $u_{|u|-1} \cdots u_0$. (The Palindrome function is most often defined without an explicit separator character $;$, but the version with such a separator is a bit cleaner and so we use it here. This does not make

Figure 6.10: A cartoon of a proof using the pumping lemma that a function $F$ is not regular. The pumping lemma states that if $F$ is regular then there exists a number $n_0$ such that for every large enough $w$ with $F(w) = 1$, there exists a partition of $w$ to $w = xyz$ satisfying certain conditions such that for every $k \in \mathbb{N}$, $F(xy^kz) = 1$. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every "there exists" quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every "for every" quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction, which would be a choice of $k$ that would result in $F(xy^kz) = 0$, hence violating the conclusion of the pumping lemma.

much difference, as one can easily encode the separator as a special binary string instead.) ■

Solution: We use the pumping lemma. Suppose towards the sake of contradiction that there is a regular expression $e$ computing PAL, and let $n_0$ be the number obtained by the pumping lemma (Theorem 6.20). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse of the all zero string is the all zero string, $\text{PAL}(w) = 1$. Now, by the pumping lemma, if PAL is computed by $e$, then we can write $w = xyz$ such that $|xy| \leq n_0$, $|y| \geq 1$, and $\text{PAL}(xy^kz) = 1$ for every $k \in \mathbb{N}$. In particular, it must hold that $\text{PAL}(xz) = 1$, but this is a contradiction, since $xz = 0^{n_0 - |y|};0^{n_0}$ and so its two parts are not of the same length and in particular are not the reverse of one another. ■

For yet another example of a pumping-lemma based proof, see Fig. 6.10, which illustrates a cartoon of the proof of the non-regularity of the function $F : \{0,1\}^* \to \{0,1\}$ which is defined as $F(x) = 1$ iff $x = 0^n 1^n$ for some $n \in \mathbb{N}$ (i.e., $x$ consists of a string of consecutive zeroes, followed by a string of consecutive ones of the same length).

6.6 ANSWERING SEMANTIC QUESTIONS ABOUT REGULAR EXPRESSIONS

Regular expressions have applications beyond search. For example, regular expressions are often used to define tokens (such as what is a valid variable identifier, or keyword) in the design of parsers, compilers and interpreters for programming languages. There are other applications too: for example, in recent years, the world of networking moved from fixed topologies to "software defined networks". Such networks are routed by programmable switches that can implement policies such as "if packet is secured by SSL then forward it to A, otherwise forward it to B". To represent such policies we need a language that is on one hand sufficiently expressive to capture the policies we want to implement, but on the other hand sufficiently restrictive so that we can quickly execute them at network speed and also be able to answer questions such as "can C see the packets moved from A to B?". The NetKAT network programming language uses a variant of regular expressions to achieve precisely that. For this application, it is important that we are able not merely to answer whether an expression $e$ matches a string $x$, but also to answer semantic questions about regular expressions such as "do expressions $e$ and $e'$ compute the same

function?” and “does there exist a string that is matched by the ex- pression ?”. The following theorem shows that we can answer the latter question: 푥 푒 Theorem 6.22 — Emptiness of regular languages is computable. There is an algorithm that given a regular expression , outputs if and only if is the constant zero function. 푒 1 푒 ProofΦ Idea: The idea is that we can directly observe this from the structure of the expression. The only way a regular expression computes the constant zero function is if has the form or is obtained by concatenating with other expressions. ∅ 푒 ∅ 푒

Proof of Theorem 6.22. Define a regular expression to be "empty" if it computes the constant zero function. Given a regular expression $e$, we can determine if $e$ is empty using the following rules:

• If $e$ has the form $\sigma$ or $""$ then it is not empty.

• If $e$ or $e'$ is not empty then $e|e'$ is not empty.

• If $e$ is not empty then $e^*$ is not empty.

• If $e$ and $e'$ are both not empty then $e e'$ is not empty.

• $\emptyset$ is empty.

Using these rules it is straightforward to come up with a recursive algorithm to determine emptiness. ■
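The rules above translate directly into a recursive procedure. The following is a minimal sketch in Python; the nested-tuple representation of expressions, and all names here, are our own choices for illustration and not part of the text's formal definitions.

# Regular expressions as nested tuples: a literal symbol or "" is a Python
# string, the empty set is the marker "EMPTY", and composite expressions
# are ("|", e1, e2), ("concat", e1, e2), or ("*", e1).
def is_empty(e):
    """Return True iff the expression e computes the constant zero function."""
    if e == "EMPTY":
        return True                  # the only empty base case
    if isinstance(e, str):
        return False                 # a single symbol or "" matches something
    op = e[0]
    if op == "|":
        return is_empty(e[1]) and is_empty(e[2])
    if op == "concat":
        return is_empty(e[1]) or is_empty(e[2])
    if op == "*":
        return False                 # e* always matches the empty string
    raise ValueError("unknown operator")

print(is_empty(("concat", "0", "EMPTY")))   # True: concatenating with the empty set matches nothing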

Using Theorem 6.22, we can obtain an algorithm that determines whether or not two regular expressions $e$ and $e'$ are equivalent, in the sense that they compute the same function.

Theorem 6.23 — Equivalence of regular expressions is computable. Let REGEQ$: \{0,1\}^* \to \{0,1\}$ be the function that on input (a string representing) a pair of regular expressions $e, e'$, REGEQ$(e, e') = 1$ if and only if $\Phi_e = \Phi_{e'}$. Then there is an algorithm that computes REGEQ.

Proof Idea: The idea is to show that given a pair of regular expressions $e$ and $e'$, we can find an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only

if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to determine equivalence of $e$ and $e'$.

Proof of Theorem 6.23. We will prove Theorem 6.23 from Theorem 6.22. (The two theorems are in fact equivalent: it is easy to prove Theorem 6.22 from Theorem 6.23, since checking for emptiness is the same as checking equivalence with the expression $\emptyset$.) Given two regular expressions $e$ and $e'$, we will compute an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is equivalent to $e'$ if and only if $e''$ is empty.

We start with the observation that for every pair of bits $a, b \in \{0,1\}$, $a \neq b$ if and only if

$(a \wedge \overline{b}) \vee (\overline{a} \wedge b) = 1. \quad (6.15)$

Hence we need to construct $e''$ such that for every $x$,

$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \vee (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)). \quad (6.16)$

To construct the expression $e''$, we will show how given any pair of expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the expression for $e \vee e'$ is straightforward using the $|$ operation of regular expressions.)

Specifically, by Lemma 6.17, regular functions are closed under negation, which means that for every regular expression $e$, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now, for every two expressions $e$ and $e'$, the expression

$e \wedge e' = \overline{(\overline{e} | \overline{e'})} \quad (6.17)$

computes the AND of the two expressions. Given these two transformations, we see that for every pair of regular expressions $e$ and $e'$ we can find a regular expression $e''$ satisfying (6.16), such that $e''$ is empty if and only if $e$ and $e'$ are equivalent. ■
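As a sketch, this construction can be layered on top of the is_empty procedure from Theorem 6.22, reusing the same nested-tuple representation. The helper complement below is hypothetical: Lemma 6.17 guarantees such an expression exists, but its construction is not spelled out here.

# complement(e) is hypothetical: by Lemma 6.17 an expression computing
# the negated function exists, but we do not construct it here.
def regex_and(e1, e2):
    # De Morgan, as in (6.17): e1 AND e2 = NOT(NOT(e1) | NOT(e2))
    return complement(("|", complement(e1), complement(e2)))

def regex_equiv(e1, e2):
    # e'' matches x iff exactly one of e1, e2 matches x, as in (6.16)
    e_dbl = ("|", regex_and(e1, complement(e2)),
                  regex_and(complement(e1), e2))
    return is_empty(e_dbl)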

Chapter Recap

• We model computational tasks on arbitrarily large inputs using infinite functions $F: \{0,1\}^* \to \{0,1\}^*$.
• Such functions take an arbitrarily long (but still finite!) string as input, and cannot be described by a finite table of inputs and outputs.
• A function with a single bit of output is known as a Boolean function, and the task of computing it is equivalent to deciding a language $L \subseteq \{0,1\}^*$.

• Deterministic finite automata (DFAs) are one simple model for computing (infinite) Boolean functions.
• There are some functions that cannot be computed by DFAs.
• The set of functions computable by DFAs is the same as the set of functions computable by regular expressions.

6.7 EXERCISES

Exercise 6.1 — Closure properties of regular functions. Suppose that $F, G: \{0,1\}^* \to \{0,1\}$ are regular. For each one of the following definitions of the function $H$, either prove that $H$ is always regular or give a counterexample for regular $F, G$ that would make $H$ not regular.

1. $H(x) = F(x) \vee G(x)$.

2. $H(x) = F(x) \wedge G(x)$.

3. $H(x) = \mathrm{NAND}(F(x), G(x))$.

4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1} x_{n-2} \cdots x_0$ for $n = |x|$.

5. $H(x) = 1$ if $x = uv$ s.t. $F(u) = G(v) = 1$, and $H(x) = 0$ otherwise.

6. $H(x) = 1$ if $x = uu$ s.t. $F(u) = G(u) = 1$, and $H(x) = 0$ otherwise.

7. $H(x) = 1$ if $x = u u^R$ s.t. $F(u) = G(u) = 1$, and $H(x) = 0$ otherwise.

■

Exercise 6.2 One among the following two functions that map $\{0,1\}^*$ to $\{0,1\}$ can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.

• $F(x) = 1$ if $4$ divides $\sum_{i=0}^{|x|-1} x_i$, and $F(x) = 0$ otherwise.

• $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$, and $G(x) = 0$ otherwise.

■

Exercise 6.3 — Non-regularity. 1. Prove that the following function $F: \{0,1\}^* \to \{0,1\}$ is not regular: for every $x \in \{0,1\}^*$, $F(x) = 1$ iff $x$ is of the form $x = 1^{3^i}$ for some $i > 0$.

2. Prove that the following function $F: \{0,1\}^* \to \{0,1\}$ is not regular: for every $x \in \{0,1\}^*$, $F(x) = 1$ iff $\sum_j x_j = 3^i$ for some $i > 0$.

■

6.8 BIBLIOGRAPHICAL NOTES

The relation of regular expressions with finite automata is a beautiful topic, which we only touch upon in this text. It is covered more extensively in [Sip97; HMU14; Koz97]. These texts also discuss topics such as nondeterministic finite automata (NFA) and the relation between context-free grammars and pushdown automata. The automaton of Fig. 6.4 was generated using the FSM simulator of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.11 is closely related to the Myhill-Nerode Theorem. One direction of the Myhill-Nerode theorem can be stated as saying that if $e$ is a regular expression then there is at most a finite number of strings $z_0, \ldots, z_{k-1}$ such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i \neq j < k$.

Learning Objectives:
• Learn the model of Turing machines, which can compute functions of arbitrary input lengths.
• See a programming-language description of Turing machines, using NAND-TM programs, which add loops and arrays to NAND-CIRC.
• See some basic syntactic sugar and equivalence of variants of Turing machines and NAND-TM programs.

7
Loops and infinity

"The bounds of arithmetic were however outstepped the moment the idea of applying the [punched] cards had occurred; and the Analytical Engine does not occupy common ground with mere 'calculating machines.' … In enabling mechanism to combine together general symbols, in successions of unlimited variety and extent, a uniting link is established between the operations of matter and the abstract mental processes of the most abstract branch of mathematical science.", Ada Augusta, countess of Lovelace, 1843

As the quote of Chapter 6 says, an algorithm is "a finite answer to an infinite number of questions". To express an algorithm we need to write down a finite set of instructions that will enable us to compute on arbitrarily long inputs. To describe and execute an algorithm we need the following components (see Fig. 7.1):

• The finite set of instructions to be performed.

• Some “local variables” or finite state used in the execution.

• A potentially unbounded working memory to store the input as well as any other values we may require later.

• While the memory is unbounded, at every single step we can only read and write to a finite part of it, and we need a way to specify which parts we want to read from and write to.

• If we only have a finite set of instructions but our input can be arbitrarily long, we will need to repeat instructions (i.e., loop back). We need a mechanism to decide when we will loop and when we will halt.

Figure 7.1: An algorithm is a finite recipe to compute on arbitrarily long inputs. The components of an algorithm include the instructions to be performed, finite state or "local variables", the memory to store the input and intermediate computations, as well as mechanisms to decide which part of the memory to access, and when to repeat instructions and when to halt.

This chapter: A non-mathy overview
In this chapter we give a general model of an algorithm, which (unlike Boolean circuits) is not restricted to a fixed input length, and (unlike finite automata) is not restricted to


a finite amount of working memory. We will see two ways to model algorithms:

• Turing machines, invented by Alan Turing in 1936, are hypothetical abstract devices that yield a finite description of an algorithm that can handle arbitrarily long inputs.

• The NAND-TM Programming language extends NAND-CIRC with the notion of loops and arrays to obtain finite programs that can compute a function with arbitrarily long inputs.

It turns out that these two models are equivalent, and in fact they are equivalent to many other computational models including programming languages such as C, Lisp, Python, JavaScript, etc. This notion, known as Turing equivalence or Turing completeness, will be discussed in Chapter 8. See Fig. 7.2 for an overview of the models presented in this chapter and Chapter 8.

Figure 7.2: Overview of our models for finite and unbounded computation. In the previous chapters we studied the computation of finite functions, which are functions $f: \{0,1\}^n \to \{0,1\}^m$ for some fixed $n, m$, and modeled computing these functions using circuits or straight-line programs. In this chapter we study computing unbounded functions of the form $F: \{0,1\}^* \to \{0,1\}$ or $F: \{0,1\}^* \to \{0,1\}^*$. We model computing these functions using Turing Machines or (equivalently) NAND-TM programs, which add the notion of loops to the NAND-CIRC programming language. In Chapter 8 we will show that these models are equivalent to many other models, including RAM machines, the $\lambda$ calculus, and all the common programming languages including C, Python, Java, JavaScript, etc.

7.1 TURING MACHINES

"Computing is normally done by writing certain symbols on paper. We may suppose that this paper is divided into squares like a child's arithmetic book… The behavior of the [human] computer at any moment is determined by the symbols which he is observing, and of his 'state of mind' at that moment… We may suppose that in a simple operation not more than one symbol is altered.", "We compare a man in the process of computing … to a machine which is only capable of a finite number of configurations… The machine is supplied with a 'tape' (the analogue of paper) … divided into sections (called 'squares') each capable of bearing a 'symbol' ", Alan Turing, 1936

“What is the difference between a Turing machine and the modern computer? It’s the same as that between Hillary’s ascent of Everest and the establishment of a Hilton hotel on its peak.” , Alan Perlis, 1982.

The "granddaddy" of all models of computation is the Turing Machine. Turing machines were defined in 1936 by Alan Turing in an attempt to formally capture all the functions that can be computed by human "computers" (see Fig. 7.4) that follow a well-defined set of rules, such as the standard algorithms for addition or multiplication.

Turing thought of such a person as having access to as much "scratch paper" as they need. For simplicity we can think of this scratch paper as a one dimensional piece of graph paper (or tape, as it is commonly referred to), which is divided to "cells", where each "cell" can hold a single symbol (e.g., one digit or letter, and more generally some element of a finite alphabet). At any point in time, the person can read from and write to a single cell of the paper, and based on the contents can update his/her finite mental state, and/or move to the cell immediately to the left or right of the current one.

Turing modeled such a computation by a "machine" that maintains one of $k$ states. At each point in time the machine reads from its "work tape" a single symbol from a finite alphabet $\Sigma$ and uses that to update its state, write to the tape, and possibly move to an adjacent cell (see Fig. 7.7). To compute a function $F$ using this machine, we initialize the tape with the input $x \in \{0,1\}^*$ and our goal is to ensure that the tape will contain the value $F(x)$ at the end of the computation. Specifically, a computation of a Turing Machine $M$ with $k$ states and alphabet $\Sigma$ on input $x \in \{0,1\}^*$ proceeds as follows:

• Initially the machine is at state $0$ (known as the "starting state") and the tape is initialized to $\triangleright, x_0, \ldots, x_{n-1}, \emptyset, \emptyset, \ldots$. We use the symbol $\triangleright$ to denote the beginning of the tape, and the symbol $\emptyset$ to denote an empty cell. We will always assume that the alphabet $\Sigma$ is a (potentially strict) superset of $\{\triangleright, \emptyset, 0, 1\}$.

• The location $i$ to which the machine points to is set to $0$.

• At each step, the machine reads the symbol $\sigma = T[i]$ that is in the $i^{th}$ location of the tape, and based on this symbol and its state $s$ decides on:

  – What symbol $\sigma'$ to write on the tape
  – Whether to move Left (i.e., $i \leftarrow i - 1$), Right (i.e., $i \leftarrow i + 1$), Stay in place, or Halt the computation.
  – What is going to be the new state $s \in [k]$

• The set of rules the Turing machine follows is known as its transition function.

Figure 7.3: Aside from his many other achievements, Alan Turing was an excellent long distance runner who just fell shy of making England's olympic team. A fellow runner once asked him why he punished himself so much in training. Alan said "I have such a stressful job that the only way I can get it out of my mind is by running hard; it's the only way I can get some release."

Figure 7.4: Until the advent of electronic computers, the word "computer" was used to describe a person that performed calculations. Most of these "human computers" were women, and they were absolutely essential to many achievements including mapping the stars, breaking the Enigma cipher, and the NASA space mission; see also the bibliographical notes. Photo from National Photo Company Collection; see also [Sob17].

Figure 7.5: Steam-powered Turing Machine mural, painted by CSE grad students at the University of Washington on the night before spring qualifying examinations, 1987. Image from https://www.cs.washington.edu/building/art/SPTM.

• When the machine halts then its output is the binary string obtained by reading the tape from the beginning until the head position, dropping all symbols such as $\emptyset$, $\triangleright$, etc. that are not either $0$ or $1$.

7.1.1 Extended example: A Turing machine for palindromes
Let PAL (for palindromes) be the function that on input $x \in \{0,1\}^*$, outputs $1$ if and only if $x$ is an (even length) palindrome, in the sense that $x = w_0 \cdots w_{n-1} w_{n-1} w_{n-2} \cdots w_0$ for some $n \in \mathbb{N}$ and $w \in \{0,1\}^n$.

We now show a Turing Machine $M$ that computes PAL. To specify $M$ we need to specify (i) $M$'s tape alphabet $\Sigma$ which should contain at least the symbols $0$, $1$, $\triangleright$ and $\emptyset$, and (ii) $M$'s transition function which determines what action $M$ takes when it reads a given symbol while it is in a particular state.

Figure 7.6: The components of a Turing Machine. Note how they correspond to the general components of algorithms as described in Fig. 7.1.

In our case, $M$ will use the alphabet $\{0, 1, \triangleright, \emptyset, \times\}$ and will have $k = 14$ states. Though the states are simply numbers between $0$ and $k - 1$, for convenience we will give them the following labels:

State  Label
0      START
1      RIGHT_0
2      RIGHT_1
3      LOOK_FOR_0
4      LOOK_FOR_1
5      RETURN
6      REJECT
7      ACCEPT
8      OUTPUT_0
9      OUTPUT_1
10     0_AND_BLANK
11     1_AND_BLANK
12     BLANK_AND_STOP

We describe the operation of our Turing Machine in words:

• $M$ starts in state START and will go right, looking for the first symbol that is $0$ or $1$. If we find $\emptyset$ before we hit such a symbol then we will move to the OUTPUT_1 state that we describe below.

• Once $M$ finds such a symbol $b \in \{0,1\}$, $M$ deletes $b$ from the tape by writing the $\times$ symbol; it enters either the RIGHT_0 or RIGHT_1 mode according to the value of $b$ and starts moving rightwards until it hits the first $\emptyset$ or $\times$ symbol.

• Once we find this symbol we go into the state LOOK_FOR_0 or LOOK_FOR_1 depending on whether we were in the state RIGHT_0 or RIGHT_1 and make one left move.

• In the state LOOK_FOR_$b$, we check whether the value on the tape is $b$. If it is, then we delete it by changing its value to $\times$, and move to the state RETURN. Otherwise, we change to the OUTPUT_0 state.

• The RETURN state means we go back to the beginning. Specifically, we move leftward until we hit the first symbol that is not $0$ or $1$, in which case we change our state to START.

• The OUTPUT_$b$ states mean that we are going to output the value $b$. In both these states we go left until we hit $\triangleright$. Once we do so, we make a right step, and change to the 1_AND_BLANK or 0_AND_BLANK states respectively. In the latter states, we write the corresponding value, and then move right and change to the BLANK_AND_STOP state, in which we write $\emptyset$ to the tape and halt.

The above description can be turned into a table describing for each one of the $13 \cdot 5$ combinations of state and symbol, what the Turing machine will do when it is in that state and it reads that symbol. This table is known as the transition function of the Turing machine.
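For concreteness, here are a few entries of that table written out in Python dictionary form. This is a sketch using our own (hypothetical) conventions: states are named by the labels above, "_" stands for the blank symbol $\emptyset$, ">" for $\triangleright$, and "x" for $\times$; the full table has one entry per state/symbol combination.

# A few illustrative entries of PAL's transition function:
# delta(state, symbol) = (new state, symbol to write, direction),
# following the word description above.
delta_PAL = {
    ("START", ">"):   ("START", ">", "R"),       # skip the start marker
    ("START", "0"):   ("RIGHT_0", "x", "R"),     # delete b=0, enter RIGHT_0
    ("START", "1"):   ("RIGHT_1", "x", "R"),     # delete b=1, enter RIGHT_1
    ("RIGHT_0", "0"): ("RIGHT_0", "0", "R"),     # keep scanning rightwards
    ("RIGHT_0", "1"): ("RIGHT_0", "1", "R"),
    ("RIGHT_0", "_"): ("LOOK_FOR_0", "_", "L"),  # hit a blank: one left move
    # ... the remaining entries follow the same pattern
}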

7.1.2 Turing machines: a formal definition

Figure 7.7: A Turing machine has access to a tape of unbounded length. At each point in the execution, the machine can read a single symbol of the tape, and based on that and its current state, write a new symbol, update the tape, decide whether to move left, right, stay, or halt.

The formal definition of Turing machines is as follows:

Definition 7.1 — Turing Machine. A (one tape) Turing machine $M$ with $k$ states and alphabet $\Sigma \supseteq \{0, 1, \triangleright, \emptyset\}$ is represented by a transition function $\delta_M: [k] \times \Sigma \to [k] \times \Sigma \times \{\text{L}, \text{R}, \text{S}, \text{H}\}$.

For every $x \in \{0,1\}^*$, the output of $M$ on input $x$, denoted by $M(x)$, is the result of the following process:

• We initialize $T$ to be the sequence $\triangleright, x_0, x_1, \ldots, x_{n-1}, \emptyset, \emptyset, \ldots$, where $n = |x|$. (That is, $T[0] = \triangleright$, $T[i+1] = x_i$ for $i \in [n]$, and $T[i] = \emptyset$ for $i > n$.)

• We also initialize $i = 0$ and $s = 0$.

• We then repeat the following process:

  1. Let $(s', \sigma', D) = \delta_M(s, T[i])$.
  2. Set $s \to s'$, $T[i] \to \sigma'$.
  3. If $D = \text{R}$ then set $i \to i + 1$; if $D = \text{L}$ then set $i \to \max\{i - 1, 0\}$. (If $D = \text{S}$ then we keep $i$ the same.)
  4. If $D = \text{H}$ then halt.

• If the process above halts, then $M$'s output, denoted by $M(x)$, is the string $y \in \{0,1\}^*$ obtained by concatenating all the symbols in $\{0,1\}$ among the positions $T[0], \ldots, T[i]$, where $i$ is the final head position.

• If the Turing machine does not halt then we denote $M(x) = \bot$.

P You should make sure you see why this formal definition corresponds to our informal description of a Turing Machine. To get more intuition on Turing Machines, you can explore some of the simulators available online, such as Martin Ugarte's, Anthony Morphett's, or Paul Rendell's.
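In the same spirit, here is a minimal simulator of Definition 7.1 in Python (a sketch of our own; it uses the same dictionary convention as the PAL entries sketched in Section 7.1.1, with "_" playing the role of $\emptyset$ and ">" the role of $\triangleright$).

# Runs the Turing machine given by the transition dictionary delta on
# input x, following the process of Definition 7.1 step by step.
def run_TM(delta, x, start_state=0):
    tape = [">"] + list(x)
    i, s = 0, start_state
    while True:
        if i >= len(tape):
            tape.append("_")                    # blank cells appear on demand
        s, tape[i], D = delta[(s, tape[i])]     # (s', sigma', D) = delta(s, T[i])
        if D == "R":
            i += 1
        elif D == "L":
            i = max(i - 1, 0)
        elif D == "H":
            # halt: output the 0/1 symbols among T[0..i]
            return "".join(c for c in tape[:i + 1] if c in "01")
        # D == "S": stay in place and continue

For example, run_TM(delta_PAL, "0110", start_state="START") would compute PAL("0110") once delta_PAL is filled in completely.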

One should not confuse the transition function $\delta_M$ of a Turing machine $M$ with the function that the machine computes. The transition function $\delta_M$ is a finite function, with $k|\Sigma|$ inputs and $4k|\Sigma|$ outputs. (Can you see why?) The machine can compute an infinite function $F$ that takes as input a string $x \in \{0,1\}^*$ of arbitrary length and might also produce an arbitrary length string as output.

In our formal definition, we identified the machine $M$ with its transition function $\delta_M$ since the transition function tells us everything we need to know about the Turing machine, and hence serves as a good mathematical representation of it. This choice of representation is somewhat arbitrary, and is based on our convention that the state space is always the numbers $\{0, \ldots, k - 1\}$ with $0$ as the starting state. Other texts use different conventions and so their mathematical

definition of a Turing machine might look superficially different, but these definitions describe the same computational process and have the same computational powers. See Section 7.7 for a comparison between Definition 7.1 and the way Turing Machines are defined in texts such as Sipser [Sip97]. These definitions are equivalent despite their superficial differences.

7.1.3 Computable functions
We now turn to making one of the most important definitions in this book, that of computable functions.

Definition 7.2 — Computable functions. Let $F: \{0,1\}^* \to \{0,1\}^*$ be a (total) function and let $M$ be a Turing machine. We say that $M$ computes $F$ if for every $x \in \{0,1\}^*$, $M(x) = F(x)$. We say that a function $F$ is computable if there exists a Turing machine $M$ that computes it.

Defining a function "computable" if and only if it can be computed by a Turing machine might seem "reckless" but, as we'll see in Chapter 8, it turns out that being computable in the sense of Definition 7.2 is equivalent to being computable in essentially any reasonable model of computation. This is known as the Church-Turing Thesis. (Unlike the extended Church-Turing Thesis which we discussed in Section 5.6, the Church-Turing thesis itself is widely believed and there are no candidate devices that attack it.)

Big Idea 9: We can precisely define what it means for a function to be computable by any possible algorithm.

This is a good point to remind the reader that functions are not the same as programs:

Functions $\neq$ Programs    (7.1)

A Turing machine (or program) $M$ can compute some function $F$, but it is not the same as $F$. In particular there can be more than one program to compute the same function. Being computable is a property of functions, not of machines.

We will often pay special attention to functions $F: \{0,1\}^* \to \{0,1\}$ that have a single bit of output. Hence we give a special name for the set of functions of this form that are computable.

Definition 7.3 — The class R. We define R to be the set of all computable functions $F: \{0,1\}^* \to \{0,1\}$.

R Remark 7.4 — Functions vs. languages. As discussed in Section 6.1.2, many texts use the terminology of "languages" rather than functions to refer to computational tasks. A Turing machine $M$ decides a language $L$ if for every input $x \in \{0,1\}^*$, $M(x)$ outputs $1$ if and only if $x \in L$. This is equivalent to computing the Boolean function $F: \{0,1\}^* \to \{0,1\}$ defined as $F(x) = 1$ iff $x \in L$. A language $L$ is decidable if there is a Turing machine $M$ that decides it. For historical reasons, some texts also call such a language recursive (which is the reason that the letter R is often used to denote the set of computable Boolean functions / decidable languages defined in Definition 7.3). In this book we stick to the terminology of functions rather than languages, but all definitions and results can be easily translated back and forth by using the equivalence between the function $F: \{0,1\}^* \to \{0,1\}$ and the language $L = \{x \in \{0,1\}^* \mid F(x) = 1\}$.

7.1.4 Infinite loops and partial functions
One crucial difference between circuits/straight-line programs and Turing machines is the following. Looking at a NAND-CIRC program $P$, we can always tell how many inputs and how many outputs it has (by simply looking at the X and Y variables). Furthermore, we are guaranteed that if we invoke $P$ on any input then some output will be produced.

In contrast, given any Turing machine $M$, we cannot determine a priori the length of the output. In fact, we don't even know if an output would be produced at all! For example, it is very easy to come up with a Turing machine whose transition function never outputs H and hence never halts.

If a machine $M$ fails to stop and produce an output on some input $x$, then it cannot compute any total function $F$, since clearly on input $x$, $M$ will fail to output $F(x)$. However, $M$ can still compute a partial function.¹

¹ A partial function $F$ from a set $A$ to a set $B$ is a function that is only defined on a subset of $A$ (see Section 1.4.3). We can also think of such a function as mapping $A$ to $B \cup \{\bot\}$ where $\bot$ is a special "failure" symbol such that $F(a) = \bot$ indicates the function $F$ is not defined on $a$.

For example, consider the partial function DIV that on input a pair $(a, b)$ of natural numbers, outputs $\lceil a/b \rceil$ if $b > 0$, and is undefined otherwise. We can define a Turing machine $M$ that computes DIV on input $a, b$ by outputting the first $c = 0, 1, 2, \ldots$ such that $cb \geq a$. If $a > 0$ and $b = 0$ then the machine $M$ will never halt, but this is OK, since DIV is undefined on such inputs. If $a = 0$ and $b = 0$ then $M$ will output $0$, which is also OK, since we don't care about what the program outputs on inputs on which DIV is undefined. Formally, we define computability of partial functions as follows:

Definition 7.5 — Computable (partial or total) functions. Let $F$ be either a total or partial function mapping $\{0,1\}^*$ to $\{0,1\}^*$ and let $M$ be a Turing machine. We say that $M$ computes $F$ if for every $x \in \{0,1\}^*$ on which $F$ is defined, $M(x) = F(x)$. We say that a (partial or total) function $F$ is computable if there is a Turing machine that computes it.

Note that if $F$ is a total function, then it is defined on every $x \in \{0,1\}^*$ and hence in this case, Definition 7.5 is identical to Definition 7.2.
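To see Definition 7.5 in action, here is the search procedure for DIV from above written as plain Python (a sketch of our own): on inputs where DIV is undefined with $a > 0$, it loops forever, which Definition 7.5 permits.

# Computes ceil(a/b) by searching for the first c with c*b >= a.
# When b == 0 and a > 0 the loop never terminates, mirroring the machine M
# described above; DIV is undefined there, so any behavior is acceptable.
def DIV(a, b):
    c = 0
    while c * b < a:
        c += 1
    return c

print(DIV(7, 2))   # 4, since the first c with 2c >= 7 is c = 4
print(DIV(0, 0))   # 0, harmless: DIV is undefined on (0, 0) anyway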

R Remark 7.6 — Bot symbol. We often use $\bot$ as our special "failure symbol". If a Turing machine $M$ fails to halt on some input $x \in \{0,1\}^*$ then we denote this by $M(x) = \bot$. This does not mean that $M$ outputs some encoding of the symbol $\bot$ but rather that $M$ enters into an infinite loop when given $x$ as input.

If a partial function $F$ is undefined on $x$ then we can also write $F(x) = \bot$. Therefore one might think that Definition 7.5 could be simplified to requiring that $M(x) = F(x)$ for every $x \in \{0,1\}^*$, which would imply that for every $x$, $M$ halts on $x$ if and only if $F$ is defined on $x$. However this is not the case: for a Turing Machine $M$ to compute a partial function $F$ it is not necessary for $M$ to enter an infinite loop on inputs $x$ on which $F$ is not defined. All that is needed is for $M$ to output $F(x)$ on $x$'s on which $F$ is defined: on other inputs it is OK for $M$ to output an arbitrary value such as $0$, $1$, or anything else, or not to halt at all. To borrow a term from the C programming language, on inputs $x$ on which $F$ is not defined, what $M$ does is "undefined behavior".

7.2 TURING MACHINES AS PROGRAMMING LANGUAGES

The name "Turing machine", with its "tape" and "head", evokes a physical object, while in contrast we think of a program as a piece of text. But we can think of a Turing machine as a program as well. For example, consider the Turing Machine $M$ of Section 7.1.1 that computes the function PAL such that PAL$(x) = 1$ iff $x$ is a palindrome. We can also describe this machine as a program using the Python-like pseudocode of the form below

# Gets an array Tape initialized to
# [">", x_0 , x_1 , ..., x_(n-1), "∅", "∅", ...]
# At the end of the execution, Tape[1] is equal to 1
# if x is a palindrome and is equal to 0 otherwise

def PAL(Tape):
    head = 0
    state = 0 # START
    while (state != 12):
        if (state == 0 and Tape[head] == '0'):
            state = 3 # LOOK_FOR_0
            Tape[head] = 'x'
            head += 1 # move right
        if (state == 0 and Tape[head] == '1'):
            state = 4 # LOOK_FOR_1
            Tape[head] = 'x'
            head += 1 # move right
        ... # more if statements here

The particular details of this program are not important. What matters is that we can describe Turing machines as programs. Moreover, note that when translating a Turing machine into a program, the tape becomes a list or array that can hold values from the finite set $\Sigma$.² The head position can be thought of as an integer valued variable that can hold integers of unbounded size. The state is a local register that can hold one of a fixed number of values in $[k]$.

² Most programming languages use arrays of fixed size, while a Turing machine's tape is unbounded. But of course there is no need to store an infinite number of $\emptyset$ symbols. If you want, you can think of the tape as a list that starts off just long enough to store the input, but is dynamically grown in size as the Turing machine's head explores new positions.

More generally we can think of every Turing Machine $M$ as equivalent to a program similar to the following:

# Gets an array Tape initialized to
# [">", x_0 , x_1 , ..., x_(n-1), "∅", "∅", ...]
def M(Tape):
    state = 0
    i = 0 # holds head location
    while (True):
        # Move head, modify state, write to tape
        # based on current state and cell at head.
        # Below are just examples of how the program looks
        # for a particular transition function:
        if Tape[i] == "0" and state == 7: # δ_M(7,"0")=(19,"1","R")
            Tape[i] = "1"
            state = 19
            i += 1
        elif Tape[i] == ">" and state == 13: # δ_M(13,">")=(15,"0","S")
            Tape[i] = "0"
            state = 15
        elif ...
        ...
        elif Tape[i] == ">" and state == 29: # δ_M(29,">")=(.,.,"H")
            break # Halt

If we wanted to use only Boolean (i.e., $0$/$1$-valued) variables then we can encode the variable state using $\lceil \log k \rceil$ bits. Similarly, we can represent each element of the alphabet $\Sigma$ using $\ell = \lceil \log |\Sigma| \rceil$ bits and hence we can replace the $\Sigma$-valued array Tape[] with $\ell$ Boolean-valued arrays Tape0[], $\ldots$, Tape$(\ell - 1)$[].

7.2.1 The NAND-TM Programming language
We now introduce the NAND-TM programming language, which aims to capture the power of a Turing machine in a programming language formalism. Just like the difference between Boolean circuits and Turing Machines, the main difference between NAND-TM and NAND-CIRC is that NAND-TM models a single uniform algorithm that can compute a function that takes inputs of arbitrary lengths. To do so, we extend the NAND-CIRC programming language with two constructs:

• Loops: NAND-CIRC is a straight-line programming language: a NAND-CIRC program of $s$ lines takes exactly $s$ steps of computation and hence in particular cannot even touch more than $3s$ variables. Loops allow us to capture in a short program the instructions for a computation that can take an arbitrary amount of time.

• Arrays: A NAND-CIRC program of $s$ lines touches at most $3s$ variables. While we can use variables with names such as Foo_17 or Bar[22], they are not true arrays, since the number in the identifier is a constant that is "hardwired" into the program.

Figure 7.8: A NAND-TM program has scalar variables that can take a Boolean value, array variables that hold a sequence of Boolean values, and a special index variable i that can be used to index the array variables. We refer to the i-th value of the array variable Spam using Spam[i]. At each iteration of the program the index variable can be incremented or decremented by one step using the MODANDJMP operation.

Thus a good way to remember NAND-TM is using the following informal equation:

NAND-TM $=$ NAND-CIRC $+$ loops $+$ arrays    (7.2)

R Remark 7.7 — NAND-CIRC + loops + arrays = everything. As we will see, adding loops and arrays to NAND-CIRC is enough to capture the full power of all programming languages! Hence we could replace "NAND-TM" with any of Python, C, Javascript, OCaml, etc. in the lefthand side of (7.2). But we're getting ahead of ourselves: this issue will be discussed in Chapter 8.

Concretely, the NAND-TM programming language adds the following features on top of NAND-CIRC (see Fig. 7.8):

• We add a special integer valued variable i. All other variables in NAND-TM are Boolean valued (as in NAND-CIRC).

• Apart from i, NAND-TM has two kinds of variables: scalars and arrays. Scalar variables hold one bit (just as in NAND-CIRC). Array variables hold an unbounded number of bits. At any point in the computation we can access the array variables at the location indexed by i using Foo[i]. We cannot access the arrays at locations other than the one pointed to by i.

• We use the convention that arrays always start with a capital letter, and scalar variables (which are never indexed with i) start with lowercase letters. Hence Foo is an array and bar is a scalar variable.

• The input and output X and Y are now considered arrays with values of zeroes and ones. (There are also two other special arrays X_nonblank and Y_nonblank, see below.)

• We add a special MODANDJUMP instruction that takes two Boolean variables $a, b$ as input and does the following:

  – If $a = 1$ and $b = 1$ then MODANDJUMP($a, b$) increments i by one and jumps to the first line of the program.
  – If $a = 0$ and $b = 1$ then MODANDJUMP($a, b$) decrements i by one and jumps to the first line of the program. (If i is already equal to $0$ then it stays at $0$.)
  – If $a = 1$ and $b = 0$ then MODANDJUMP($a, b$) jumps to the first line of the program without modifying i.
  – If $a = b = 0$ then MODANDJUMP($a, b$) halts execution of the program.

• The MODANDJUMP instruction always appears in the last line of a NAND-TM program and nowhere else.

Default values. We need one more convention to handle "default values". Turing machines have the special symbol $\emptyset$ to indicate that a tape location is "blank" or "uninitialized". In NAND-TM there is no such symbol, and all variables are Boolean, containing either $0$ or $1$. All variables and locations of arrays default to $0$ if they have not been initialized to another value. To keep track of whether a $0$ in an array corresponds to a true zero or to an uninitialized cell, a programmer can always add to an array Foo a "companion array" Foo_nonblank and set Foo_nonblank[i] to $1$ whenever the i'th location is initialized. In particular we will use this convention for the input and output arrays X and Y. A NAND-TM program has four special arrays X, X_nonblank, Y, and Y_nonblank. When a NAND-TM program is executed on input $x \in \{0,1\}^*$ of length $n$, the first $n$ cells of the array X are initialized to $x_0, \ldots, x_{n-1}$ and the first $n$ cells of the array X_nonblank are initialized to $1$. (All uninitialized cells default to $0$.) The output of a NAND-TM program is the string Y[$0$], $\ldots$, Y[$m - 1$] where $m$ is the smallest integer such that Y_nonblank[$m$] $= 0$. A NAND-TM program gets called with X and X_nonblank initialized to contain the input, and writes to Y and Y_nonblank to produce the output.

Formally, NAND-TM programs are defined as follows:

Definition 7.8 — NAND-TM programs. A NAND-TM program consists of a sequence of lines of the form foo = NAND(bar,blah) ending with a line of the form MODANDJMP(foo,bar), where foo, bar, blah are either scalar variables (sequences of letters, digits, and underscores) or array variables of the form Foo[i] (starting with a capital letter and indexed by i). The program has the array variables X, X_nonblank, Y, Y_nonblank and the index variable i built in, and can use additional array and scalar variables.

If $P$ is a NAND-TM program and $x \in \{0,1\}^*$ is an input then an execution of $P$ on $x$ is the following process:

1. The arrays X and X_nonblank are initialized by X[$i$] $= x_i$ and X_nonblank[$i$] $= 1$ for all $i \in [|x|]$. All other variables and cells are initialized to $0$. The index variable i is also initialized to $0$.

2. The program is executed line by line; when the last line MODANDJMP(foo,bar) is executed then we do as follows:

a. If foo $= 1$ and bar $= 0$ then jump to the first line without modifying the value of i.

b. If foo $= 1$ and bar $= 1$ then increment i by one and jump to the first line.

c. If foo $= 0$ and bar $= 1$ then decrement i by one (unless it is already zero) and jump to the first line.

d. If foo $= 0$ and bar $= 0$ then halt and output Y[$0$], $\ldots$, Y[$m - 1$] where $m$ is the smallest integer such that Y_nonblank[$m$] $= 0$.

7.2.2 Sneak peek: NAND-TM vs Turing machines
As the name implies, NAND-TM programs are a direct implementation of Turing machines in programming language form. We will show the equivalence below, but you can already see how the components of Turing machines and NAND-TM programs correspond to one another:

Table 7.2: Turing Machine and NAND-TM analogs

• State (Turing machine): a single register that takes values in $[k]$ ↔ Scalar variables (NAND-TM): several variables such as foo, bar etc., each taking values in $\{0,1\}$.

• Tape (Turing machine): one tape containing values in a finite set $\Sigma$; potentially infinite, but $T[t]$ defaults to $\emptyset$ for all locations $t$ that have not been accessed ↔ Arrays (NAND-TM): several arrays such as Foo, Bar etc.; for each such array Arr and index $j$, the value of Arr at position $j$ is either $0$ or $1$, and defaults to $0$ for positions that have not been written to.

• Head location (Turing machine): a number $i \in \mathbb{N}$ that encodes the position of the head ↔ Index variable (NAND-TM): the variable i that can be used to access the arrays.

• Accessing memory (Turing machine): at every step the machine has access to its local state, but can only access the tape at the position of the current head location ↔ Accessing memory (NAND-TM): at every step the program has access to all the scalar variables, but can only access the arrays at the location i of the index variable.

• Control of location (Turing machine): in each step the machine can move the head location by at most one position ↔ Control of index variable (NAND-TM): in each iteration of its main loop the program can modify the index i by at most one.

7.2.3 Examples
We now present some examples of NAND-TM programs.

■ Example 7.9 — Increment in NAND-TM. The following is a NAND-TM program to compute the increment function. That is, INC$: \{0,1\}^* \to \{0,1\}^*$ such that for every $x \in \{0,1\}^n$, INC$(x)$ is the $n + 1$ bit long string $y$ such that if $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$ is the number represented by $x$, then $y$ is the (least-significant digit first) binary representation of the number $X + 1$.

We start by showing the program using the "syntactic sugar" we've seen before, using shorthand for some NAND-CIRC programs we have seen before to compute simple functions such as IF, XOR and AND (as well as the constant one function and the function COPY that just maps a bit to itself).

carry = IF(started,carry,one(started))
started = one(started)
Y[i] = XOR(X[i],carry)
carry = AND(X[i],carry)
Y_nonblank[i] = one(started)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

The above is not, strictly speaking, a valid NAND-TM program. If we "open up" all of the syntactic sugar, we get the following "sugar free" valid program to compute the same function.

temp_0 = NAND(started,started)
temp_1 = NAND(started,temp_0)
temp_2 = NAND(started,started)
temp_3 = NAND(temp_1,temp_2)
temp_4 = NAND(carry,started)
carry = NAND(temp_3,temp_4)
temp_6 = NAND(started,started)
started = NAND(started,temp_6)
temp_8 = NAND(X[i],carry)
temp_9 = NAND(X[i],temp_8)
temp_10 = NAND(carry,temp_8)
Y[i] = NAND(temp_9,temp_10)
temp_12 = NAND(X[i],carry)
carry = NAND(temp_12,temp_12)
temp_14 = NAND(started,started)
Y_nonblank[i] = NAND(started,temp_14)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
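As a hypothetical end-to-end check, this sugar-free program can be fed to the run_nandtm interpreter sketched after Table 7.2 (both are our own constructions, not part of the text's formal development):

# Suppose inc_program is a list containing the 17 lines above, one string
# per line. "111" encodes 7 (least-significant digit first), and 7 + 1 = 8,
# whose 4-bit LSB-first representation is "0001".
print(run_nandtm(inc_program, "111"))   # expected output: 0001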

■ Example 7.10 — XOR in NAND-TM. The following is a NAND-TM program to compute the XOR function on inputs of arbitrary length. That is, XOR$: \{0,1\}^* \to \{0,1\}$ such that XOR$(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x \in \{0,1\}^*$. Once again, we use a certain "syntactic sugar". Specifically, we access the arrays X and Y at their zero-th entry, while NAND-TM only allows access to arrays in the coordinate of the variable i.

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

To transform the program above to a valid NAND-TM program, we can transform references such as X[0] and Y[0] to scalar variables x_0 and y_0 (similarly we can transform any reference of the form Foo[17] or Bar[15] to scalars such as foo_17 and bar_15). We then need to add code to load the value of X[0] to x_0 and similarly to write to Y[0] the value of y_0, but this is not hard to do. Using the fact that variables are initialized to zero by default, we can create a variable init which will be set to $1$ at the end of the first iteration and not changed afterwards. We can then add an array Atzero and code that will set Atzero[i] to $1$ if init is $0$ and otherwise leave it as it is. This will ensure that Atzero[i] is equal to $1$ if and only if i is set to zero, allowing the program to know when we are at the zero-th location. Thus we can add code to read and write to the corresponding scalars x_0, y_0 when we are at the zero-th location, and also code to move i to zero and then halt at the end. Working this out fully is somewhat tedious, but can be a good exercise.
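For intuition, the function the XOR program computes is easy to state in plain Python (a sketch of our own, not NAND-TM): the scalar standing for Y[0] accumulates the parity of the input bits, one bit per iteration of the main loop.

# Each loop iteration mirrors one iteration of the NAND-TM program:
# Y[0] = XOR(X[i], Y[0]) while X_nonblank[i] == 1.
def XOR(x):
    y0 = 0
    for xi in x:
        y0 = y0 ^ xi
    return y0

print(XOR([1, 0, 1, 1]))   # 1, since 1 + 0 + 1 + 1 = 3 is odd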

P Working out the above two examples can go a long way towards understanding the NAND-TM language. See our GitHub repository for a full specification of the NAND-TM language.

7.3 EQUIVALENCE OF TURING MACHINES AND NAND-TM PROGRAMS

Given the above discussion, it might not be surprising that Turing machines turn out to be equivalent to NAND-TM programs. Indeed, we designed the NAND-TM language to have this property. Nevertheless, this is an important result, and the first of many other such equivalence results we will see in this book.

Theorem 7.11 — Turing machines and NAND-TM programs are equivalent. For every $F: \{0,1\}^* \to \{0,1\}^*$, $F$ is computable by a NAND-TM program $P$ if and only if there is a Turing Machine $M$ that computes $F$.

Proof Idea: To prove such an equivalence theorem, we need to show two directions. We need to be able to (1) transform a Turing machine $M$ to a NAND-TM program $P$ that computes the same function as $M$, and (2) transform a NAND-TM program $P$ into a Turing machine $M$ that computes the same function as $P$.

The idea of the proof is illustrated in Fig. 7.9. To show (1), given a Turing machine $M$, we will create a NAND-TM program $P$ that will have an array Tape for the tape of $M$ and scalar (i.e., non array) variable(s) state for the state of $M$. Specifically, since the state of a Turing machine is not in $\{0,1\}$ but rather in a larger set $[k]$, we will use $\lceil \log k \rceil$ variables state_$0$, $\ldots$, state_$\lceil \log k \rceil - 1$ to store the representation of the state. Similarly, to encode the larger alphabet $\Sigma$ of the tape, we will use $\lceil \log |\Sigma| \rceil$ arrays Tape_$0$, $\ldots$, Tape_$\lceil \log |\Sigma| \rceil - 1$, such that the $i^{th}$ location of these arrays encodes the $i^{th}$ symbol in the tape. Using the fact that every finite function can be computed by a NAND-CIRC program, we will be able to compute the transition function of $M$, replacing moving left and right by decrementing and incrementing i respectively.

We show (2) using very similar ideas. Given a program $P$ that uses $a$ array variables and $b$ scalar variables, we will create a Turing machine with about $2^b$ states to encode the values of scalar variables, and an alphabet of about $2^a$ so we can encode the arrays using our tape. (The reason the sizes are only "about" $2^a$ and $2^b$ is that we will need to add some symbols and steps for bookkeeping purposes.) The Turing Machine $M$ will simulate each iteration of the program $P$ by updating its state and tape accordingly.

Proof of Theorem 7.11. We start by proving the "if" direction of Theorem 7.11. Namely, we show that given a Turing machine $M$, we can

Figure 7.9: Comparing a Turing Machine to a NAND-TM program. Both have an unbounded memory component (the tape for a Turing machine, and the arrays for a NAND-TM program), as well as a constant local memory (state for a Turing machine, and scalar variables for a NAND-TM program). Both can only access at each step one location of the unbounded memory: this is the "head" location for a Turing machine, and the value of the index variable i for a NAND-TM program.

find a NAND-TM program $P_M$ such that for every input $x$, if $M$ halts on input $x$ with output $y$ then $P_M(x) = y$. Since our goal is just to show such a program $P_M$ exists, we don't need to write out its full code line by line, and can take advantage of our various "syntactic sugar" in describing it.

The key observation is that by Theorem 4.12 we can compute every finite function using a NAND-CIRC program. In particular, consider the transition function $\delta_M: [k] \times \Sigma \to [k] \times \Sigma \times \{\text{L}, \text{R}, \text{S}, \text{H}\}$ of our Turing Machine. We can encode its components as follows:

• We encode $[k]$ using $\{0,1\}^\ell$ and $\Sigma$ using $\{0,1\}^{\ell'}$, where $\ell = \lceil \log k \rceil$ and $\ell' = \lceil \log |\Sigma| \rceil$.

• We encode the set $\{\text{L}, \text{R}, \text{S}, \text{H}\}$ using $\{0,1\}^2$. We will choose the encoding $\text{L} \mapsto 01$, $\text{R} \mapsto 11$, $\text{S} \mapsto 10$, $\text{H} \mapsto 00$. (This conveniently corresponds to the semantics of the MODANDJUMP operation.)

Hence we can identify $\delta_M$ with a function $\overline{M}: \{0,1\}^{\ell + \ell'} \to \{0,1\}^{\ell + \ell' + 2}$, mapping strings of length $\ell + \ell'$ to strings of length $\ell + \ell' + 2$. By Theorem 4.12 there exists a finite length NAND-CIRC program ComputeM that computes this function $\overline{M}$. The NAND-TM program to simulate $M$ will essentially be the following:

Algorithm 7.12 — NAND-TM program to simulate TM $M$.

Input: $x \in \{0,1\}^*$
Output: $M(x)$ if $M$ halts on $x$; otherwise go into an infinite loop.

1: # We use variables state_$0$ $\ldots$ state_$\ell - 1$ to encode $M$'s state
2: # We use arrays Tape_$0$[] $\ldots$ Tape_$\ell' - 1$[] to encode $M$'s tape
3: # We omit the initial and final "bookkeeping" to copy the input to Tape and copy the output from Tape
4: # Use the fact that the transition function is finite and computable by a NAND-CIRC program:
5: state_$0$ $\ldots$ state_$\ell - 1$, Tape_$0$[i] $\ldots$ Tape_$\ell' - 1$[i], dir0, dir1 $\leftarrow$ TRANSITION(state_$0$ $\ldots$ state_$\ell - 1$, Tape_$0$[i] $\ldots$ Tape_$\ell' - 1$[i])
6: MODANDJMP(dir0, dir1)

Every step of the main loop of the above program perfectly mimics the computation of the Turing Machine $M$, and so the program carries out exactly the definition of computation by a Turing Machine as per Definition 7.1.

1. The machine encodes the contents of the array variables of in its tape, and the contents of the scalar variables in (part of) its 푃 state. Specifically,푀 if has local variables and arrays, then the푃 state space of will be large enough to encode all assignments 푃 ℓ 푡 ℓ 푀 2 270 introduction to theoretical computer science

to the local variables and the alphabet $\Sigma$ of $M$ will be large enough to encode all $2^t$ assignments for the array variables at each location. The head location corresponds to the index variable i.

2. Recall that every line of the program $P$ corresponds to reading and writing either a scalar variable, or an array variable at the location i. In one iteration of $P$ the value of i remains fixed, and so the machine $M$ can simulate this iteration by reading the values of all array variables at i (which are encoded by the single symbol in the alphabet $\Sigma$ located at the i-th cell of the tape), reading the values of all scalar variables (which are encoded by the state), and updating both. The transition function of $M$ can output L, S, R depending on whether the values given to the MODANDJMP operation are $01$, $10$ or $11$ respectively.

3. When the program halts (i.e., MODANDJMP gets $00$) then the Turing machine will enter into a special loop to copy the results of the Y array into the output and then halt. We can achieve this by adding a few more states.

The above is not a full formal description of a Turing Machine, but our goal is just to show that such a machine exists. One can see that $M$ simulates every step of $P$, and hence computes the same function as $P$. ■

R Remark 7.13 — Running time equivalence (optional). If we examine the proof of Theorem 7.11 then we can see that every iteration of the loop of a NAND-TM program corresponds to one step in the execution of the Turing machine. We will come back to this question of measuring the number of computation steps later in this course. For now the main takeaway point is that NAND-TM programs and Turing Machines are essentially equivalent in power even when taking running time into account.

7.3.1 Specification vs implementation (again)
Once you understand the definitions of both NAND-TM programs and Turing Machines, Theorem 7.11 is fairly straightforward. Indeed, NAND-TM programs are not as much a different model from Turing Machines as they are simply a reformulation of the same model using programming language notation. You can think of the difference between a Turing machine and a NAND-TM program as the difference between representing a number using decimal or binary notation. In

contrast, the difference between a function $F$ and a Turing machine that computes $F$ is much more profound: it is like the difference between the equation $x^2 + x = 12$ and the number $3$ that is a solution for this equation. For this reason, while we take special care in distinguishing functions from programs or machines, we will often identify the two latter concepts. We will move freely between describing an algorithm as a Turing machine or as a NAND-TM program (as well as some of the other equivalent computational models we will see in Chapter 8 and beyond).

Table 7.3: Specification vs Implementation formalisms

• Setting: finite computation. Specification: functions mapping $\{0,1\}^n$ to $\{0,1\}^m$. Implementation: circuits, straight-line programs.

• Setting: infinite computation. Specification: functions mapping $\{0,1\}^*$ to $\{0,1\}$ or to $\{0,1\}^*$. Implementation: algorithms, Turing Machines, programs.

7.4 NAND-TM SYNTACTIC SUGAR

Just like we did with NAND-CIRC in Chapter 4, we can use "syntactic sugar" to make NAND-TM programs easier to write. For starters, we can use all of the syntactic sugar of NAND-CIRC, and so have access to macro definitions and conditionals (i.e., if/then). But we can go beyond this and achieve for example:

• Inner loops such as the while and for operations common to many programming languages.

• Multiple index variables (e.g., not just i but we can add j, k, etc.).

• Arrays with more than one dimension (e.g., Foo[i][j], Bar[i][j][k] etc.)

In all of these cases (and many others) we can implement the new feature as mere "syntactic sugar" on top of standard NAND-TM, which means that the set of functions computable by NAND-TM with this feature is the same as the set of functions computable by standard NAND-TM. Similarly, we can show that the set of functions computable by Turing Machines that have more than one tape, or whose tapes have more than one dimension, is the same as the set of functions computable by standard Turing machines.

7.4.1 "GOTO" and inner loops
We can implement more advanced looping constructs than the simple MODANDJUMP. For example, we can implement GOTO. A GOTO statement corresponds to jumping to a certain line in the execution. For example, if we have code of the form

"start": do foo
GOTO("end")
"skip": do bar
"end": do blah

then the program will only do foo and blah, since when it reaches the line GOTO("end") it will jump to the line labeled with "end". We can achieve the effect of GOTO in NAND-TM using conditionals. In the code below, we assume that we have a variable pc that can take strings of some constant length. This can be encoded using a finite number of Boolean variables pc_0, pc_1, $\ldots$, pc_$k - 1$, and so when we write below pc = "label" what we mean is something like pc_0 = 0, pc_1 = 1, $\ldots$ (where the bits $0, 1, \ldots$ correspond to the encoding of the finite string "label" as a string of length $k$). We also assume that we have access to conditional (i.e., if) statements, which we can emulate using syntactic sugar in the same way as we did in NAND-CIRC.

To emulate a GOTO statement, we will first modify a program P of the form

do foo
do bar
do blah

to have the following form (using syntactic sugar for if):

pc = "line1"
if (pc=="line1"):
    do foo
    pc = "line2"
if (pc=="line2"):
    do bar
    pc = "line3"
if (pc=="line3"):
    do blah

These two programs do the same thing. The variable pc corresponds to the "program counter" and tells the program which line to execute next. We can see that if we wanted to emulate a GOTO("line3") then we could simply modify the instruction pc = "line2" to be pc = "line3".
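The pc-driven control flow above is also runnable as ordinary Python (a sketch of our own; do_foo, do_bar, and do_blah are hypothetical stand-ins for arbitrary code blocks, and we add an explicit "done" value so the loop stops):

# Each pass of the loop executes the block selected by pc, mimicking the
# single big outer loop of a NAND-TM program.
def run_with_pc(do_foo, do_bar, do_blah):
    pc = "line1"
    while pc != "done":
        if pc == "line1":
            do_foo(); pc = "line2"
        elif pc == "line2":
            do_bar(); pc = "line3"
        elif pc == "line3":
            do_blah(); pc = "done"

run_with_pc(lambda: print("foo"), lambda: print("bar"), lambda: print("blah"))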

In NAND-CIRC we could only have GOTOs that go forward in the code, but since in NAND-TM everything is encompassed within a large outer loop, we can use the same ideas to implement GOTO’s that can go backwards, as well as conditional loops.

Other loops. Once we have GOTO, we can emulate all the standard loop constructs such as while, do .. until, or for in NAND-TM as well. For example, we can replace the code

while foo:
    do blah
do bar

with

"loop": if NOT(foo): GOTO("next")
do blah
GOTO("loop")
"next": do bar

R Remark 7.14 — GOTO's in programming languages. The GOTO statement was a staple of most early programming languages, but has largely fallen out of favor and is not included in many modern languages such as Python, Java, and Javascript. In 1968, Edsger Dijkstra wrote a famous letter titled "Go to statement considered harmful." (see also Fig. 7.10). The main trouble with GOTO is that it makes analysis of programs more difficult by making it harder to argue about invariants of the program. When a program contains a loop of the form:

for j in range(100):
    do something

do blah

you know that the line of code do blah can only be reached if the loop ended, in which case you know that j is equal to $100$, and might also be able to argue other properties of the state of the program. In contrast, if the program might jump to do blah from any other point in the code, then it's very hard for you as the programmer to know what you can rely upon in this code. As Dijkstra said, such invariants are important because "our intellectual powers are rather geared to master static relations and .. our powers to visualize processes evolving in time are relatively poorly developed"

and so "we should … do … our utmost best to shorten the conceptual gap between the static program and the dynamic process."

That said, GOTO is still a major part of lower level languages where it is used to implement higher level looping constructs such as while and for loops. For example, even though Java doesn't have a GOTO statement, the Java Bytecode (which is a lower level representation of Java) does have such a statement. Similarly, Python bytecode has instructions such as POP_JUMP_IF_TRUE that implement the GOTO functionality, and similar instructions are included in many assembly languages. The way we use GOTO to implement a higher level functionality in NAND-TM is reminiscent of the way these various jump instructions are used to implement higher level looping constructs.

7.5 UNIFORMITY, AND NAND VS NAND-TM (DISCUSSION)

While NAND-TM adds extra operations over NAND-CIRC, it is not exactly accurate to say that NAND-TM programs or Turing machines are "more powerful" than NAND-CIRC programs or Boolean circuits. NAND-CIRC programs, having no loops, are simply not applicable for computing functions with an unbounded number of inputs. Thus, to compute a function $F: \{0,1\}^* \to \{0,1\}^*$ using NAND-CIRC (or equivalently, Boolean circuits) we need a collection of programs/circuits: one for every input length.

Figure 7.10: XKCD's take on the GOTO statement.

The key difference between NAND-CIRC and NAND-TM is that NAND-TM allows us to express the fact that the algorithm for computing parities of length-$100$ strings is really the same one as the algorithm for computing parities of length-$5$ strings (or similarly the fact that the algorithm for adding $n$-bit numbers is the same for every $n$, etc.). That is, one can think of the NAND-TM program for general parity as the "seed" out of which we can grow NAND-CIRC programs for length $10$, length $100$, or length $1000$ parities as needed.

This notion of a single algorithm that can compute functions of all input lengths is known as uniformity of computation and hence we think of Turing machines / NAND-TM as a uniform model of computation, as opposed to Boolean circuits or NAND-CIRC which is a nonuniform model, where we have to specify a different program for every input length.

Looking ahead, we will see that this uniformity leads to another crucial difference between Turing machines and circuits. Turing machines can have inputs and outputs that are longer than the description of the machine as a string and in particular there exists a Turing machine that can "self replicate" in the sense that it can print its own

code. This notion of “self replication”, and the related notion of “self reference” is crucial to many aspects of computation, as well of course to life itself, whether in the form of digital or biological programs. For now, what you ought to remember is the following differences between uniform and non uniform computational models:

• Non uniform computational models: Examples are NAND-CIRC programs and Boolean circuits. These are models where each individual program/circuit can compute a finite function $f : \{0,1\}^n \rightarrow \{0,1\}^m$. We have seen that every finite function can be computed by some program/circuit. To discuss computation of an infinite function $F : \{0,1\}^* \rightarrow \{0,1\}^*$ we need to allow a sequence $\{P_n\}_{n \in \mathbb{N}}$ of programs/circuits (one for every input length), but this does not capture the notion of a single algorithm to compute the function $F$.

• Uniform computational models: Examples are Turing machines and NAND-TM programs. These are models where a single program/machine can take inputs of arbitrary length and hence compute an infinite function $F : \{0,1\}^* \rightarrow \{0,1\}^*$. The number of steps that a program/machine takes on some input is not a priori bounded in advance and in particular there is a chance that it will enter into an infinite loop. Unlike the nonuniform case, we have not shown that every infinite function can be computed by some NAND-TM program/Turing Machine. We will come back to this point in Chapter 9. (A concrete illustration of the uniform point of view follows below.)
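To make this contrast concrete, here is a small Python sketch (an illustration of ours, not part of the formal models) of a single "uniform" program: one piece of code that computes the parity of its input bits for every input length, where the nonuniform view would instead require a separate circuit for each length:

def parity(x):
    # x is a list of bits of any length; one algorithm for all lengths
    result = 0
    for bit in x:
        result = result ^ bit
    return result

print(parity([1, 0, 1]))   # length-3 input: prints 0
print(parity([1] * 101))   # length-101 input: prints 1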

Chapter Recap

• Turing machines capture the notion of a single algorithm that can evaluate functions of every input length.
• They are equivalent to NAND-TM programs, which add loops and arrays to NAND-CIRC.
• Unlike NAND-CIRC or Boolean circuits, the number of steps that a Turing machine takes on a given input is not fixed in advance. In fact, a Turing machine or a NAND-TM program can enter into an infinite loop on certain inputs, and not halt at all.

7.6 EXERCISES

Exercise 7.1 — Explicit NAND-TM programming. Produce the code of a (syntactic-sugar free) NAND-TM program $P$ that computes the (unbounded input length) Majority function $Maj : \{0,1\}^* \rightarrow \{0,1\}$ where for every $x \in \{0,1\}^*$, $Maj(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i > |x|/2$. We say "produce" rather than "write" because you do not have to write

the code of $P$ by hand, but rather can use the programming language of your choice to compute this code. ■

Exercise 7.2 — Computable functions examples. Prove that the following functions are computable. For all of these functions, you do not have to fully specify the Turing Machine or the NAND-TM program that computes the function, but rather only prove that such a machine or program exists:

1. $INC : \{0,1\}^* \rightarrow \{0,1\}^*$, which takes as input a representation of a natural number $n$ and outputs the representation of $n + 1$.

2. $ADD : \{0,1\}^* \rightarrow \{0,1\}^*$, which takes as input a representation of a pair of natural numbers $(n, m)$ and outputs the representation of $n + m$.

3. $MULT : \{0,1\}^* \rightarrow \{0,1\}^*$, which takes a representation of a pair of natural numbers $(n, m)$ and outputs the representation of $n \cdot m$.

4. $SORT : \{0,1\}^* \rightarrow \{0,1\}^*$, which takes as input the representation of a list of natural numbers $(a_0, \ldots, a_{n-1})$ and returns its sorted version $(b_0, \ldots, b_{n-1})$, such that for every $i \in [n]$ there is some $j \in [n]$ with $b_i = a_j$, and $b_0 \leq b_1 \leq \cdots \leq b_{n-1}$. ■

Exercise 7.3 — Two index NAND-TM. Define NAND-TM' to be the variant of NAND-TM where there are two index variables i and j. Arrays can be indexed by either i or j. The operation MODANDJMP takes four variables $a, b, c, d$ and uses the values of $c, d$ to decide whether to increment j, decrement j, or keep it at the same value (corresponding to $01$, $10$, and $00$ respectively). Prove that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, $F$ is computable by a NAND-TM program if and only if $F$ is computable by a NAND-TM' program. ■

Exercise 7.4 — Two tape Turing machines. Define a two tape Turing machine to be a Turing machine which has two separate tapes and two separate heads. At every step, the transition function gets as input the location of the cells in the two tapes, and can decide whether to move each head independently. Prove that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, $F$ is computable by a standard Turing Machine if and only if $F$ is computable by a two-tape Turing machine. ■

Exercise 7.5 — Two dimensional arrays. Define NAND-TM'' to be the variant of NAND-TM where, just like NAND-TM' defined in Exercise 7.3, there are two index variables i and j, but now the arrays are two dimensional and so we index an array Foo by Foo[i][j]. Prove that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, $F$ is computable by a NAND-TM program if and only if $F$ is computable by a NAND-TM'' program. ■

Exercise 7.6 — Two dimensional Turing machines. Define a two-dimensional Turing machine to be a Turing machine in which the tape is two dimensional. At every step the machine can move Up, Down, Left, Right, or Stay. Prove that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, $F$ is computable by a standard Turing Machine if and only if $F$ is computable by a two-dimensional Turing machine. ■

Exercise 7.7 Prove the following closure properties of the set R defined in Definition 7.3:

1. If $F \in$ R then the function $G(x) = 1 - F(x)$ is in R.

2. If $F, G \in$ R then the function $H(x) = F(x) \vee G(x)$ is in R.

3. If $F \in$ R then the function $F^*$ is in R, where $F^*$ is defined as follows: $F^*(x) = 1$ iff there exist some strings $w_0, \ldots, w_{k-1}$ such that $x = w_0 w_1 \cdots w_{k-1}$ and $F(w_i) = 1$ for every $i \in [k]$.

4. If $F \in$ R then the function

$$G(x) = \begin{cases} 1 & \exists_{y \in \{0,1\}^{|x|}} F(xy) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (7.3)$$

is in R.

Exercise 7.8 — Oblivious Turing Machines (challenging). Define a Turing Machine $M$ to be oblivious if its head movements are independent of its input. That is, we say that $M$ is oblivious if there exists an infinite sequence $MOVE \in \{L, R, S\}^\infty$ such that for every $x \in \{0,1\}^*$, the movements of $M$ when given input $x$ (up until the point it halts, if such point exists) are given by $MOVE_0, MOVE_1, MOVE_2, \ldots$. Prove that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, if $F$ is computable then it is computable by an oblivious Turing machine. See footnote for hint.³ ■

³ You can use the sequence R, L, R, R, L, L, R, R, R, L, L, L, …

Exercise 7.9 — Single vs multiple bit. Prove that for every $F : \{0,1\}^* \rightarrow \{0,1\}^*$, the function $F$ is computable if and only if the following function $G : \{0,1\}^* \rightarrow \{0,1\}$ is computable, where $G$ is defined as follows:

$$G(x, i, \sigma) = \begin{cases} F(x)_i & i < |F(x)|, \sigma = 0 \\ 1 & i < |F(x)|, \sigma = 1 \\ 0 & i \geq |F(x)| \end{cases}$$ ■

Exercise 7.10 — Uncomputability via counting. Recall that R is the set of all total functions from $\{0,1\}^*$ to $\{0,1\}$ that are computable by a Turing machine (see Definition 7.3). Prove that R is countable. That is, prove that there exists a one-to-one map $DtN : \mathrm{R} \rightarrow \mathbb{N}$. You can use the equivalence between Turing machines and NAND-TM programs. ■

Exercise 7.11 — Not every function is computable. Prove that the set of all total functions from $\{0,1\}^*$ to $\{0,1\}$ is not countable. You can use the results of Section 2.4. (We will see an explicit uncomputable function in Chapter 9.) ■

7.7 BIBLIOGRAPHICAL NOTES

Augusta Ada Byron, countess of Lovelace (1815-1852) lived a short but turbulent life, though she is today most well known for her collaboration with Charles Babbage (see [Ste87] for a biography). Ada took an immense interest in Babbage's analytical engine, which we mentioned in Chapter 3. In 1842-3, she translated from Italian a paper of Menabrea on the engine, adding copious notes (longer than the paper itself). The quote in the chapter's beginning is taken from Nota A in this text. Lovelace's notes contain several examples of programs for the analytical engine, and because of this she has been called "the world's first computer programmer", though it is not clear whether they were written by Lovelace or Babbage himself [Hol01]. Regardless, Ada was clearly one of very few people (perhaps the only one outside of Babbage himself) to fully appreciate how significant and revolutionary the idea of mechanizing computation truly is.

The books of Shetterly [She16] and Sobel [Sob17] discuss the history of human computers (who were female, more often than not) and their important contributions to scientific discoveries in astronomy and space exploration.

Alan Turing was one of the intellectual giants of the 20th century. He was not only the first person to define the notion of computation, but also invented and used some of the world's earliest computational devices as part of the effort to break the Enigma cipher during World War II, saving millions of lives. Tragically, Turing committed suicide in 1954, following his conviction in 1952 for homosexual acts and a court-mandated hormonal treatment. In 2009, British prime minister

Gordon Brown made an official public apology to Turing, and in 2013 Queen Elizabeth II granted Turing a posthumous pardon. Turing's life is the subject of a great book and a mediocre movie.

Sipser's text [Sip97] defines a Turing machine as a seven tuple consisting of the state space, input alphabet, tape alphabet, transition function, starting state, accepting state, and rejecting state. Superficially this looks like a very different definition than Definition 7.1 but it is simply a different representation of the same concept, just as a graph can be represented in either adjacency list or adjacency matrix form. One difference is that Sipser considers a general set of states $Q$ that is not necessarily of the form $Q = \{0, 1, 2, \ldots, k-1\}$ for some natural number $k > 0$. Sipser also restricts his attention to Turing machines that output only a single bit and therefore designates two special halting states: the "$0$ halting state" (often known as the rejecting state) and the other as the "$1$ halting state" (often known as the accepting state). Thus instead of writing $0$ or $1$ on an output tape, the machine will enter into one of these states and halt. This again makes no difference to the computational power, though we prefer to consider the more general model of multi-bit outputs. (Sipser presents the basic task of a Turing machine as that of deciding a language as opposed to computing a function, but these are equivalent, see Remark 7.4.) Sipser considers also functions with input in $\Sigma^*$ for an arbitrary alphabet $\Sigma$ (and hence distinguishes between the input alphabet, which he denotes as $\Sigma$, and the tape alphabet, which he denotes as $\Gamma$), while we restrict attention to functions with binary strings as input. Again this is not a major issue, since we can always encode an element of $\Sigma$ using a binary string of length $\lceil \log |\Sigma| \rceil$. Finally (and this is a very minor point) Sipser requires the machine to either move left or right in every step, without the Stay operation, though staying in place is very easy to emulate by simply moving right and then back left.

Another definition used in the literature is that a Turing machine $M$ recognizes a language $L$ if for every $x \in L$, $M(x) = 1$, and for every $x \notin L$, $M(x) \in \{0, \bot\}$. A language $L$ is recursively enumerable if there exists a Turing machine $M$ that recognizes it, and the set of all recursively enumerable languages is often denoted by RE. We will not use this terminology in this book.

One of the first programming-language formulations of Turing machines was given by Wang [Wan57]. Our formulation of NAND-TM is aimed at making the connection with circuits more direct, with the eventual goal of using it for the Cook-Levin Theorem, as well as results such as P $\subseteq$ P$_{/poly}$ and BPP $\subseteq$ P$_{/poly}$. The website esolangs.org features a large variety of esoteric Turing-complete programming languages. One of the most famous of them is Brainf*ck.

Learning Objectives:
• Learn about RAM machines and the λ calculus.
• Equivalence between these and other models and Turing machines.
• Cellular automata and configurations of Turing machines.
• Understand the Church-Turing thesis.

8 Equivalent models of computation

"All problems in computer science can be solved by another level of indirection", attributed to David Wheeler.

“Because we shall later compute with expressions for functions, we need a distinction between functions and forms and a notation for expressing this distinction. This distinction and a notation for describing it, from which we deviate trivially, is given by Church.”, John McCarthy, 1960 (in paper describing the LISP programming language)

So far we have defined the notion of computing a function using Turing machines, which are not a close match to the way computation is done in practice. In this chapter we justify this choice by showing that the definition of computable functions will remain the same under a wide variety of computational models. This notion is known as Turing completeness or Turing equivalence and is one of the most fundamental facts of computer science. In fact, a widely believed claim known as the Church-Turing Thesis holds that every “reasonable” definition of computable function is equivalent to being computable by a Turing machine. We discuss the Church-Turing Thesis and the potential definitions of “reasonable” in Section 8.8. Some of the main computational models we discuss in this chapter include:

• RAM Machines: Turing Machines do not correspond to standard computing architectures that have Random Access Memory (RAM). The mathematical model of RAM machines is much closer to actual computers, but we will see that it is equivalent in power to Turing Machines. We also discuss a programming language variant of RAM machines, which we call NAND-RAM. The equivalence of Turing Machines and RAM machines enables demonstrating the Turing Equivalence of many popular programming languages, including all general-purpose languages used in practice such as C, Python, JavaScript, etc.


• Cellular Automata: Many natural and artificial systems can be modeled as collections of simple components, each evolving according to simple rules based on its state and the state of its immediate neighbors. One well-known such example is Conway's Game of Life. To prove that cellular automata are equivalent to Turing machines we introduce the tool of configurations of Turing Machines. These have other applications, and in particular are used in Chapter 11 to prove Gödel's Incompleteness Theorem: a central result in mathematics.

• λ calculus: The λ calculus is a model for expressing computation that originates from the 1930's, though it is closely connected to functional programming languages widely used today. Showing the equivalence of the λ calculus to Turing Machines involves a beautiful technique to eliminate recursion known as the "Y Combinator".

This chapter: A non-mathy overview
In this chapter we study equivalence between models. Two computational models are equivalent (also known as Turing equivalent) if they can compute the same set of functions. For example, we have seen that Turing Machines and NAND-TM programs are equivalent since we can transform every Turing Machine into a NAND-TM program that computes the same function, and similarly can transform every NAND-TM program into a Turing Machine that computes the same function.
In this chapter we show this extends far beyond Turing Machines. The techniques we develop allow us to show that all general-purpose programming languages (i.e., Python, C, Java, etc.) are Turing Complete, in the sense that they can simulate Turing Machines and hence compute all functions that can be computed by a TM. We will also show the other direction: Turing Machines can be used to simulate a program in any of these languages and hence compute any function computable by them. This means that all these programming languages are Turing equivalent: they are equivalent in power to Turing Machines and to each other. This is a powerful principle, which underlies the vast reach of Computer Science. Moreover, it enables us to "have our cake and eat it too": since all these models are equivalent, we can choose the model of our convenience for the task at hand.
To achieve this equivalence, we define a new computational model known as RAM machines. RAM Machines capture the

architecture of modern computers more closely than Turing Machines, but are still computationally equivalent to Turing machines.
Finally, we will show that Turing equivalence extends far beyond traditional programming languages. We will see that cellular automata, a mathematical model of extremely simple natural systems, are also Turing equivalent, and also see the Turing equivalence of the λ calculus, a logical system for expressing functions that is the basis for functional programming languages such as Lisp, OCaml, and more.
See Fig. 8.1 for an overview of the results of this chapter.

Figure 8.1: Some Turing-equivalent models. All of these are equivalent in power to Turing Machines (or equivalently NAND-TM programs) in the sense that they can compute exactly the same class of functions. All of these are models for computing infinite functions that take inputs of unbounded length. In contrast, Boolean circuits / NAND-CIRC programs can only compute finite functions and hence are not Turing complete.

8.1 RAM MACHINES AND NAND-RAM

One of the limitations of Turing Machines (and NAND-TM programs) is that we can only access one location of our arrays/tape at a time. If the head is at position 22 in the tape and we want to access the 957-th position then it will take us at least 923 steps to get there. In contrast, almost every programming language has a formalism for directly accessing memory locations. Actual physical computers also provide so called Random Access Memory (RAM) which can be thought of as a large array Memory, such that given an index $p$ (i.e., a number, or a pointer), we can read from and write to the $p$-th location of Memory. ("Random access memory" is quite a misnomer since it has nothing to do with probability, but since it is a standard term in both the theory and practice of computing, we will use it as well.)

The computational model that models access to such a memory is the RAM machine (sometimes also known as the Word RAM model), as depicted in Fig. 8.2. The memory of a RAM machine is an array of unbounded size where each cell can store a single word, which we think of as a string in $\{0,1\}^w$ and also (equivalently) as a number in $[2^w]$. For example, many modern computing architectures

use 64 bit words, in which every memory location holds a string in $\{0,1\}^{64}$ which can also be thought of as a number between $0$ and $2^{64} - 1 = 18,446,744,073,709,551,615$. The parameter $w$ is known as the word size. In practice often $w$ is a fixed number such as 64, but when doing theory we model $w$ as a parameter that can depend on the input length or number of steps. (You can think of $2^w$ as roughly corresponding to the largest memory address that we use in the computation.) In addition to the memory array, a RAM machine also contains a constant number of registers $r_0, \ldots, r_{k-1}$, each of which can also contain a single word.

The operations a RAM machine can carry out include:

• Data movement: Load data from a certain cell in memory into a register or store the contents of a register into a certain cell of memory. A RAM machine can directly access any cell of memory without having to move the "head" (as Turing machines do) to that location. That is, in one step a RAM machine can load into register $r_i$ the contents of the memory cell indexed by register $r_j$, or store into the memory cell indexed by register $r_j$ the contents of register $r_i$.

Figure 8.2: A RAM Machine contains a finite number of local registers, each of which holds an integer, and an unbounded memory array. It can perform arithmetic operations on its registers as well as load to a register $r$ the contents of the memory at the address indexed by the number in register $r'$.

• Computation: RAM machines can carry out computation on registers such as arithmetic operations, logical operations, and comparisons.

• Control flow: As in the case of Turing machines, the choice of what instruction to perform next can depend on the state of the RAM machine, which is captured by the contents of its registers. (A toy illustration of these three kinds of operations follows below.)
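To make the three kinds of operations concrete, here is a minimal Python sketch of a toy RAM machine. The instruction set (LOAD, STORE, ADD, JNZ, HALT) is hypothetical, chosen only for illustration; it is not a formal definition (see Section 8.10 for sources for those):

def run(program, registers):
    memory = {}   # unbounded memory; all cells default to 0
    pc = 0        # program counter
    while True:
        op, *args = program[pc]
        if op == "LOAD":     # data movement: r[i] <- Memory[r[j]]
            i, j = args
            registers[i] = memory.get(registers[j], 0)
        elif op == "STORE":  # data movement: Memory[r[j]] <- r[i]
            i, j = args
            memory[registers[j]] = registers[i]
        elif op == "ADD":    # computation: r[i] <- r[i] + r[j]
            i, j = args
            registers[i] += registers[j]
        elif op == "JNZ":    # control flow: jump to target if r[i] != 0
            i, target = args
            if registers[i] != 0:
                pc = target
                continue
        elif op == "HALT":
            return registers, memory
        pc += 1

regs, mem = run([("ADD", 0, 1), ("HALT",)], [2, 3])
print(regs[0])  # prints 5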

We will not give a formal definition of RAM Machines, though the bibliographical notes section (Section 8.10) contains sources for such definitions. Just as the NAND-TM programming language models Turing machines, we can also define a NAND-RAM programming language that models RAM machines. The NAND-RAM programming language extends NAND-TM by adding the following features:

• The variables of NAND-RAM are allowed to be (non negative) integer valued rather than only Boolean as is the case in NAND-TM. That is, a scalar variable foo holds a non negative integer in $\mathbb{N}$ (rather than only a bit in $\{0,1\}$), and an array variable Bar holds an array of integers. As in the case of RAM machines, we will not allow integers of unbounded size. Concretely, each variable holds a number between $0$ and $T - 1$, where $T$ is the number of steps that have been executed by the program so far. (You can ignore this restriction for now: if we want to hold larger numbers, we can simply execute dummy instructions; it will be useful in later chapters.)

Figure 8.3: Different aspects of RAM machines and Turing machines. RAM machines can store integers in their local registers, and can read and write to their memory at a location specified by a register. In contrast, Turing machines can only access their memory in the head location, which moves at most one position to the right or left in each step.

• We allow indexed access to arrays. If foo is a scalar and Bar is an array, then Bar[foo] refers to the location of Bar indexed by the value of foo. (Note that this means we don’t need to have a special index variable i anymore.)

• As is often the case in programming languages, we will assume that for Boolean operations such as NAND, a zero valued integer is considered as false, and a nonzero valued integer is considered as true.

• In addition to NAND, NAND-RAM also includes all the basic arithmetic operations of addition, subtraction, multiplication, (integer) division, as well as comparisons (equal, greater than, less than, etc.).

• NAND-RAM includes conditional statements if/then as part of the language.

• NAND-RAM contains looping constructs such as while and do as part of the language.

A full description of the NAND-RAM programming language is in the appendix. However, the most important fact you need to know about NAND-RAM is that you actually don’t need to know much about NAND-RAM at all, since it is equivalent in power to Turing machines:

Theorem 8.1 — Turing Machines (aka NAND-TM programs) and RAM machines (aka NAND-RAM programs) are equivalent. For every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, $F$ is computable by a NAND-TM program if and only if $F$ is computable by a NAND-RAM program.

Since NAND-TM programs are equivalent to Turing machines, and NAND-RAM programs are equivalent to RAM machines, Theorem 8.1 shows that all these four models are equivalent to one another.

Proof Idea:

Clearly NAND-RAM is only more powerful than NAND-TM, and so if a function $F$ is computable by a NAND-TM program then it can be computed by a NAND-RAM program. The challenging direction is to transform a NAND-RAM program $P$ to an equivalent NAND-TM program $Q$. To describe the proof in full we will need to cover the full formal specification of the NAND-RAM language, and show how we can implement every one of its features as syntactic sugar on top of NAND-TM.

Figure 8.4: Overview of the steps in the proof of Theorem 8.1 simulating NANDRAM with NANDTM. We first use the inner loop syntactic sugar of Section 7.4.1 to enable loading an integer from an array to the index variable i of NANDTM. Once we can do that, we can simulate indexed access in NANDTM. We then use an embedding of $\mathbb{N}^2$ in $\mathbb{N}$ to simulate two dimensional bit arrays in NANDTM. Finally, we use the binary representation to encode one-dimensional arrays of integers as two dimensional arrays of bits, hence completing the simulation of NANDRAM with NANDTM.

This can be done but going over all the operations in detail is rather tedious. Hence we will focus on describing the main ideas behind this

transformation. (See also Fig. 8.4.) NAND-RAM generalizes NAND-TM in two main ways: (a) adding indexed access to the arrays (i.e., Foo[bar] syntax) and (b) moving from Boolean valued variables to integer valued ones. The transformation has the following steps:

1. Indexed access of bit arrays: We start by showing how to handle (a). Namely, we show how we can implement in NAND-TM the operation Setindex(Bar) such that if Bar is an array that encodes some integer $j$, then after executing Setindex(Bar) the value of i will equal $j$. This will allow us to simulate syntax of the form Foo[Bar] by Setindex(Bar) followed by Foo[i].

2. Two dimensional bit arrays: We then show how we can use "syntactic sugar" to augment NAND-TM with two dimensional arrays. That is, have two indices i and j and two dimensional arrays, such that we can use the syntax Foo[i][j] to access the (i,j)-th location of Foo.

3. Arrays of integers: Finally we will encode a one dimensional array Arr of integers by a two dimensional array Arrbin of bits. The idea is simple: if $a_{i,0}, \ldots, a_{i,\ell}$ is a binary (prefix-free) representation of Arr[$i$], then Arrbin[$i$][$j$] will be equal to $a_{i,j}$. (A small sketch of one such encoding follows below.)

Once we have arrays of integers, we can use our usual syntactic sugar for functions, GOTO etc. to implement the arithmetic and control flow operations of NAND-RAM.
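Here is a small Python sketch of step 3's encoding, using one particular prefix-free convention for illustration (each bit is repeated twice and the pair 1,0 marks the end of the number, in the spirit of Section 2.5.2; the function names are ours):

def encode_int(n):
    # encode n by doubling each bit of its binary representation
    # and appending the end marker 1,0
    out = []
    for b in bin(n)[2:]:
        out += [int(b), int(b)]
    return out + [1, 0]

def encode_array(arr):
    # Arrbin[i][j] holds the j-th bit of the encoding of arr[i]
    return [encode_int(n) for n in arr]

print(encode_array([2, 5]))  # [[1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 1, 1, 0]]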

The above approach is not the only way to obtain a proof of Theorem 8.1; see for example Exercise 8.1. ⋆

R Remark 8.2 — RAM machines / NAND-RAM and assembly languages (optional). RAM machines correspond quite closely to actual microprocessors, which also contain a large primary memory and a constant number of small registers. This is of course no accident: RAM machines aim at modeling more closely than Turing machines the architecture of actual computing systems, which largely follows the so called von Neumann architecture as described in the report [Neu45]. As a result, NAND-RAM is similar in its general outline to assembly languages such as x86 or MIPS. These assembly languages all have instructions to (1) move data from registers to memory, (2) perform arithmetic or logical computations on registers, and (3) conditional execution and loops ("if" and "goto", commonly known as "branches" and "jumps" in the context of assembly languages).
The main difference between RAM machines and actual microprocessors (and correspondingly between

NAND-RAM and assembly languages) is that actual microprocessors have a fixed word size $w$ so that all registers and memory cells hold numbers in $[2^w]$ (or equivalently strings in $\{0,1\}^w$). This number can vary among different processors, but common values are either 32 or 64. As a theoretical model, RAM machines do not have this limitation, but we rather let $w$ be the logarithm of our running time (which roughly corresponds to its value in practice as well). Actual microprocessors also have a fixed number of registers (e.g., 14 general purpose registers in x86-64) but this does not make a big difference with RAM machines. It can be shown that RAM machines with as few as two registers are as powerful as full-fledged RAM machines that have an arbitrarily large constant number of registers. Of course actual microprocessors have many features not shared with RAM machines as well, including parallelism, memory hierarchies, and many others. However, RAM machines do capture actual computers to a first approximation and so (as we will see), the running time of an algorithm on a RAM machine (e.g., $O(n)$ vs $O(n^2)$) is strongly correlated with its practical efficiency.

8.2 THE GORY DETAILS (OPTIONAL)

We do not show the full formal proof of Theorem 8.1 but focus on the most important parts: implementing indexed access, and simulating two dimensional arrays with one dimensional ones. Even these are already quite tedious to describe, as will not be surprising to anyone that has ever written a compiler. Hence you can feel free to merely skim this section. The important point is not for you to know all details by heart but to be convinced that in principle it is possible to transform a NAND-RAM program to an equivalent NAND-TM program, and even be convinced that, with sufficient time and effort, you could do it if you wanted to.

8.2.1 Indexed access in NAND-TM
In NAND-TM we can only access our arrays in the position of the index variable i, while NAND-RAM has integer-valued variables and can use them for indexed access to arrays, of the form Foo[bar]. To implement indexed access in NAND-TM, we will encode integers in our arrays using some prefix-free representation (see Section 2.5.2), and then have a procedure Setindex(Bar) that sets i to the value encoded by Bar. We can simulate the effect of Foo[Bar] using Setindex(Bar) followed by Foo[i]. Implementing Setindex(Bar) can be achieved as follows:

1. We initialize an array Atzero such that Atzero[0] $= 1$ and Atzero[$j$] $= 0$ for all $j > 0$. (This can be easily done in NAND-TM as all uninitialized variables default to zero.)

2. Set i to zero, by decrementing it until we reach the point where Atzero[i] $= 1$.

3. Let Temp be an array encoding the number $0$.

4. We use GOTO to simulate an inner loop of the form: while Temp $\neq$ Bar, increment Temp.

5. At the end of the loop, i is equal to the value encoded by Bar.

In NAND-TM code (using some syntactic sugar), we can implement the above operations as follows:

# assume Atzero is an array such that Atzero[0]=1
# and Atzero[j]=0 for all j>0

# set i to 0.
LABEL("zero_idx")
dir0 = zero
dir1 = one # corresponds to i <- i-1
GOTO("zero_idx",NOT(Atzero[i]))
...
# zero out temp
# (code below assumes a specific prefix-free encoding in which 10 is the "end marker")
Temp[0] = 1
Temp[1] = 0
# set i to Bar, assume we know how to increment, compare
LABEL("increment_temp")
cond = EQUAL(Temp,Bar)
dir0 = one
dir1 = one # corresponds to i <- i+1
INC(Temp)
GOTO("increment_temp",cond)
# if we reach this point, i is number encoded by Bar
...
# final instruction of program
MODANDJUMP(dir0,dir1)

8.2.2 Two dimensional arrays in NAND-TM
To implement two dimensional arrays, we want to embed them in a one dimensional array. The idea is that we come up with a one to one

function $embed : \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}$, and so embed the location $(i,j)$ of the two dimensional array Two in the location $embed(i,j)$ of the array One.
Since the set $\mathbb{N} \times \mathbb{N}$ seems "much bigger" than the set $\mathbb{N}$, a priori it might not be clear that such a one to one mapping exists. However, once you think about it more, it is not that hard to construct. For example, you could ask a child to use scissors and glue to transform a 10" by 10" piece of paper into a 1" by 100" strip. This is essentially a one to one map from $[10] \times [10]$ to $[100]$. We can generalize this to obtain a one to one map from $[n] \times [n]$ to $[n^2]$ and more generally a one to one map from $\mathbb{N} \times \mathbb{N}$ to $\mathbb{N}$. Specifically, the following map $embed$ would do (see Fig. 8.5):

$$embed(x, y) = \tfrac{1}{2}(x + y)(x + y + 1) + x \qquad (8.1)$$

Exercise 8.3 asks you to prove that $embed$ is indeed one to one, as well as computable by a NAND-TM program. (The latter can be done by simply following the grade-school algorithms for multiplication, addition, and division.)
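Eq. (8.1) is easy to experiment with directly; here is a Python transcription together with a small sanity check (an illustration only; the NAND-TM implementation is the subject of Exercise 8.3):

def embed(x, y):
    # the map of Eq. (8.1); (x+y)(x+y+1) is always even,
    # so the integer division by 2 is exact
    return (x + y) * (x + y + 1) // 2 + x

# no two pairs in [10] x [10] map to the same number
values = {embed(x, y) for x in range(10) for y in range(10)}
assert len(values) == 100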

This means that we can replace code of the form Two[Foo][Bar] = something (i.e., access the two dimensional array Two at the integers encoded by the one dimensional arrays Foo and Bar) by code of the form:

Blah = embed(Foo,Bar)
Setindex(Blah)
Two[i] = something

Figure 8.5: Illustration of the map $embed(x,y) = \tfrac{1}{2}(x+y)(x+y+1) + x$ for $x, y \in [10]$; one can see that for every distinct pairs $(x,y)$ and $(x',y')$, $embed(x,y) \neq embed(x',y')$.

8.2.3 All the rest
Once we have two dimensional arrays and indexed access, simulating NAND-RAM with NAND-TM is just a matter of implementing the standard algorithms for arithmetic operations and comparisons in NAND-TM. While this is cumbersome, it is not difficult, and the end result is to show that every NAND-RAM program $P$ can be simulated by an equivalent NAND-TM program $Q$, thus completing the proof of Theorem 8.1.

R Remark 8.3 — Recursion in NAND-RAM (advanced). One concept that appears in many programming languages but we did not include in NAND-RAM programs is recursion. However, recursion (and function calls in general) can be implemented in NAND-RAM using the stack data structure. A stack is a data structure containing a sequence of elements, where we can "push" elements into it and "pop" them from it in "first in last out" order. We can implement a stack using an array of integers Stack and a scalar variable stackpointer that will

be the number of items in the stack. We implement push(foo) by

Stack[stackpointer] = foo
stackpointer += one

and implement bar = pop() by

stackpointer -= one
bar = Stack[stackpointer]

We implement a function call to $F$ by pushing the arguments for $F$ into the stack. The code of $F$ will "pop" the arguments from the stack, perform the computation (which might involve making recursive or non recursive calls) and then "push" its return value into the stack. Because of the "first in last out" nature of a stack, we do not return control to the calling procedure until all the recursive calls are done.
The fact that we can implement recursion using a non-recursive language is not surprising. Indeed, machine languages typically do not have recursion (or function calls in general), and hence a compiler implements function calls using a stack and GOTO. You can find online tutorials on how recursion is implemented via a stack in your favorite programming language, whether it's Python, JavaScript, or Lisp/Scheme.
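As a small illustration of the remark above, here is a Python sketch of ours that computes the factorial function with an explicit stack and a loop rather than recursive calls, mirroring (at a very high level) what a compiler does with a call stack:

def factorial_no_recursion(n):
    stack = []
    # "push" all the pending arguments, just as nested recursive
    # calls fact(n), fact(n-1), ... would
    while n > 1:
        stack.append(n)
        n -= 1
    result = 1
    # "pop" them in first-in-last-out order, combining return values
    while stack:
        result *= stack.pop()
    return result

print(factorial_no_recursion(5))  # prints 120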

8.3 TURING EQUIVALENCE (DISCUSSION)

Figure 8.6: A punched card corresponding to a Fortran statement.

All of the standard programming languages, such as C, Java, Python, and Pascal, have very similar operations to NAND-RAM. (Indeed, ultimately they can all be executed by machines which have a fixed number of registers and a large memory array.) Hence using Theorem 8.1, we can simulate any program in such a programming language by a NAND-TM program. In the other direction, it is a fairly easy programming exercise to write an interpreter for NAND-TM in any of the above programming languages. Hence we can also simulate NAND-TM programs (and so by Theorem 7.11, Turing machines) using these programming languages. This property of being equivalent in power to Turing Machines / NAND-TM is called Turing Equivalent

(or sometimes Turing Complete). Thus all programming languages we are familiar with are Turing equivalent.¹

¹ Some programming languages have fixed (even if extremely large) bounds on the amount of memory they can access, which formally prevent them from being applicable to computing infinite functions and hence simulating Turing machines. We ignore such issues in this discussion and assume access to some storage device without a fixed upper bound on its capacity.

8.3.1 The "Best of both worlds" paradigm
The equivalence between Turing Machines and RAM machines allows us to choose the most convenient language for the task at hand:

• When we want to prove a theorem about all programs/algorithms, we can use Turing machines (or NAND-TM) since they are simpler and easier to analyze. In particular, if we want to show that a certain function can not be computed, then we will use Turing machines.

• When we want to show that a function can be computed we can use RAM machines or NAND-RAM, because they are easier to program in and correspond more closely to high level programming languages we are used to. In fact, we will often describe NAND-RAM programs in an informal manner, trusting that the reader can fill in the details and translate the high level description to the precise program. (This is just like the way people typically use informal or "pseudocode" descriptions of algorithms, trusting that their audience will know to translate these descriptions to code if needed.)

Our usage of Turing Machines / NAND-TM and RAM Machines / NAND-RAM is very similar to the way people use high and low level programming languages in practice. When one wants to produce a device that executes programs, it is convenient to do so for a very simple and "low level" programming language. When one wants to describe an algorithm, it is convenient to use as high level a formalism as possible.

 Big Idea 10 Using equivalence results such as those between Turing and RAM machines, we can "have our cake and eat it too". We can use a simpler model such as Turing machines when we want to prove something can't be done, and use a feature-rich model such as RAM machines when we want to prove something can be done.

Figure 8.7: By having the two equivalent languages NAND-TM and NAND-RAM, we can "have our cake and eat it too", using NAND-TM when we want to prove that programs can't do something, and using NAND-RAM or other high level languages when we want to prove that programs can do something.

8.3.2 Let's talk about abstractions
"The programmer is in the unique position that … he has to be able to think in terms of conceptual hierarchies that are much deeper than a single mind ever needed to face before.", Edsger Dijkstra, "On the cruelty of really teaching computing science", 1988.

At some point in any theory of computation course, the instructor and students need to have the talk. That is, we need to discuss the level of abstraction in describing algorithms. In algorithms courses, one typically describes algorithms in English, assuming readers can "fill in the details" and would be able to convert such an algorithm into an implementation if needed. For example, Algorithm 8.4 is a high level description of the breadth first search algorithm.

Algorithm 8.4 — Breadth First Search.
Input: Graph $G$, vertices $u, v$
Output: "connected" when $u$ is connected to $v$ in $G$, "disconnected" otherwise
1: Initialize empty queue $Q$.
2: Put $u$ in $Q$
3: while $Q$ is not empty do
4:   Remove top vertex $w$ from $Q$
5:   if $w = v$ then return "connected"
6:   end if
7:   Mark $w$
8:   Add all unmarked neighbors of $w$ to $Q$.
9: end while
10: return "disconnected"

If we wanted to give more details on how to implement breadth first search in a programming language such as Python or C (or NAND-RAM / NAND-TM for that matter), we would describe how we implement the queue data structure using an array, and similarly how we would use arrays to mark vertices. We call such an "intermediate level" description an implementation level or pseudocode description. Finally, if we want to describe the implementation precisely, we would give the full code of the program (or another fully precise representation, such as in the form of a list of tuples). We call this a formal or low level description.
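For instance, an implementation level (Python) version of Algorithm 8.4 might look as follows; the input format (a dictionary of adjacency lists) and the variable names are our own illustrative choices:

def connected(adj, u, v):
    # adj maps each vertex to the list of its neighbors
    queue = [u]            # the queue Q, initialized with u
    marked = set()
    while queue:
        w = queue.pop(0)   # remove the top vertex from Q
        if w == v:
            return "connected"
        if w not in marked:
            marked.add(w)
            # add all unmarked neighbors of w to Q
            queue.extend(n for n in adj[w] if n not in marked)
    return "disconnected"

print(connected({0: [1], 1: [0, 2], 2: [1], 3: []}, 0, 2))  # "connected"
print(connected({0: [1], 1: [0, 2], 2: [1], 3: []}, 0, 3))  # "disconnected"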

Figure 8.8: We can describe an algorithm at different levels of abstraction/detail and precision. At the highest level we just write the idea in words, omitting all details on representation and implementation. In the intermediate level (also known as implementation or pseudocode) we give enough details of the implementation that would allow someone to derive it, though we still fall short of providing the full code. The lowest level is where the actual code or mathematical description is fully spelled out. These different levels of detail all have their uses, and moving between them is one of the most important skills for a computer scientist.

While we started off by describing NAND-CIRC, NAND-TM, and NAND-RAM programs at the full formal level, as we progress in this

book we will move to implementation and high level descriptions. After all, our goal is not to use these models for actual computation, but rather to analyze the general phenomenon of computation. That said, if you don't understand how the high level description translates to an actual implementation, going "down to the metal" is often an excellent exercise. One of the most important skills for a computer scientist is the ability to move up and down hierarchies of abstractions.

A similar distinction applies to the notion of representation of objects as strings. Sometimes, to be precise, we give a low level specification of exactly how an object maps into a binary string. For example, we might describe an encoding of $n$ vertex graphs as length $n^2$ binary strings, by saying that we map a graph $G$ over the vertices $[n]$ to a string $x \in \{0,1\}^{n^2}$ such that the $n \cdot i + j$-th coordinate of $x$ is $1$ if and only if the edge $\overrightarrow{i\,j}$ is present in $G$ (see the sketch at the end of this subsection). We can also use an intermediate or implementation level description, by simply saying that we represent a graph using the adjacency matrix representation.

Finally, because translating between the various representations of graphs (and objects in general) can be done via a NAND-RAM (and hence a NAND-TM) program, when talking at a high level we also suppress discussion of representation altogether. For example, the fact that graph connectivity is a computable function is true regardless of whether we represent graphs as adjacency lists, adjacency matrices, lists of edge-pairs, and so on and so forth. Hence, in cases where the precise representation doesn't make a difference, we would often talk about our algorithms as taking as input an object $X$ (that can be a graph, a vector, a program, etc.) without specifying how $X$ is encoded as a string.

Defining "Algorithms". Up until now we have used the term "algorithm" informally. However, Turing Machines and the range of equivalent models yield a way to precisely and formally define algorithms. Hence whenever we refer to an algorithm in this book, we will mean that it is an instance of one of the Turing equivalent models, such as Turing machines, NAND-TM, RAM machines, etc. Because of the equivalence of all these models, in many contexts, it will not matter which of these we use.
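As promised above, here is the low level adjacency-matrix encoding written out in a few lines of Python (the function name and input format are our own illustrative choices):

def encode_graph(n, edges):
    # map a graph over vertices [n] to x in {0,1}^(n*n), where the
    # (n*i + j)-th coordinate of x is 1 iff the edge i->j is present
    x = ["0"] * (n * n)
    for (i, j) in edges:
        x[n * i + j] = "1"
    return "".join(x)

print(encode_graph(3, [(0, 1), (1, 2)]))  # prints "010001000"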

8.3.3 Turing completeness and equivalence, a formal definition (optional)
A computational model is some way to define what it means for a program (which is represented by a string) to compute a (partial) function. A computational model $\mathcal{M}$ is Turing complete if we can map every Turing machine (or equivalently NAND-TM program) $N$ into a program $P$ for $\mathcal{M}$ that computes the same function as $N$. It is Turing equivalent if the other direction holds as well (i.e., we can map every program $Q$ in $\mathcal{M}$ to a Turing machine that computes the same function).


We can define this notion formally as follows. (This formal definition is not crucial for the remainder of this book so feel free to skip it as long as you understand the general concept of Turing equivalence; this notion is sometimes referred to in the literature as Gödel numbering or admissible numbering.)

Definition 8.5 — Turing completeness and equivalence (optional). Let $\mathcal{F}$ be the set of all partial functions from $\{0,1\}^*$ to $\{0,1\}^*$. A computational model is a map $\mathcal{M} : \{0,1\}^* \rightarrow \mathcal{F}$.
We say that a program $P \in \{0,1\}^*$ $\mathcal{M}$-computes a function $F \in \mathcal{F}$ if $\mathcal{M}(P) = F$.
A computational model $\mathcal{M}$ is Turing complete if there is a computable map $ENCODE_{\mathcal{M}} : \{0,1\}^* \rightarrow \{0,1\}^*$ such that for every Turing machine $N$ (represented as a string), $\mathcal{M}(ENCODE_{\mathcal{M}}(N))$ is equal to the partial function computed by $N$.
A computational model $\mathcal{M}$ is Turing equivalent if it is Turing complete and there exists a computable map $DECODE_{\mathcal{M}} : \{0,1\}^* \rightarrow \{0,1\}^*$ such that for every string $P \in \{0,1\}^*$, $N = DECODE_{\mathcal{M}}(P)$ is a string representation of a Turing machine that computes the function $\mathcal{M}(P)$.

Some examples of Turing equivalent models (some of which we have already seen, and some are discussed below) include:

• Turing machines • NAND-TM programs • NAND-RAM programs • λ calculus • Game of life (mapping programs and inputs/outputs to starting and ending configurations) • Programming languages such as Python/C/Javascript/OCaml… (allowing for unbounded storage)

8.4 CELLULAR AUTOMATA

Many physical systems can be described as consisting of a large number of elementary components that interact with one another. One way to model such systems is using cellular automata. This is a system that consists of a large (or even infinite) number of cells. Each cell only has a constant number of possible states. At each time step, a cell updates to a new state by applying some simple rule to the state of itself and its neighbors.
A canonical example of a cellular automaton is Conway's Game of Life. In this automaton the cells are arranged in an infinite two dimensional grid. Each cell has only two states: "dead" (which we can

Figure 8.9: Rules for Conway’s Game of Life. Image from this blog post.

encode as $0$ and identify with $\varnothing$) or "alive" (which we can encode as $1$). The next state of a cell depends on its previous state and the states of its 8 vertical, horizontal and diagonal neighbors (see Fig. 8.9). A dead cell becomes alive only if exactly three of its neighbors are alive. A live cell continues to live if it has two or three live neighbors. Even though the number of cells is potentially infinite, we can encode the state using a finite-length string by only keeping track of the live cells. If we initialize the system in a configuration with a finite number of live cells, then the number of live cells will stay finite in all future steps. The Wikipedia page for the Game of Life contains some beautiful figures and animations of configurations that produce very interesting evolutions.
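The rules of Fig. 8.9 can be implemented in a few lines; this Python sketch of ours keeps only the (finite) set of live cells, as in the paragraph above:

from itertools import product

def life_step(live):
    # live is a set of (i, j) coordinates of the live cells
    counts = {}
    for (i, j) in live:
        for di, dj in product([-1, 0, 1], repeat=2):
            if (di, dj) != (0, 0):
                cell = (i + di, j + dj)
                counts[cell] = counts.get(cell, 0) + 1
    # a cell is alive in the next step if it has exactly 3 live
    # neighbors, or if it is currently alive and has exactly 2
    return {c for c, k in counts.items()
            if k == 3 or (k == 2 and c in live)}

blinker = {(0, -1), (0, 0), (0, 1)}
print(life_step(blinker))  # the "blinker" rotates: {(-1, 0), (0, 0), (1, 0)}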

Figure 8.10: In a two dimensional cellular automaton every cell is in position $(i, j)$ for some integers $i, j \in \mathbb{Z}$. The state of a cell is some value $A_{i,j} \in \Sigma$ for some finite alphabet $\Sigma$. At a given time step, the state of the cell is adjusted according to some function applied to the state of $(i, j)$ and all its neighbors $(i \pm 1, j \pm 1)$. In a one dimensional cellular automaton every cell is in position $i \in \mathbb{Z}$ and the state $A_i$ of $i$ at the next time step depends on its current state and the state of its two neighbors $i - 1$ and $i + 1$.

Since the cells in the game of life are arranged in an infinite two-dimensional grid, it is an example of a two dimensional cellular automaton. We can also consider the even simpler setting of a one dimensional cellular automaton, where the cells are arranged in an infinite line, see Fig. 8.10. It turns out that even this simple model is enough to achieve

Turing-completeness. We will now formally define one-dimensional cellular automata and then prove their Turing completeness.

Definition 8.6 — One dimensional cellular automata. Let $\Sigma$ be a finite set containing the symbol $\varnothing$. A one dimensional cellular automaton over alphabet $\Sigma$ is described by a transition rule $r : \Sigma^3 \rightarrow \Sigma$, which satisfies $r(\varnothing, \varnothing, \varnothing) = \varnothing$.
A configuration of the automaton $r$ is a function $A : \mathbb{Z} \rightarrow \Sigma$. If an automaton with rule $r$ is in configuration $A$, then its next configuration, denoted by $A' = NEXT_r(A)$, is the function $A'$ such that $A'(i) = r(A(i-1), A(i), A(i+1))$ for every $i \in \mathbb{Z}$. In other words, the next state of the automaton $r$ at point $i$ is obtained by applying the rule $r$ to the values of $A$ at $i$ and its two neighbors.

Finite configuration. We say that a configuration of an automaton $r$ is finite if there is only some finite number of indices $i_0, \ldots, i_{j-1}$ in $\mathbb{Z}$ such that $A(i) \neq \varnothing$. (That is, for every $i \notin \{i_0, \ldots, i_{j-1}\}$, $A(i) = \varnothing$.) Such a configuration can be represented using a finite string that encodes the indices $i_0, \ldots, i_{j-1}$ and the values $A(i_0), \ldots, A(i_{j-1})$. Since $r(\varnothing, \varnothing, \varnothing) = \varnothing$, if $A$ is a finite configuration then $NEXT_r(A)$ is finite as well. We will only be interested in studying cellular automata that are initialized in finite configurations, and hence remain in a finite configuration throughout their evolution.
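To make Definition 8.6 concrete, here is a Python sketch of $NEXT_r$ on finite configurations, using Wolfram's "Rule 110" automaton (the one in Fig. 8.13) as the transition rule; representing configurations as dictionaries, with $\varnothing$ encoded as 0, is our own choice:

def rule110(left, center, right):
    # the new state is bit number (4*left + 2*center + right) of the
    # binary expansion of 110; note rule110(0, 0, 0) == 0, so the
    # blank symbol is mapped to itself as Definition 8.6 requires
    return (110 >> (4 * left + 2 * center + right)) & 1

def next_config(A, r, lo, hi):
    # one application of NEXT_r; A maps positions to states, every
    # position outside A is blank (0), and [lo, hi] bounds the
    # non-blank part of the configuration
    get = lambda i: A.get(i, 0)
    return {i: r(get(i - 1), get(i), get(i + 1))
            for i in range(lo - 1, hi + 2)}

A = {0: 1}  # a single "live" cell, as in the top row of Fig. 8.13
A = next_config(A, rule110, 0, 0)
print({i: s for i, s in A.items() if s == 1})  # {-1: 1, 0: 1}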

8.4.1 One dimensional cellular automata are Turing complete
We can write a program (for example using NAND-RAM) that simulates the evolution of any cellular automaton from an initial finite configuration by simply storing the values of the cells with state not equal to $\varnothing$ and repeatedly applying the rule $r$. Hence cellular automata can be simulated by Turing Machines. What is more surprising is that the other direction holds as well. For example, as simple as its rules seem, we can simulate a Turing machine using the game of life (see Fig. 8.11). In fact, even one dimensional cellular automata can be Turing complete:

Theorem 8.7 — One dimensional automata are Turing complete. For every Turing machine $M$, there is a one dimensional cellular automaton that can simulate $M$ on every input $x$.

Figure 8.11: A Game-of-Life configuration simulating a Turing Machine. Figure by Paul Rendell.

To make the notion of "simulating a Turing machine" more precise we will need to define configurations of Turing machines. We will do so in Section 8.4.2 below, but at a high level a configuration of a Turing machine is a string that encodes its full state at a given step in

its computation. That is, the contents of all (non empty) cells of its tape, its current state, as well as the head position.
The key idea in the proof of Theorem 8.7 is that at every point in the computation of a Turing machine $M$, the only cell in $M$'s tape that can change is the one where the head is located, and the value this cell changes to is a function of its current state and the finite state of $M$. This observation allows us to encode the configuration of a Turing machine $M$ as a finite configuration of a cellular automaton $r$, and ensure that a one-step evolution of this encoded configuration under the rules of $r$ corresponds to one step in the execution of the Turing machine $M$.

8.4.2 Configurations of Turing machines and the next-step function
To turn the above ideas into a rigorous proof (and even statement!) of Theorem 8.7 we will need to precisely define the notion of configurations of Turing machines. This notion will be useful for us in later chapters as well.

Figure 8.12: A configuration of a Turing machine $M$ with alphabet $\Sigma$ and state space $[k]$ encodes the state of $M$ at a particular step in its execution as a string $\alpha$ over the alphabet $\overline{\Sigma} = \Sigma \times (\{\cdot\} \cup [k])$. The string is of length $t$ where $t$ is such that $M$'s tape contains $\varnothing$ in all positions $t$ and larger and $M$'s head is in a position smaller than $t$. If $M$'s head is in the $i$-th position, then for $j \neq i$, $\alpha_j$ encodes the value of the $j$-th cell of $M$'s tape, while $\alpha_i$ encodes both this value as well as the current state of $M$. If the machine writes the value $\tau$, changes state to $t$, and moves right, then in the next configuration the string will contain at position $i$ the value $(\tau, \cdot)$ and at position $i+1$ the value $(\alpha_{i+1}, t)$.

Definition 8.8 — Configuration of Turing Machines. Let $M$ be a Turing machine with tape alphabet $\Sigma$ and state space $[k]$. A configuration of $M$ is a string $\alpha \in \overline{\Sigma}^*$ where $\overline{\Sigma} = \Sigma \times (\{\cdot\} \cup [k])$ that satisfies that there is exactly one coordinate $i$ for which $\alpha_i = (\sigma, s)$ for some $\sigma \in \Sigma$ and $s \in [k]$. For all other coordinates $j$, $\alpha_j = (\sigma', \cdot)$ for some $\sigma' \in \Sigma$.
A configuration $\alpha \in \overline{\Sigma}^*$ of $M$ corresponds to the following state of its execution:

since is a pair of an alphabet symbol and either a state in or the symbol , is the first component of this pair.) 푗 훼 휎 [푘] 푗,0 • ’s head is in⋅ the훼 unique position for which휎 has the form for , and ’s state is equal to . 푖 푀 푖 훼 (휎, 푠) 푠 ∈ [푘] 푀 푠

P Definition 8.8 below has some technical details, but is not actually that deep or complicated. Try to take a moment to stop and think how you would encode as a string the state of a Turing machine at a given point in an execution. Think what are all the components that you need to know in order to be able to continue the execution from this point onwards, and what is a simple way to encode them using a list of finite symbols. In par- ticular, with an eye towards our future applications, try to think of an encoding which will make it as sim- ple as possible to map a configuration at step to the configuration at step . 푡 푡 + 1 Definition 8.8 is a little cumbersome, but ultimately a configuration is simply a string that encodes a snapshot of the Turing machine at a given point in the execution. (In operating-systems lingo, it is a “”.) Such a snapshot needs to encode the following components:

1. The current head position.

2. The full contents of the large scale memory, that is the tape.

3. The contents of the “local registers”, that is the state of the ma- chine.

The precise details of how we encode a configuration are not impor- tant, but we do want to record the following simple fact: Lemma 8.9 Let be a Turing machine and let NEXT be the function that maps a configuration of to the configuration∗ ∗ 푀 at the next step푀 of the execution. Then for every ,∶ theΣ value→ Σ of NEXT only depends on the coordinates푀 . 푖 ∈ ℕ (For푀 simplicity푖 of notation, above we use the푖−1 convention푖 푖+1 that if (훼) 훼 , 훼 , 훼 is “out of bounds”, such as or , then we assume that .) We leave proving Lemma 8.9 as Exercise 8.7. The idea푖 behind∅ the proof is simple:푖 if is neither |훼| in position nor 푖 positions훼 = ( , ⋅) and , then the next-step configuration at will be the same as it was before. Otherwise, we can “read off” the푖 state of the Turing machine푖 − 1 and푖the + 1 value of the tape at the head location푖 from the equivalent models of computation 299

configuration at or one of its neighbors and use that to update what the new state at should be. Completing the full proof is not hard, but doing it is a great푖 way to ensure that you are comfortable with the definition of 푖configurations.

Completing the proof of Theorem 8.7. We can now restate Theorem 8.7 more formally, and complete its proof:

Theorem 8.10 — One dimensional automata are Turing complete (formal statement). For every Turing Machine $M$, if we denote by $\overline{\Sigma}$ the alphabet of its configuration strings, then there is a one-dimensional cellular automaton $r$ over the alphabet $\overline{\Sigma}$ such that

$$NEXT_M(\alpha) = NEXT_r(\alpha) \qquad (8.2)$$

for every configuration $\alpha \in \overline{\Sigma}^*$ of $M$ (again using the convention that we consider $\alpha_i = \varnothing$ if $i$ is "out of bounds").

Proof. We consider the element $(\varnothing, \cdot)$ of $\overline{\Sigma}$ to correspond to the element $\varnothing$ of the automaton $r$. In this case, by Lemma 8.9, the function $NEXT_M$ that maps a configuration of $M$ into the next one is in fact a valid rule for a one dimensional automaton. ■

The automaton arising from the proof of Theorem 8.10 has a large alphabet, and furthermore one whose size depends on the machine $M$ that is being simulated. It turns out that one can obtain an automaton with an alphabet of fixed size that is independent of the program being simulated, and in fact the alphabet of the automaton can be the minimal set $\{0, 1\}$! See Fig. 8.13 for an example of such a Turing-complete automaton.

R Remark 8.11 — Configurations of NAND-TM programs. We can use the same approach as Definition 8.8 to define configurations of a NAND-TM program. Such a configuration will need to encode:

1. The current value of the variable i.
2. For every scalar variable foo, the value of foo.
3. For every array variable Bar, the value Bar[$j$] for every $j \in \{0, \ldots, t-1\}$ where $t - 1$ is the largest value that the index variable i ever achieved in the computation.

Figure 8.13: Evolution of a one dimensional automaton. Each row in the figure corresponds to a configuration. The initial configuration corresponds to the top row and contains only a single "live" cell. This figure corresponds to the "Rule 110" automaton of Stephen Wolfram which is Turing Complete. Figure taken from Wolfram MathWorld.

8.5 THE λ CALCULUS AND FUNCTIONAL PROGRAMMING LANGUAGES

The λ calculus is another way to define computable functions. It was proposed by Alonzo Church in the 1930’s around the same time as Alan Turing’s proposal of the Turing Machine. Interestingly, while Turing Machines are not used for practical computation, the λ calculus has inspired functional programming languages such as LISP, ML and Haskell, and indirectly the development of many other programming languages as well. In this section we will present the λ calculus and show that its power is equivalent to NAND-TM programs (and hence also to Turing machines). Our Github repository contains a Jupyter notebook with a Python implementation of the λ calculus that you can experiment with to get a better feel for this topic.

The λ operator. At the core of the λ calculus is a way to define "anonymous" functions. For example, instead of giving a name $f$ to a function and defining it as

$$f(x) = x \times x \quad (8.3)$$

we can write it as

$$\lambda x.x \times x \quad (8.4)$$

and so $(\lambda x.x \times x)(7) = 49$. That is, you can think of $\lambda x.exp(x)$, where $exp$ is some expression, as a way of specifying the anonymous function $x \mapsto exp(x)$. Anonymous functions, using either $\lambda x.f(x)$, $x \mapsto f(x)$, or other closely related notation, appear in many programming languages. For example, in Python we can define the squaring function using lambda x: x*x while in JavaScript we can use x => x*x or (x) => x*x. In Scheme we would define it as (lambda (x) (* x x)). Clearly, the name of the argument to a function doesn't matter, and so $\lambda y.y \times y$ is the same as $\lambda x.x \times x$, as both correspond to the squaring function.

Dropping parentheses. To reduce notational clutter, when writing λ calculus expressions we often drop the parentheses for function evaluation. Hence instead of writing $f(x)$ for the result of applying the function $f$ to the input $x$, we can also write this as simply $f\ x$. Therefore we can write $(\lambda x.x \times x)\ 7 = 49$. In this chapter, we will use both the $f(x)$ and $f\ x$ notations for function application. Function evaluations are associative and bind from left to right, and hence $f\ g\ h$ is the same as $(f g) h$.

8.5.1 Applying functions to functions
A key feature of the λ calculus is that functions are "first-class objects" in the sense that we can use functions as arguments to other functions.

For example, can you guess what number the following expression is equal to?

$$(((\lambda f.(\lambda y.(f\ (f\ y))))(\lambda x.x \times x))\ 3) \quad (8.5)$$

P The expression (8.5) might seem daunting, but before you look at the solution below, try to break it apart to its components, and evaluate each component at a time. Working out this example would go a long way toward understanding the λ calculus.

Let's evaluate (8.5) one step at a time. As nice as it is for the λ calculus to allow anonymous functions, adding names can be very helpful for understanding complicated expressions. So, let us write $F = \lambda f.(\lambda y.(f(f\ y)))$ and $g = \lambda x.x \times x$. Therefore (8.5) becomes

$$((F\ g)\ 3) \quad (8.6)$$

On input a function $f$, $F$ outputs the function $\lambda y.(f(f\ y))$, or in other words $F\ f$ is the function $y \mapsto f(f(y))$. Our function $g$ is simply $g(x) = x^2$ and so $(F\ g)$ is the function that maps $y$ to $(y^2)^2 = y^4$. Hence $((F\ g)\ 3) = 3^4 = 81$.
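We can check this calculation in Python, whose lambda expressions mirror (8.5) almost symbol for symbol (this snippet is our own illustration, separate from the book's Jupyter notebook):

F = lambda f: lambda y: f(f(y))   # λf.(λy.(f (f y)))
g = lambda x: x * x               # λx.x × x
print(F(g)(3))                    # ((F g) 3) evaluates to (3^2)^2 = 81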

Solved Exercise 8.1 What number does the following expression equal to?

$$(((\lambda x.(\lambda y.x))\ 2)\ 9) \quad (8.7)$$

■

Solution: $\lambda y.x$ is the function that on input $y$ ignores its input and outputs $x$. Hence $(\lambda x.(\lambda y.x))\ 2$ yields the function $y \mapsto 2$ (or, using λ notation, the function $\lambda y.2$). Hence (8.7) is equivalent to $(\lambda y.2)\ 9 = 2$. ■

8.5.2 Obtaining multi-argument functions via Currying
In a λ expression of the form $\lambda x.e$, the expression $e$ can itself involve the λ operator. Thus for example the function

$$\lambda x.(\lambda y.x + y) \quad (8.8)$$

maps $x$ to the function $y \mapsto x + y$. In particular, if we invoke the function (8.8) on $a$ to obtain some function $f$, and then invoke $f$ on $b$, we obtain the value $a + b$.
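In Python, the Currying of (8.8) can be written directly with nested lambdas (our own illustration):

add = lambda x: lambda y: x + y   # λx.(λy.x + y)
f = add(2)        # invoking (8.8) on a = 2 yields the function y ↦ 2 + y
print(f(3))       # 5, i.e. a + b
print(add(2)(3))  # the same computation written as ((add 2) 3)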

We can see that the one-argument function (8.8) corresponding to $a \mapsto (b \mapsto a + b)$ can also be thought of as the two-argument function $(a, b) \mapsto a + b$. Generally, we can use the λ expression $\lambda x.(\lambda y.f(x, y))$ to simulate the effect of a two argument function $(x, y) \mapsto f(x, y)$. This technique is known as Currying. We will use the shorthand $\lambda x, y.e$ for $\lambda x.(\lambda y.e)$. If $f = \lambda x.(\lambda y.e)$ then $(f\ a)\ b$ corresponds to applying $f\ a$ and then invoking the resulting function on $b$, obtaining the result of replacing in $e$ the occurrences of $x$ with $a$ and occurrences of $y$ with $b$. By our rules of associativity, this is the same as $(f\ a\ b)$ which we'll sometimes also write as $f(a, b)$.

Figure 8.14: In the "currying" transformation, we can create the effect of a two parameter function $f(x, y)$ with the λ expression $\lambda x.(\lambda y.f(x, y))$ which on input $x$ outputs a one-parameter function $f_x$ that has $x$ "hardwired" into it and such that $f_x(y) = f(x, y)$. This can be illustrated by a circuit diagram; see Chelsea Voss's site.

8.5.3 Formal description of the λ calculus
We now provide a formal description of the λ calculus. We start with "basic expressions" that contain a single variable such as $x$ or $y$ and build more complex expressions of the form $(e\ e')$ and $\lambda x.e$ where $e, e'$ are expressions and $x$ is a variable identifier. Formally λ expressions are defined as follows:

Definition 8.12 — λ expression. A λ expression is either a single variable identifier or an expression of one of the following forms:

• Application: $e = (e'\ e'')$, where $e'$ and $e''$ are λ expressions.

• Abstraction: $e = \lambda x.(e')$ where $e'$ is a λ expression.

Definition 8.12 is a recursive definition since we defined the concept of λ expressions in terms of itself. This might seem confusing at first, but in fact you have known recursive definitions since you were an elementary school student. Consider how we define an arithmetic expression: it is an expression that is either just a number, or has one of the forms $(e + e')$, $(e - e')$, $(e \times e')$, or $(e \div e')$, where $e$ and $e'$ are other arithmetic expressions.

Free and bound variables. Variables in a λ expression can either be free or bound to a λ operator (in the sense of Section 1.4.7). In a single-variable λ expression $var$, the variable $var$ is free. The set of free and bound variables in an application expression $e = (e'\ e'')$ is the same as that of the underlying expressions $e'$ and $e''$. In an abstraction expression $e = \lambda var.(e')$, all free occurrences of $var$ in $e'$ are bound to the λ operator of $e$. If you find the notion of free and bound variables confusing, you can avoid all these issues by using unique identifiers for all variables.

Precedence and parentheses. We will use the following rules to allow us to drop some parentheses. Function application associates from left to right, and so $fgh$ is the same as $(fg)h$. Function application has a higher precedence than the λ operator, and so $\lambda x.fgx$ is the same

as $\lambda x.((fg)x)$. This is similar to how we use the precedence rules in arithmetic operations to allow us to use fewer parentheses and so write the expression $(7 \times 3) + 2$ as $7 \times 3 + 2$. As mentioned in Section 8.5.2, we also use the shorthand $\lambda x, y.e$ for $\lambda x.(\lambda y.e)$ and the shorthand $f(x, y)$ for $(f\ x)\ y$. This plays nicely with the "Currying" transformation of simulating multi-input functions using λ expressions.

Equivalence of λ expressions. As we have seen in Solved Exercise 8.1, the rule that $(\lambda x.exp)exp'$ is equivalent to $exp[x \to exp']$ enables us to modify λ expressions and obtain simpler equivalent forms for them. Another rule that we can use is that the name of the parameter does not matter and hence for example $\lambda y.y$ is the same as $\lambda z.z$. Together these rules define the notion of equivalence of λ expressions:

Definition 8.13 — Equivalence of λ expressions. Two λ expressions are equivalent if they can be made into the same expression by repeated applications of the following rules:

1. Evaluation (aka β reduction): The expression $(\lambda x.exp)exp'$ is equivalent to $exp[x \to exp']$.

2. Variable renaming (aka α conversion): The expression $\lambda x.exp$ is equivalent to $\lambda y.exp[x \to y]$.

If $exp$ is a λ expression of the form $\lambda x.exp'$ then it naturally corresponds to the function that maps any input $z$ to $exp'[x \to z]$. Hence the λ calculus naturally implies a computational model. Since in the λ calculus the inputs can themselves be functions, we need to decide in what order we evaluate an expression such as

$$(\lambda x.f)(\lambda y.gz) \quad (8.9)$$

There are two natural conventions for this:

• Call by name (aka "lazy evaluation"): We evaluate (8.9) by first plugging in the righthand expression $(\lambda y.gz)$ as input to the lefthand side function, obtaining $f[x \to (\lambda y.gz)]$, and then continue from there.

• Call by value (aka "eager evaluation"): We evaluate (8.9) by first evaluating the righthand side and obtaining $h = g[y \to z]$, and then plugging this into the lefthand side to obtain $f[x \to h]$.

Because the λ calculus has only pure functions, that do not have "side effects", in many cases the order does not matter. In fact, it can be shown that if we obtain a definite irreducible expression (for example, a number) in both strategies, then it will be the same one.

However, for concreteness we will always use the “call by name” (i.e., lazy evaluation) order. (The same choice is made in the programming language Haskell, though many other programming languages use eager evaluation.) Formally, the evaluation of a λ expression using “call by name” is captured by the following process:

Definition 8.14 — Simplification of λ expressions. Let $e$ be a λ expression. The simplification of $e$ is the result of the following recursive process:

1. If $e$ is a single variable $x$ then the simplification of $e$ is $e$.

2. If $e$ has the form $e = \lambda x.e'$ then the simplification of $e$ is $\lambda x.f'$ where $f'$ is the simplification of $e'$.

3. (Evaluation / β reduction.) If $e$ has the form $e = (\lambda x.e')e''$ then the simplification of $e$ is the simplification of $e'[x \to e'']$, which denotes replacing all copies of $x$ in $e'$ bound to the λ operator with $e''$.

4. (Renaming / α conversion.) The canonical simplification of $e$ is obtained by taking the simplification of $e$ and renaming the variables so that the first bound variable in the expression is $v_0$, the second one is $v_1$, and so on and so forth.

We say that two λ expressions $e$ and $e'$ are equivalent, denoted by $e \cong e'$, if they have the same canonical simplification.

Solved Exercise 8.2 — Equivalence of λ expressions. Prove that the following two expressions $e$ and $f$ are equivalent:

푒 푓 (8.10)

푒 = 휆푥.푥

and

$$f = (\lambda a.(\lambda b.b))(\lambda z.zz) \quad (8.11)$$

■

Solution: The canonical simplification of $e$ is simply $\lambda v_0.v_0$. To do the canonical simplification of $f$ we first use β reduction to plug in $\lambda z.zz$ instead of $a$ in $(\lambda b.b)$, but since $a$ is not used in this function at all, we simply obtain $\lambda b.b$, which simplifies to $\lambda v_0.v_0$ as well. ■

8.5.4 Infinite loops in the λ calculus
Like Turing machines and NAND-TM programs, the simplification process in the λ calculus can also enter into an infinite loop. For example, consider the λ expression

$$(\lambda x.xx)\ (\lambda x.xx) \quad (8.12)$$

If we try to simplify (8.12) by invoking the lefthand function on the righthand one, then we get another copy of (8.12) and hence this never ends. There are examples where the order of evaluation can matter for whether or not an expression can be simplified; see Exercise 8.9.
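The analogous expression in Python exhibits the same behavior (our own illustration; Python bounds its call stack, so instead of literally running forever the evaluation aborts with a RecursionError):

omega = lambda x: x(x)   # λx.xx
try:
    omega(omega)         # the analog of (8.12): each call produces another copy
except RecursionError:
    print("the simplification never terminates")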

8.6 THE “ENHANCED” Λ CALCULUS

We now discuss the λ calculus as a computational model. We will start by describing an "enhanced" version of the λ calculus that contains some "superfluous features" but is easier to wrap your head around. We will first show how the enhanced λ calculus is equivalent to Turing machines in computational power. Then we will show how all the features of "enhanced λ calculus" can be implemented as "syntactic sugar" on top of the "pure" (i.e., non-enhanced) λ calculus. Hence the pure λ calculus is equivalent in power to Turing machines (and hence also to RAM machines and all other Turing-equivalent models). The enhanced λ calculus includes the following set of objects and operations:

• Boolean constants and IF function: There are λ expressions $0$, $1$ and IF that satisfy the following conditions: for every λ expressions $e$ and $f$, $\text{IF}\ 1\ e\ f = e$ and $\text{IF}\ 0\ e\ f = f$. That is, IF is the function that given three arguments $a, e, f$ outputs $e$ if $a = 1$ and $f$ if $a = 0$.

• Pairs: There is a λ expression PAIR which we will think of as the pairing function. For every λ expressions $e, f$, $\text{PAIR}\ e\ f$ is the pair $\langle e, f \rangle$ that contains $e$ as its first member and $f$ as its second member. We also have λ expressions HEAD and TAIL that extract the first and second member of a pair respectively. Hence, for every λ expressions $e, f$, $\text{HEAD}(\text{PAIR}\ e\ f) = e$ and $\text{TAIL}(\text{PAIR}\ e\ f) = f$. [Footnote 2: In Lisp, the PAIR, HEAD and TAIL functions are traditionally called cons, car and cdr.]

• Lists and strings: There is a λ expression NIL that corresponds to the empty list, which we also denote by $\langle \bot \rangle$. Using PAIR and NIL we construct lists. The idea is that if $L$ is a $k$ element list of the form $\langle e_1, e_2, \ldots, e_k, \bot \rangle$ then for every λ expression $e_0$ we can obtain the $k + 1$ element list $\langle e_0, e_1, e_2, \ldots, e_k, \bot \rangle$ using the expression $\text{PAIR}\ e_0\ L$. For example, for every three λ expressions $e, f, g$, the

following corresponds to the three element list $\langle e, f, g, \bot \rangle$:

$$\text{PAIR}\ e\ (\text{PAIR}\ f\ (\text{PAIR}\ g\ \text{NIL})) \quad (8.13)$$

The λ expression ISEMPTY returns $1$ on NIL and returns $0$ on every other list. A string is simply a list of bits.

• List operations: The enhanced λ calculus also contains the list-processing functions MAP, REDUCE, and FILTER. Given a list $L = \langle x_0, \ldots, x_{n-1}, \bot \rangle$ and a function $f$, $\text{MAP}\ L\ f$ applies $f$ on every member of the list to obtain the new list $L' = \langle f(x_0), \ldots, f(x_{n-1}), \bot \rangle$. Given a list $L$ as above and an expression $f$ whose output is either $0$ or $1$, $\text{FILTER}\ L\ f$ returns the list $\langle x_i \rangle_{f x_i = 1}$ containing all the elements of $L$ for which $f$ outputs $1$. The function REDUCE applies a "combining" operation to a list. For example, $\text{REDUCE}\ L\ {+}\ 0$ will return the sum of all the elements in the list $L$. More generally, REDUCE takes a list $L$, an operation $f$ (which we think of as taking two arguments) and a λ expression $z$ (which we think of as the "neutral element" for the operation $f$, such as $0$ for addition and $1$ for multiplication). The output is defined via

$$\text{REDUCE}\ L\ f\ z = \begin{cases} z & L = \text{NIL} \\ f\ (\text{HEAD}\ L)\ (\text{REDUCE}\ (\text{TAIL}\ L)\ f\ z) & \text{otherwise} \end{cases} \quad (8.14)$$

See Fig. 8.16 for an illustration of the three list-processing operations.

• Recursion: Finally, we want to be able to execute recursive functions. Since in λ calculus functions are anonymous, we can't write a definition of the form $f(x) = blah$ where $blah$ includes calls to $f$. Instead we use functions that take an additional input $me$ as a parameter. The operator RECURSE will take such a function as input and return a "recursive version" of it where all the calls to $me$ are replaced by recursive calls to this function. That is, if we have a function $F$ taking two parameters $me$ and $x$, then $\text{RECURSE}\ F$ will be the function $f$ taking one parameter $x$ such that $f(x) = F(f, x)$ for every $x$.

Solved Exercise 8.3 — Compute NAND using λ calculus. Give a λ expression $N$ such that $N\ x\ y = \text{NAND}(x, y)$ for every $x, y \in \{0, 1\}$. ■

Solution: The NAND of $x, y$ is equal to $1$ unless $x = y = 1$. Hence we can write

$$N = \lambda x, y.\ \text{IF}(x, \text{IF}(y, 0, 1), 1) \quad (8.15)$$

■

Solved Exercise 8.4 — Compute XOR using λ calculus. Give a λ expression XOR such that for every list $L = \langle x_0, \ldots, x_{n-1}, \bot \rangle$ where $x_i \in \{0, 1\}$ for $i \in [n]$, $\text{XOR}\ L$ evaluates to $\sum_i x_i \mod 2$. ■

Solution: First, we note that we can compute the XOR of two bits as follows:

$$\text{NOT} = \lambda a.\ \text{IF}(a, 0, 1) \quad (8.16)$$

and

$$\text{XOR}_2 = \lambda a, b.\ \text{IF}(b, \text{NOT}(a), a) \quad (8.17)$$

(We are using here a bit of syntactic sugar to describe the functions. To obtain the λ expression for $\text{XOR}_2$ we will simply replace the expression (8.16) in (8.17).) Now recursively we can define the XOR of a list as follows:

$$\text{XOR}(L) = \begin{cases} 0 & L \text{ is empty} \\ \text{XOR}_2(\text{HEAD}(L), \text{XOR}(\text{TAIL}(L))) & \text{otherwise} \end{cases} \quad (8.18)$$

This means that XOR is equal to

$$\text{XOR} = \text{RECURSE}\ (\lambda me, L.\ \text{IF}(\text{ISEMPTY}(L), 0, \text{XOR}_2(\text{HEAD}\ L, me(\text{TAIL}\ L)))) \quad (8.19)$$

That is, XOR is obtained by applying the RECURSE operator to the function that on inputs $me$, $L$, returns $0$ if $\text{ISEMPTY}(L)$ and otherwise returns $\text{XOR}_2$ applied to $\text{HEAD}(L)$ and to $me(\text{TAIL}(L))$. We could have also computed XOR using the REDUCE operation; we leave working this out as an exercise to the reader. ■
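For comparison, Python's built-in map, filter and functools.reduce play the roles of MAP, FILTER, and REDUCE (our own illustration, not part of the book's implementation). Note that (8.14) folds the list from the right while Python's reduce folds from the left; for an associative operation such as XOR the result is the same:

from functools import reduce

L = [1, 0, 1, 1]
print(list(map(lambda x: 1 - x, L)))       # MAP with NOT: [0, 1, 0, 0]
print(list(filter(lambda x: x == 1, L)))   # FILTER: keep the ones: [1, 1, 1]
xor2 = lambda a, b: a ^ b
print(reduce(xor2, L, 0))                  # REDUCE with XOR2 and neutral 0: 1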

8.6.1 Computing a function in the enhanced λ calculus
An enhanced λ expression is obtained by composing the objects above with the application and abstraction rules. The result of simplifying a λ

Figure 8.15: A list $\langle x_0, x_1, x_2 \rangle$ in the λ calculus is constructed from the tail up, building the pair $\langle x_2, \text{NIL} \rangle$, then the pair $\langle x_1, \langle x_2, \text{NIL} \rangle \rangle$, and finally the pair $\langle x_0, \langle x_1, \langle x_2, \text{NIL} \rangle \rangle \rangle$. That is, a list is a pair where the first element of the pair is the first element of the list and the second element is the rest of the list. The figure on the left renders this "pairs inside pairs" construction, though it is often easier to think of a list as a "chain", as in the figure on the right, where the second element of each pair is thought of as a link, pointer or reference to the remainder of the list.

Figure 8.16: Illustration of the MAP, FILTER and REDUCE operations.

expression is an equivalent expression, and hence if two expressions have the same simplification then they are equivalent.

Definition 8.15 — Computing a function via λ calculus. Let $F : \{0,1\}^* \to \{0,1\}^*$. We say that $exp$ computes $F$ if for every $x \in \{0,1\}^*$,

$$exp\ \langle x_0, \ldots, x_{n-1}, \bot \rangle \cong \langle y_0, \ldots, y_{m-1}, \bot \rangle \quad (8.20)$$

where $n = |x|$, $y = F(x)$, and $m = |y|$, and the notion of equivalence is defined as per Definition 8.14.

8.6.2 Enhanced λ calculus is Turing-complete
The basic operations of the enhanced λ calculus more or less amount to the Lisp or Scheme programming languages. Given that, it is perhaps not surprising that the enhanced λ-calculus is equivalent to Turing machines:

Theorem 8.16 — Lambda calculus and NAND-TM. For every function $F : \{0,1\}^* \to \{0,1\}^*$, $F$ is computable in the enhanced λ calculus if and only if it is computable by a Turing machine.

Proof Idea: To prove the theorem, we need to show that (1) if $F$ is computable by a λ calculus expression then it is computable by a Turing machine, and (2) if $F$ is computable by a Turing machine, then it is computable by an enhanced λ calculus expression. Showing (1) is fairly straightforward. Applying the simplification rules to a λ expression basically amounts to "search and replace", which we can implement easily in, say, NAND-RAM, or for that matter Python (both of which are equivalent to Turing machines in power). Showing (2) essentially amounts to simulating a Turing machine (or writing a NAND-TM interpreter) in a functional programming language such as LISP or Scheme. We give the details below, but how this can be done is a good exercise in mastering some functional programming techniques that are useful in their own right.

⋆

Proof of Theorem 8.16. We only sketch the proof. The "if" direction is simple. As mentioned above, evaluating λ expressions basically amounts to "search and replace". It is also a fairly straightforward programming exercise to implement all the above basic operations in an imperative language such as Python or C, and using the same ideas we can do so in NAND-RAM as well, which we can then transform to a NAND-TM program.

For the "only if" direction we need to simulate a Turing machine using a λ expression. We will do so by first showing that for every Turing machine $M$ there is a λ expression to compute the next-step function $\text{NEXT}_M : \Sigma^* \to \Sigma^*$ that maps a configuration of $M$ to the next one (see Section 8.4.2).

A configuration of $M$ is a string $\alpha \in \Sigma^*$ for a finite set $\Sigma$. We can encode every symbol $\sigma \in \Sigma$ by a finite string in $\{0,1\}^\ell$, and so we will encode a configuration $\alpha$ in the λ calculus as a list $\langle \alpha_0, \alpha_1, \ldots, \alpha_{m-1}, \bot \rangle$ where $\alpha_i$ is an $\ell$-length string (i.e., an $\ell$-length list of $0$'s and $1$'s) encoding a symbol in $\Sigma$.

By Lemma 8.9, for every $\alpha \in \Sigma^*$, $\text{NEXT}_M(\alpha)_i$ is equal to $r(\alpha_{i-1}, \alpha_i, \alpha_{i+1})$ for some finite function $r : \Sigma^3 \to \Sigma$. Using our encoding of $\Sigma$ as $\{0,1\}^\ell$, we can also think of $r$ as mapping $\{0,1\}^{3\ell}$ to $\{0,1\}^\ell$. By Solved Exercise 8.3, we can compute the NAND function, and hence every finite function, including $r$, using the λ calculus. Using this insight, we can compute $\text{NEXT}_M$ using the λ calculus as follows. Given a list $L$ encoding the configuration $\alpha_0 \cdots \alpha_{m-1}$, we define the lists $L_{prev}$ and $L_{next}$ encoding the configuration $\alpha$ shifted by one step to the right and left respectively. The next configuration $\alpha'$ is defined as $\alpha'_i = r(L_{prev}[i], L[i], L_{next}[i])$ where we let $L'[i]$ denote

the $i$-th element of $L'$. This can be computed by recursion (and hence using the enhanced λ calculus' RECURSE operator) as follows:

Algorithm 8.17 — $\text{NEXT}_M$ using the λ calculus.

Input: List 푀 encoding a configura- 푁퐸푋푇 tion . 0 1 푚−1 Output: List퐿 =encoding ⟨훼 , 훼 , … , 훼 , ⊥⟩ 1: procedure훼 ComputeNext(′ ) 푀 2: if then퐿 푁퐸푋푇 (훼) 푝푟푒푣 푛푒푥푡 3: return 퐿 , 퐿, 퐿 푝푟푒푣 4: end if 퐼푆퐸푀푃 푇 푌 퐿 5: 푁퐼퐿 6: if then 푝푟푒푣 7: 푎 ← 퐻퐸퐴퐷 퐿 # Encoding of in 8: else 퐼푆퐸푀푃∅ 푇 푌 퐿 ∅ ℓ 9: 푏 ← {0, 1} 10: end if 11: if then푏 ← 퐻퐸퐴퐷 퐿 12: 푛푒푥푡 13: else 퐼푆퐸푀푃∅ 푇 푌 퐿 14: 푐 ← 15: end if 푛푒푥푡 16: return푐 ← 퐻퐸퐴퐷 퐿 ComputeNext 17: end procedure 푝푟푒푣 푛푒푥푡 18: 푃 퐴퐼푅 푟(푎, 푏, 푐) # (푇 퐴퐼퐿 퐿 , 푇 퐴퐼퐿 퐿 , 푇 퐴퐼퐿 퐿 ) 19: ∅ # ∅ 푝푟푒푣 푝푟푒푣 0 푚−1 20: return퐿 ←ComputeNext 푃 퐴퐼푅 퐿 퐿 = ⟨ , 훼 , … , 훼 , ⊥⟩ 푛푒푥푡 푛푒푥푡 1 푚−1 퐿 ← 푇 퐴퐼퐿 퐿 퐿 = ⟨훼 , … , 훼 , ⊥} 푝푟푒푣 푛푒푥푡 Once we can compute NEXT(퐿 , we, 퐿, can 퐿 simulate) the execution of on input using the following recursion. Define FINAL to be 푀 the final configuration of when initialized at configuration . The 푀function FINAL푥 can be defined recursively as follows: (훼) 푀 훼

$$\text{FINAL}(\alpha) = \begin{cases} \alpha & \alpha \text{ is halting configuration} \\ \text{FINAL}(\text{NEXT}_M(\alpha)) & \text{otherwise} \end{cases} \quad (8.21)$$

Checking whether a configuration is halting (i.e., whether it is one in which the transition function would output Halt) can be easily implemented in the λ calculus, and hence we can use the RECURSE operator to compute FINAL. If we let $\alpha^0$ be the initial configuration of $M$ on input $x$ then we can obtain the output $M(x)$ from $\text{FINAL}(\alpha^0)$, hence completing the proof. ■
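To see the shape of this construction in more familiar notation, here is a Python sketch of Algorithm 8.17 and the recursion (8.21) (our own illustration; r stands for the finite local rule from Lemma 8.9 and is_halting for the easily implementable test of whether a configuration is halting):

def next_config(config, r, empty):
    # As in Algorithm 8.17: L_prev prepends the empty symbol and L_next
    # drops the first one; the output can be one symbol longer than the input.
    prev = [empty] + config               # plays the role of L_prev
    cur = config + [empty]
    nxt = config[1:] + [empty, empty]     # plays the role of L_next
    return [r(a, b, c) for a, b, c in zip(prev, cur, nxt)]

def final_config(config, r, empty, is_halting):
    # The recursion of (8.21), written as a loop.
    while not is_halting(config):
        config = next_config(config, r, empty)
    return config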

8.7 FROM ENHANCED TO PURE Λ CALCULUS

While the collection of "basic" functions we allowed for the enhanced λ calculus is smaller than what's provided by most Lisp dialects, coming from NAND-TM it still seems a little "bloated". Can we make do with less? In other words, can we find a subset of these basic operations that can implement the rest? It turns out that there is in fact a proper subset of the operations of the enhanced λ calculus that can be used to implement the rest. That subset is the empty set. That is, we can implement all the operations above using the λ formalism only, even without using $0$'s and $1$'s. It's λ's all the way down!

P This is a good point to pause and think how you would implement these operations yourself. For example, start by thinking how you could implement MAP using REDUCE, and then REDUCE using RECURSE combined with 0, 1, IF, PAIR, HEAD, TAIL, NIL, ISEMPTY. You can also implement PAIR, HEAD and TAIL based on 0, 1, IF. The most challenging part is to implement RECURSE using only the operations of the pure λ calculus.

Theorem 8.18 — Enhanced λ calculus equivalent to pure λ calculus. There are λ expressions that implement the functions $0$, $1$, IF, PAIR, HEAD, TAIL, NIL, ISEMPTY, MAP, REDUCE, and RECURSE.

The idea behind Theorem 8.18 is that we encode $0$ and $1$ themselves as λ expressions, and build things up from there. This is known as Church encoding, as it was originated by Church in his effort to show that the λ calculus can be a basis for all computation. We will not write the full formal proof of Theorem 8.18 but outline the ideas involved in it:

• We define $0$ to be the function that on two inputs $x, y$ outputs $y$, and $1$ to be the function that on two inputs $x, y$ outputs $x$. We use Currying to achieve the effect of two-input functions and hence $0 = \lambda x.\lambda y.y$ and $1 = \lambda x.\lambda y.x$. (This representation scheme is the common convention for representing false and true but there are many other alternative representations for $0$ and $1$ that would have worked just as well.)

• The above implementation makes the IF function trivial: $\text{IF}(cond, a, b)$ is simply $cond\ a\ b$ since $0ab = b$ and $1ab = a$. We can write $\text{IF} = \lambda x.x$ to achieve $\text{IF}(cond, a, b) = (((\text{IF}\ cond)\ a)\ b) = cond\ a\ b$.

• To encode a pair $(x, y)$ we will produce a function $f_{x,y}$ that has $x$ and $y$ "in its belly" and satisfies $f_{x,y}\ g = g\ x\ y$ for every function $g$. That is, $\text{PAIR} = \lambda x, y.(\lambda g.g\,x\,y)$. We can extract the first element of a pair $p$ by writing $p\,1$ and the second element by writing $p\,0$, and so $\text{HEAD} = \lambda p.p\,1$ and $\text{TAIL} = \lambda p.p\,0$.

• We define NIL to be the function that ignores its input and always outputs $1$. That is, $\text{NIL} = \lambda x.1$. The ISEMPTY function checks, given an input $p$, whether we get $1$ if we apply $p$ to the function $zero = \lambda x, y.0$ that ignores both its inputs and always outputs $0$. For every valid pair of the form $p = \text{PAIR}\ x\ y$, $p\ zero = zero\ x\ y = 0$, while $\text{NIL}\ zero = 1$. Formally, $\text{ISEMPTY} = \lambda p.p(\lambda x, y.0)$.
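Since Python's lambdas are close to the pure λ calculus, these Church encodings can be test-driven directly (our own sketch, written in the curried one-argument-at-a-time style):

one = lambda x: lambda y: x    # 1 = λx.λy.x
zero = lambda x: lambda y: y   # 0 = λx.λy.y
IF = lambda cond: cond         # IF = λx.x, so IF cond a b = cond a b

PAIR = lambda x: lambda y: lambda g: g(x)(y)   # λx,y.(λg.g x y)
HEAD = lambda p: p(one)    # λp.p 1
TAIL = lambda p: p(zero)   # λp.p 0

p = PAIR("first")("second")
print(HEAD(p), TAIL(p))           # first second
print(IF(one)("then")("else"))    # then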

R Remark 8.19 — Church numerals (optional). There is nothing special about Boolean values. You can use similar tricks to implement natural numbers using λ terms. The standard way to do so is to represent the number $n$ by the function $\text{ITER}_n$ that on input a function $f$ outputs the function $x \mapsto f(f(\cdots f(x)))$ ($n$ times). That is, we represent the natural number $1$ as $\lambda f.f$, the number $2$ as $\lambda f.(\lambda x.f(fx))$, the number $3$ as $\lambda f.(\lambda x.f(f(fx)))$, and so on and so forth. (Note that this is not the same representation we used for $1$ in the Boolean context: this is fine; we already know that the same object can be represented in more than one way.) The number $0$ is represented by the function that maps any function $f$ to the identity function $\lambda x.x$. (That is, $0 = \lambda f.(\lambda x.x)$.) In this representation, we can compute $\text{PLUS}(n, m)$ as $\lambda f.\lambda x.(nf)((mf)x)$ and $\text{TIMES}(n, m)$ as $\lambda f.n(mf)$. Subtraction and division are trickier, but can be achieved using recursion. (Working this out is a great exercise.)
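The Church numerals and the PLUS and TIMES operations of Remark 8.19 can likewise be tried out in Python; in this sketch of ours, to_int is a helper we add to read off the represented number:

zero = lambda f: lambda x: x             # 0 = λf.(λx.x)
two = lambda f: lambda x: f(f(x))        # 2 = λf.(λx.f(fx))
three = lambda f: lambda x: f(f(f(x)))   # 3 = λf.(λx.f(f(fx)))

PLUS = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))
TIMES = lambda n: lambda m: lambda f: n(m(f))

to_int = lambda n: n(lambda k: k + 1)(0)   # apply "add one" n times to 0
print(to_int(PLUS(two)(three)))    # 5
print(to_int(TIMES(two)(three)))   # 6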

8.7.1 List processing
Now we come to a bigger hurdle, which is how to implement MAP, FILTER, REDUCE and RECURSE in the pure λ calculus. It turns out that we can build MAP and FILTER from REDUCE, and REDUCE from RECURSE. For example $\text{MAP}(L, f)$ is the same as $\text{REDUCE}(L, g)$ where $g$ is the operation that on input $x$ and $y$, outputs $\text{PAIR}(f(x), \text{NIL})$ if $y$ is NIL and otherwise outputs $\text{PAIR}(f(x), y)$. (I leave checking this as a (recommended!) exercise for you, the reader.) We can define REDUCE recursively, by setting $\text{REDUCE}(\text{NIL}, g) = \text{NIL}$ and stipulating that given a non-empty list $L$, which we can think of as a pair $(head, rest)$,

$\text{REDUCE}(L, g) = g(head, \text{REDUCE}(rest, g))$. Thus, we might try to write a recursive λ expression for REDUCE as follows:

$$\text{REDUCE} = \lambda L, g.\ \text{IF}(\text{ISEMPTY}(L), \text{NIL}, g\ (\text{HEAD}\ L)\ (\text{REDUCE}(\text{TAIL}(L), g))) \quad (8.22)$$

The only fly in this ointment is that the λ calculus does not have the notion of recursion, and so this is an invalid definition. But of course we can use our RECURSE operator to solve this problem. We will replace the recursive call to "REDUCE" with a call to a function $me$ that is given as an extra argument, and then apply RECURSE to this. Thus $\text{REDUCE} = \text{RECURSE}\ myREDUCE$ where

$$myREDUCE = \lambda me, L, g.\ \text{IF}(\text{ISEMPTY}(L), \text{NIL}, g\ (\text{HEAD}\ L)\ (me(\text{TAIL}(L), g))) \quad (8.23)$$

8.7.2 The Y combinator, or recursion without recursion
Eq. (8.23) means that implementing MAP, FILTER, and REDUCE boils down to implementing the RECURSE operator in the pure λ calculus. This is what we do now. How can we implement recursion without recursion? We will illustrate this using a simple example - the XOR function. As shown in Solved Exercise 8.4, we can write the XOR function of a list recursively as follows:

$$\text{XOR}(L) = \begin{cases} 0 & L \text{ is empty} \\ \text{XOR}_2(\text{HEAD}(L), \text{XOR}(\text{TAIL}(L))) & \text{otherwise} \end{cases} \quad (8.24)$$

where $\text{XOR}_2 : \{0,1\}^2 \to \{0,1\}$ is the XOR on two bits. In Python we would write this as

def xor2(a,b): return 1-b if a else b
def head(L): return L[0]
def tail(L): return L[1:]

def xor(L): return xor2(head(L),xor(tail(L))) if L else 0

print(xor([0,1,1,0,0,1]))
# 1

Now, how could we eliminate this recursive call? The main idea is that since functions can take other functions as input, it is perfectly legal in Python (and the λ calculus of course) to give a function itself as input. So, our idea is to try to come up with a non recursive function tempxor that takes two inputs: a function and a list, and such that tempxor(tempxor,L) will output the XOR of L! 314 introduction to theoretical computer science

P At this point you might want to stop and try to im- plement this on your own in Python or any other programming language of your choice (as long as it allows functions as inputs).

Our first attempt might be to simply use the idea of replacing the recursive call by me. Let's define this function as myxor:

def myxor(me,L): return xor2(head(L),me(tail(L))) if L else 0

Let's test this out:

myxor(myxor,[1,0,1])

If you do this, you will get the following complaint from the interpreter:

TypeError: myxor() missing 1 required positional argument

The problem is that myxor expects two inputs (a function and a list) while in the call to me we only provided a list. To correct this, we modify the call to also provide the function itself:

def tempxor(me,L): return xor2(head(L),me(me,tail(L))) if L else 0

Note the call me(me,..) in the definition of tempxor: given a function me as input, tempxor will actually call the function me with itself as the first input. If we test this out now, we see that we actually get the right result!

tempxor(tempxor,[1,0,1])
# 0
tempxor(tempxor,[1,0,1,1])
# 1

and so we can define xor(L) as simply return tempxor(tempxor,L). The approach above is not specific to XOR. Given a recursive function f that takes an input x, we can obtain a non-recursive version as follows:

1. Create the function myf that takes a pair of inputs me and x, and replaces recursive calls to f with calls to me.

2. Create the function tempf that converts calls in myf of the form me(x) to calls of the form me(me,x). equivalent models of computation 315

3. The function f(x) will be defined as tempf(tempf,x).

Here is the way we implement the RECURSE operator in Python. It will take a function myf as above, and replace it with a function g such that g(x)=myf(g,x) for every x.

def RECURSE(myf):
    def tempf(me,x): return myf(lambda y: me(me,y),x)
    return lambda x: tempf(tempf,x)

xor = RECURSE(myxor)
print(xor([0,1,1,0,0,1]))
# 1
print(xor([1,1,0,0,1,1,1,1]))
# 0

From Python to the λ calculus. In the λ calculus, a two input function $g$ that takes a pair of inputs $me, y$ is written as $\lambda me.(\lambda y.g)$. So the function $y \mapsto me(me, y)$ is simply written as $me\ me$ and similarly the function $x \mapsto tempf(tempf, x)$ is simply $tempf\ tempf$. (Can you see why?) Therefore the function tempf defined above can be written as λme. myf(me me). This means that if we denote the input of RECURSE by $f$, then $\text{RECURSE}\ myf = tempf\ tempf$ where $tempf = \lambda m.f(m\ m)$, or in other words

$$\text{RECURSE} = \lambda f.((\lambda m.f(m\ m))\ (\lambda m.f(m\ m))) \quad (8.25)$$

The online appendix contains an implementation of the λ calculus using Python. Here is an implementation of the recursive XOR function from that appendix: [Footnote 3: Because of specific issues of Python syntax, in this implementation we use f * g for applying f to g rather than fg, and use λx(exp) rather than λx.exp for abstraction. We also use _0 and _1 for the λ terms for 0 and 1, so as not to confuse them with the Python constants.]

# XOR of two bits
XOR2 = λ(a,b)(IF(a,IF(b,_0,_1),b))

# Recursive XOR with recursive calls replaced by m parameter
myXOR = λ(m,l)(IF(ISEMPTY(l),_0,XOR2(HEAD(l),m(TAIL(l)))))

# Recurse operator (aka Y combinator)
RECURSE = λf((λm(f(m*m)))(λm(f(m*m))))

# XOR function

XOR = RECURSE(myXOR)

#TESTING:

XOR(PAIR(_1,NIL))  # List [1]
# equals 1

XOR(PAIR(_1,PAIR(_0,PAIR(_1,NIL))))  # List [1,0,1]
# equals 0

R Remark 8.20 — The Y combinator. The RECURSE operator above is better known as the Y combinator. It is one of a family of fixed point operators that given a lambda expression $F$, find a fixed point $f$ of $F$ such that $f = F\ f$. If you think about it, XOR is the fixed point of $myXOR$ above. XOR is the function such that for every $x$, if we plug in XOR as the first argument of $myXOR$ then we get back XOR, or in other words $\text{XOR} = myXOR\ \text{XOR}$. Hence finding a fixed point for $myXOR$ is the same as applying RECURSE to it.
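Using the Python definitions of RECURSE and myxor from earlier in this section, this fixed point property can be checked concretely (our own quick sanity check):

xor = RECURSE(myxor)
L = [1, 0, 1, 1]
# Supplying xor itself as the "me" argument of myxor reproduces xor:
print(myxor(xor, L) == xor(L))   # True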

8.8 THE CHURCH-TURING THESIS (DISCUSSION) “[In 1934], Church had been speculating, and finally definitely proposed, that the λ-definable functions are all the effectively calculable functions ….When Church proposed this thesis, I sat down to disprove it … but, quickly realizing that [my approach failed], I became overnight a supporter of the thesis.”, Stephen Kleene, 1979.

"[The thesis is] not so much a definition or an axiom but … a natural law.", Emil Post, 1936.

We have defined functions to be computable if they can be computed by a NAND-TM program, and we've seen that the definition would remain the same if we replaced NAND-TM programs by Python programs, Turing machines, λ calculus, cellular automata, and many other computational models. The Church-Turing thesis is that this is the only sensible definition of "computable" functions. Unlike the "Physical Extended Church-Turing Thesis" (PECTT) which we saw before, the Church-Turing thesis does not make a concrete physical prediction that can be experimentally tested, but it certainly motivates predictions such as the PECTT. One can think of the Church-Turing Thesis as either advocating a definitional choice, making some prediction about all potential computing devices, or suggesting some laws of nature that constrain the natural world. In Scott Aaronson's

words, "whatever it is, the Church-Turing thesis can only be regarded as extremely successful". No candidate computing device (including quantum computers, and also much less reasonable models such as the hypothetical "closed time curve" computers we mentioned before) has so far mounted a serious challenge to the Church-Turing thesis. These devices might potentially make some computations more efficient, but they do not change the difference between what is finitely computable and what is not. (The extended Church-Turing thesis, which we discuss in Section 13.3, stipulates that Turing machines capture also the limit of what can be efficiently computable. Just like its physical version, quantum computing presents the main challenge to this thesis.)

8.8.1 Different models of computation
We can summarize the models we have seen in the following table:

Table 8.1: Different models for computing finite functions and functions with arbitrary input length.

Computational problems | Type of model | Examples
Finite functions $f : \{0,1\}^n \to \{0,1\}^m$ | Non uniform computation (algorithm depends on input length) | Boolean circuits, NAND circuits, straight-line programs (e.g., NAND-CIRC)
Functions with unbounded inputs $F : \{0,1\}^* \to \{0,1\}^*$ | Sequential access to memory | Turing machines, NAND-TM programs
-- | Indexed access / RAM | RAM machines, NAND-RAM, modern programming languages
-- | Other | Lambda calculus, cellular automata

Later on in Chapter 17 we will study memory bounded computation. It turns out that NAND-TM programs with a constant amount of memory are equivalent to the model of finite automata (the adjectives "deterministic" or "nondeterministic" are sometimes added as well; this model is also known as finite state machines), which in turn captures the notion of regular languages (those that can be described by regular expressions), which is a concept we will see in Chapter 10.

Chapter Recap

• While we defined computable functions using Turing machines, we could just as well have done so using many other models, including not just NAND-TM programs but also RAM machines, NAND-RAM, the λ-calculus, cellular automata and many other models.

• Very simple models turn out to be "Turing complete" in the sense that they can simulate arbitrarily complex computation.

8.9 EXERCISES

Exercise 8.1 — Alternative proof for TM/RAM equivalence. Let $\text{SEARCH} : \{0,1\}^* \to \{0,1\}^*$ be the following function. The input is a pair $(L, k)$ where $k \in \{0,1\}^*$ and $L$ is an encoding of a list of key/value pairs $(k_0, v_0), \ldots, (k_{m-1}, v_{m-1})$ where $k_0, \ldots, k_{m-1}, v_0, \ldots, v_{m-1}$ are binary strings. The output is $v_i$ for the smallest $i$ such that $k_i = k$, if such $i$ exists, and otherwise the empty string.

1. Prove that SEARCH is computable by a Turing machine.

2. Let $\text{UPDATE}(L, k, v)$ be the function whose input is a list $L$ of pairs, and whose output is the list $L'$ obtained by prepending the pair $(k, v)$ to the beginning of $L$. Prove that UPDATE is computable by a Turing machine.

3. Suppose we encode the configuration of a NAND-RAM program by a list $L$ of key/value pairs where the key is either the name of a scalar variable foo or of the form Bar[$i$] for some number $i$, and it contains all the nonzero values of variables. Let $\text{NEXT}(L)$ be the function that maps a configuration of a NAND-RAM program at one step to the configuration in the next step. Prove that NEXT is computable by a Turing machine (you don't have to implement each one of the arithmetic operations: it is enough to implement addition and multiplication).

4. Prove that for every $F : \{0,1\}^* \to \{0,1\}^*$ that is computable by a NAND-RAM program, $F$ is computable by a Turing machine. ■

Exercise 8.2 — NAND-TM lookup. This exercise shows part of the proof that NAND-TM can simulate NAND-RAM. Produce the code of a NAND-TM program that computes the function $\text{LOOKUP} : \{0,1\}^* \to \{0,1\}$ that is defined as follows. On input $pf(i)x$, where $pf(i)$ denotes a prefix-free encoding of an integer $i$, $\text{LOOKUP}(pf(i)x) = x_i$ if $i < |x|$

and $\text{LOOKUP}(pf(i)x) = 0$ otherwise. (We don't care what LOOKUP outputs on inputs that are not of this form.) You can choose any prefix-free encoding of your choice, and also can use your favorite programming language to produce this code. ■

Exercise 8.3 — Pairing. Let $embed : \mathbb{N}^2 \to \mathbb{N}$ be the function defined as $embed(x_0, x_1) = \frac{1}{2}(x_0 + x_1)(x_0 + x_1 + 1) + x_1$.

1. Prove that for every $x_0, x_1 \in \mathbb{N}$, $embed(x_0, x_1)$ is indeed a natural number.

2. Prove that $embed$ is one-to-one.

3. Construct a NAND-TM program $P$ such that for every $x_0, x_1 \in \mathbb{N}$, $P(pf(x_0)pf(x_1)) = pf(embed(x_0, x_1))$, where $pf$ is the prefix-free encoding map defined above. You can use the syntactic sugar for inner loops, conditionals, and incrementing/decrementing the counter.

4. Construct NAND-TM programs $P_0, P_1$ such that for every $x_0, x_1 \in \mathbb{N}$ and $i \in \{0, 1\}$, $P_i(pf(embed(x_0, x_1))) = pf(x_i)$. You can use the syntactic sugar for inner loops, conditionals, and incrementing/decrementing the counter. ■

Exercise 8.4 — Shortest Path. Let $\text{SHORTPATH} : \{0,1\}^* \to \{0,1\}^*$ be the function that on input a string encoding a triple $(G, u, v)$ outputs a string encoding $\infty$ if $u$ and $v$ are disconnected in $G$, or a string encoding the length $k$ of the shortest path from $u$ to $v$. Prove that SHORTPATH is computable by a Turing machine. See footnote for hint. [Footnote 4: You don't have to give a full description of a Turing machine: use our "have the cake and eat it too" paradigm to show the existence of such a machine by arguing from more powerful equivalent models.] ■

Exercise 8.5 — Longest Path. Let $\text{LONGPATH} : \{0,1\}^* \to \{0,1\}^*$ be the function that on input a string encoding a triple $(G, u, v)$ outputs a string encoding $\infty$ if $u$ and $v$ are disconnected in $G$, or a string encoding the length $k$ of the longest simple path from $u$ to $v$. Prove that LONGPATH is computable by a Turing machine. See footnote for hint. [Footnote 5: The same hint as in Exercise 8.4 applies. Note that for showing that LONGPATH is computable you don't have to give an efficient algorithm.] ■

Exercise 8.6 — Shortest path λ expression. Let SHORTPATH be as in Exercise 8.4. Prove that there exists a λ expression that computes SHORTPATH. You can use Exercise 8.4. ■

Exercise 8.7 — Next-step function is local. Prove Lemma 8.9 and use it to complete the proof of Theorem 8.7. ■

Exercise 8.8 — λ calculus requires at most three variables. Prove that for every λ-expression $e$ with no free variables there is an equivalent λ-expression $f$ that only uses the variables $x$, $y$, and $z$. [Footnote 6: Hint: You can reduce the number of variables a function takes by "pairing them up". That is, define a λ expression PAIR such that for every $x, y$, $\text{PAIR}\ x\ y$ is some function $f$ such that $f\,0 = x$ and $f\,1 = y$. Then use PAIR to iteratively reduce the number of variables used.] ■

Exercise 8.9 — Evaluation order example in λ calculus. 1. Let $e = (\lambda x.7)\ ((\lambda x.xx)(\lambda x.xx))$. Prove that the simplification process of $e$ ends in a definite number if we use the "call by name" evaluation order while it never ends if we use the "call by value" order.

2. (bonus, challenging) Let $e$ be any λ expression. Prove that if the simplification process ends in a definite number if we use the "call by value" order, then it also ends in such a number if we use the "call by name" order. See footnote for hint. [Footnote 7: Use structural induction on the expression $e$.] ■

Exercise 8.10 — Zip function. Give an enhanced λ calculus expression to compute the function $zip$ that on input a pair of lists $I$ and $L$ of the same length $n$, outputs a list of $n$ pairs $M$ such that the $j$-th element of $M$ (which we denote by $M_j$) is the pair $(I_j, L_j)$. Thus $zip$ "zips together" these two lists of elements into a single list of pairs. [Footnote 8: The name $zip$ is a common name for this operation, for example in Python. It should not be confused with the zip compression file format.] ■

Exercise 8.11 — Next-step function without RECURSE. Let $M$ be a Turing machine. Give an enhanced λ calculus expression to compute the next-step function $\text{NEXT}_M$ of $M$ (as in the proof of Theorem 8.16) without using RECURSE. See footnote for hint. [Footnote 9: Use MAP and REDUCE (and potentially FILTER). You might also find the $zip$ function of Exercise 8.10 useful.] ■

Exercise 8.12 — λ calculus to NAND-TM compiler (challenging). Give a program in the programming language of your choice that takes as input a λ expression $e$ and outputs a NAND-TM program $P$ that computes the same function as $e$. For partial credit you can use the GOTO and all NAND-CIRC syntactic sugar in your output program. You can use any encoding of λ expressions as binary strings that is convenient for you. See footnote for hint. [Footnote 10: Try to set up a procedure such that if array Left contains an encoding of a λ expression $\lambda x.e$ and array Right contains an encoding of another λ expression $e'$, then the array Result will contain $e[x \to e']$.] ■

Exercise 8.13 — At least two in λ calculus. Let $1 = \lambda x, y.x$ and $0 = \lambda x, y.y$ as before. Define

$$\text{ALT} = \lambda a, b, c.(a(b1(c10))(bc0)) \quad (8.26)$$

Prove that ALT is a λ expression that computes the at least two function. That is, for every $a, b, c \in \{0, 1\}$ (as encoded above), $\text{ALT}\ a\ b\ c = 1$ if and only if at least two of $\{a, b, c\}$ are equal to $1$. ■

Exercise 8.14 — Locality of next-step function. This question will help you get a better sense of the notion of locality of the next-step function of Turing Machines. This locality plays an important role in results such as the Turing completeness of the λ calculus and one dimensional cellular automata, as well as results such as Gödel's Incompleteness Theorem and the Cook-Levin theorem that we will see later in this course. Define STRINGS to be a programming language that has the following semantics:

• A STRINGS program $Q$ has a single string variable str that is both the input and the output of $Q$. The program has no loops and no other variables, but rather consists of a sequence of conditional search-and-replace operations that modify str.

• The operations of a STRINGS program are:

– REPLACE(pattern1,pattern2) where pattern1 and pattern2 are fixed strings. This replaces the first occurrence of pattern1 in str with pattern2.

– if search(pattern) { code } executes code if pattern is a substring of str. The code code can itself include nested if's. (One can also add an else { ... } to execute if pattern is not a substring of str.)

– the returned value is str

• A STRINGS program $Q$ computes a function $F : \{0,1\}^* \to \{0,1\}^*$ if for every $x \in \{0,1\}^*$, if we initialize str to $x$ and then execute the sequence of instructions in $Q$, then at the end of the execution str equals $F(x)$.

For example, the following is a STRINGS program that computes the function $F : \{0,1\}^* \to \{0,1\}^*$ such that for every $x \in \{0,1\}^*$, if $x$ contains a substring of the form $y = 11ab11$ where $a, b \in \{0,1\}$, then $F(x) = x'$ where $x'$ is obtained by replacing the first occurrence of $y$ in $x$ with $00$.

if search('110011') { replace('110011','00') }
else if search('110111') { replace('110111','00') }
else if search('111011') { replace('111011','00') }
else if search('111111') { replace('111111','00') }

Prove that for every Turing Machine $M$, there exists a STRINGS program $Q$ that computes the $\text{NEXT}_M$ function that maps every string encoding a valid configuration of $M$ to the string encoding the configuration of the next step of $M$'s computation. (We don't care what the function will do on strings that do not encode a valid configuration.) You don't have to write the STRINGS program fully, but you do need to give a convincing argument that such a program exists. ■

8.10 BIBLIOGRAPHICAL NOTES

Chapter 7 in the wonderful book of Moore and Mertens [MM11] contains a great exposition of much of this material.

The RAM model can be very useful in studying the concrete complexity of practical algorithms. Its theoretical study was initiated in [CR73]. However, the exact set of operations that are allowed in the RAM model and their costs vary between texts and contexts. One needs to be careful in making such definitions, especially if the word size grows, as was already shown by Shamir [Sha79]. Chapter 3 in Savage's book [Sav98] contains a more formal description of RAM machines, see also the paper [Hag98]. A study of RAM algorithms that are independent of the input size (known as the "transdichotomous RAM model") was initiated by [FW93].

The models of computation we considered so far are inherently sequential, but these days much computation happens in parallel, whether using multi-core processors or in distributed computation in data centers or over the Internet. Parallel computing is important in practice, but it does not really make much difference for the question of what can and can't be computed. After all, if a computation can be performed using $m$ machines in $t$ time, then it can be computed by a single machine in time $mt$.

The λ-calculus was described by Church in [Chu41]. Pierce's book [Pie02] is a canonical textbook, see also [Bar84]. The "Currying technique" is named after the logician Haskell Curry (the Haskell programming language is named after Haskell Curry as well). Curry himself attributed this concept to Moses Schönfinkel, though for some reason the term "Schönfinkeling" never caught on.

Unlike most programming languages, the pure λ-calculus doesn't have the notion of types. Every object in the λ calculus can also be thought of as a λ expression and hence as a function that takes one input and returns one output. All functions take one input and return one output, and if you feed a function an input of a form it didn't expect, it still evaluates the λ expression via "search and replace", replacing all instances of its parameter with copies of the input expres-

sion you fed it. Typed variants of the λ calculus are objects of intense research, and are strongly related to type systems for programming language and computer-verifiable proof systems, see [Pie02]. Some of the typed variants of the λ calculus do not have infinite loops, which makes them very useful as ways of enabling static analysis of programs as well as computer-verifiable proofs. We will come back to this point in Chapter 10 and Chapter 22.

Tao has proposed showing the Turing completeness of fluid dynamics (a "water computer") as a way of settling the question of the behavior of the Navier-Stokes equations, see this popular article.

Learning Objectives:
• The universal machine/program - "one program to rule them all"
• A fundamental result in computer science and mathematics: the existence of uncomputable functions.
• The halting problem: the canonical example of an uncomputable function.
• Introduction to the technique of reductions.
• Rice's Theorem: A "meta tool" for uncomputability results, and a starting point for much of the research on compilers, programming languages, and software verification.

9
Universality and uncomputability

“A function of a variable quantity is an analytic expression composed in any way whatsoever of the variable quantity and numbers or constant quantities.”, Leonhard Euler, 1748.

“The importance of the universal machine is clear. We do not need to have an infinity of different machines doing different jobs. … The engineering problem of producing various machines for various jobs is replaced by the office work of ‘programming’ the universal machine”, Alan Turing, 1948

One of the most significant results we showed for Boolean circuits (or equivalently, straight-line programs) is the notion of universality: there is a single circuit that can evaluate all other circuits. However, this result came with a significant caveat. To evaluate a circuit of $s$ gates, the universal circuit needed to use a number of gates larger than $s$. It turns out that uniform models such as Turing machines or NAND-TM programs allow us to "break out of this cycle" and obtain a truly universal Turing machine $U$ that can evaluate all other machines, including machines that are more complex (e.g., more states) than $U$ itself. (Similarly, there is a universal NAND-TM program $U'$ that can evaluate all NAND-TM programs, including programs that have more lines than $U'$.)

It is no exaggeration to say that the existence of such a universal program/machine underlies the information technology revolution that began in the latter half of the 20th century (and is still ongoing). Up to that point in history, people have produced various special-purpose calculating devices such as the abacus, the slide ruler, and machines that compute various trigonometric series. But as Turing (who was perhaps the one to see most clearly the ramifications of universality) observed, a general purpose computer is much more powerful. Once we build a device that can compute the single universal function, we have the ability, via software, to extend it to do arbitrary computations. For example, if we want to simulate a new Turing machine $M$, we do not need to build a new physical machine, but rather


can represent $M$ as a string (i.e., using code) and then input $M$ to the universal machine $U$.

Beyond the practical applications, the existence of a universal algorithm also has surprising theoretical ramifications, and in particular can be used to show the existence of uncomputable functions, upending the intuitions of mathematicians over the centuries from Euler to Hilbert. In this chapter we will prove the existence of the universal program, and also show its implications for uncomputability, see Fig. 9.1.

This chapter: A non-mathy overview
In this chapter we will see two of the most important results in Computer Science:

1. The existence of a universal Turing machine: a single algorithm that can evaluate all other algorithms,

2. The existence of uncomputable functions: functions (including the famous "Halting problem") that cannot be computed by any algorithm.

Along the way, we develop the technique of reductions as a way to show hardness of computing a function. A reduction gives a way to compute a certain function using "wishful thinking" and assuming that another function can be computed. Reductions are of course widely used in programming - we often obtain an algorithm for one task by using another task as a "black box" subroutine. However we will use it in the "contrapositive": rather than using a reduction to show that the former task is "easy", we use them to show that the latter task is "hard". Don't worry if you find this confusing - reductions are initially confusing - but they can be mastered with time and practice.

9.1 UNIVERSALITY OR A META-CIRCULAR EVALUATOR

We start by proving the existence of a universal Turing machine. This is a single Turing machine $U$ that can evaluate arbitrary Turing machines $M$ on arbitrary inputs $x$, including machines $M$ that can have more states and larger alphabet than $U$ itself. In particular, $U$ can even be used to evaluate itself! This notion of self reference will appear time and again in this book, and as we will see, leads to several counter-intuitive phenomena in computing.

Figure 9.1: In this chapter we will show the existence of a universal Turing machine and then use this to derive first the existence of some uncomputable function. We then use this to derive the uncomputability of Turing's famous "halting problem" (i.e., the HALT function), from which we derive a host of other uncomputability results. We also introduce reductions, which allow us to use the uncomputability of a function $F$ to derive the uncomputability of a new function $G$.

Theorem 9.1 — Universal Turing Machine. There exists a Turing machine $U$ such that on every string $M$ which represents a Turing machine, and $x \in \{0,1\}^*$, $U(M, x) = M(x)$.

That is, if the machine $M$ halts on $x$ and outputs some $y \in \{0,1\}^*$ then $U(M, x) = y$, and if $M$ does not halt on $x$ (i.e., $M(x) = \bot$) then $U(M, x) = \bot$.

Big Idea 11 There is a "universal" algorithm that can evaluate arbitrary algorithms on arbitrary inputs.

Proof Idea: Once you understand what the theorem says, it is not that hard to prove. The desired program $U$ is an interpreter for Turing machines. That is, $U$ gets a representation of the machine $M$ (think of it as source code), and some input $x$, and needs to simulate the execution of $M$ on $x$.

Figure 9.2: A Universal Turing Machine is a single Turing Machine $U$ that can evaluate, given input the (description as a string of) arbitrary Turing machine $M$ and input $x$, the output of $M$ on $x$. In contrast to the universal circuit depicted in Fig. 5.6, the machine $M$ can be much more complex (e.g., more states or tape alphabet symbols) than $U$.

Think of how you would code $U$ in your favorite programming language. First, you would need to decide on some representation scheme for $M$. For example, you can use an array or a dictionary to encode $M$'s transition function. Then you would use some data structure, such as a list, to store the contents of $M$'s tape. Now you can simulate $M$ step by step, updating the data structure as you go along. The interpreter will continue the simulation until the machine halts. Once you do that, translating this interpreter from your favorite programming language to a Turing machine can be done just as we have seen in Chapter 8. The end result is what's known as a "meta-

circular evaluator”: an interpreter for a programming language in the same one. This is a concept that has a long history in computer science starting from the original universal Turing machine. See also Fig. 9.3.

⋆

9.1.1 Proving the existence of a universal Turing Machine
To prove (and even properly state) Theorem 9.1, we need to fix some representation for Turing machines as strings. One potential choice for such a representation is to use the equivalence between Turing machines and NAND-TM programs and hence represent a Turing machine $M$ using the ASCII encoding of the source code of the corresponding NAND-TM program $P$. However, we will use a more direct encoding.

Definition 9.2 — String representation of Turing Machine. Let $M$ be a Turing machine with $k$ states and a size-$\ell$ alphabet $\Sigma = \{\sigma_0, \ldots, \sigma_{\ell-1}\}$ (we use the convention $\sigma_0 = 0$, $\sigma_1 = 1$, $\sigma_2 = \varnothing$, $\sigma_3 = \triangleright$). We represent $M$ as the triple $(k, \ell, T)$ where $T$ is the table of values for $\delta_M$:

$$T = \left(\delta_M(0, \sigma_0), \delta_M(0, \sigma_1), \ldots, \delta_M(k-1, \sigma_{\ell-1})\right) \quad (9.1)$$

where each value $\delta_M(s, \sigma)$ is a triple $(s', \sigma', d)$ with $s' \in [k]$, $\sigma' \in \Sigma$, and $d$ a number in $\{0, 1, 2, 3\}$ encoding one of {L, R, S, H}. Thus such a machine $M$ is encoded by a list of $2 + 3k \cdot \ell$ natural numbers. The string representation of $M$ is obtained by concatenating a prefix-free representation of all these integers. If a string $\alpha \in \{0,1\}^*$ does not represent a list of integers in the form above, then we treat it as representing the trivial Turing machine with one state that immediately halts on every input.
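For concreteness, here is a small Python sketch (our own, not the book's canonical encoder) that flattens a machine's transition table into the list of $2 + 3k\ell$ natural numbers of Definition 9.2, leaving the final prefix-free encoding of each integer abstract:

def represent(k, ell, delta):
    # delta maps (state, symbol index) to (new state, new symbol index, d)
    # where d in {0,1,2,3} encodes L, R, S, H respectively.
    nums = [k, ell]
    for s in range(k):
        for sigma in range(ell):
            new_s, new_sigma, d = delta[(s, sigma)]
            nums += [new_s, new_sigma, d]
    return nums   # 2 + 3*k*ell numbers, each to be prefix-free encoded

# The trivial one-state machine that halts immediately on every input:
k, ell = 1, 4
delta = {(0, sigma): (0, sigma, 3) for sigma in range(ell)}   # 3 encodes "H"
print(represent(k, ell, delta))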

R Remark 9.3 — Take away points of representation. The details of the representation scheme of Turing machines as strings are immaterial for almost all applications. What you need to remember are the following points:

1. We can represent every Turing machine as a string.

2. Given the string representation of a Turing machine $M$ and an input $x$, we can simulate $M$'s execution on the input $x$. (This is the content of Theorem 9.1.)

An additional minor issue is that for convenience we make the assumption that every string represents some Turing machine. This is very easy to ensure by just mapping strings that would otherwise not represent a

Turing machine into some fixed trivial machine. This assumption is not very important, but does make a few results (such as Rice’s Theorem: Theorem 9.15) a little less cumbersome to state.

Using this representation, we can formally prove Theorem 9.1.

Proof of Theorem 9.1. We will only sketch the proof, giving the major ideas. First, we observe that we can easily write a Python program that, on input a representation (k, ℓ, T) of a Turing machine M and an input x, evaluates M on x. Here is the code of this program for concreteness, though you can feel free to skip it if you are not familiar with (or interested in) Python:

def EVAL(δ,x):
    """Evaluate TM given by transition table δ on input x"""
    Tape = ["▷"] + [a for a in x]   # tape starts with ▷ followed by x
    i = 0; s = 0                    # i = head position, s = state
    while True:
        s, Tape[i], d = δ[(s,Tape[i])]
        if d == "H": break                    # halt
        if d == "L": i = max(i-1,0)           # move left (not past start)
        if d == "R": i += 1                   # move right
        if i >= len(Tape): Tape.append('∅')   # extend tape with a blank

    j = 1; Y = []                   # produce output: cells after ▷
    while j < len(Tape) and Tape[j] != '∅':
        Y.append(Tape[j])
        j += 1
    return Y
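For example, here is a (purely illustrative, not from the text) transition table for a one-state machine over the alphabet {▷, 0, 1, ∅} that just scans right until it reaches a blank and halts, so that EVAL returns its input unchanged:

δ = { (0,"▷"): (0,"▷","R"),
      (0,"0"): (0,"0","R"),
      (0,"1"): (0,"1","R"),
      (0,"∅"): (0,"∅","H") }

print(EVAL(δ, "101"))   # prints ['1', '0', '1']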

On input a transition table δ this program will simulate the corresponding machine M step by step, at each point maintaining the invariant that the array Tape contains the contents of M's tape, and the variable s contains M's current state.

The above does not prove the theorem as stated, since we need to show a Turing machine that computes EVAL rather than a Python program. With enough effort, we can translate this Python code line by line to a Turing machine. However, to prove the theorem we don't need to do this, but can use our "have your cake and eat it too" paradigm. That is, while we need to evaluate a Turing machine, in writing the code for the interpreter we are allowed to use a richer model such as NAND-RAM, since it is equivalent in power to Turing machines (per Theorem 8.1).

Translating the above Python code to NAND-RAM is truly straightforward. The only issue is that NAND-RAM doesn't have the dictionary data structure built in, which we have used above to store the transition function δ. However, we can represent a dictionary of the form {key_0: val_0, …, key_{m−1}: val_{m−1}} as simply a list of pairs. To compute D[k] we can scan over all the pairs until we find one of the form (k, v), in which case we return v. Similarly we scan the list to update the dictionary with a new value, either modifying an existing pair or appending the pair (key, val) at the end. ■
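The dictionary representation used in the proof can be sketched in Python as follows (the helper names assoc_get and assoc_set are ours, not from the text):

def assoc_get(D, k):
    """Look up key k in a dictionary represented as a list of
    (key, val) pairs, by scanning the pairs one by one."""
    for key, val in D:
        if key == k:
            return val
    raise KeyError(k)

def assoc_set(D, k, v):
    """Update key k to value v, modifying an existing pair if present
    and otherwise appending (k, v) at the end."""
    for i, (key, _) in enumerate(D):
        if key == k:
            D[i] = (k, v)
            return
    D.append((k, v))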

R Remark 9.4 — Efficiency of the simulation. The argument in the proof of Theorem 9.1 is a very inefficient way to implement the dictionary data structure in practice, but it suffices for the purpose of proving the theorem. Reading and writing to a dictionary of m values in this implementation takes Ω(m) steps, but it is in fact possible to do this in O(log m) steps using a search tree data structure, or even in O(1) steps (for "typical" instances) using a hash table. NAND-RAM and RAM machines correspond to the architecture of modern electronic computers, and so we can implement hash tables and search trees in NAND-RAM just as they are implemented in other programming languages.

The construction above yields a universal Turing Machine with a very large number of states. However, since universal Turing machines have such a philosophical and technical importance, researchers have attempted to find the smallest possible universal Turing machines, see Section 9.7.

9.1.2 Implications of universality (discussion)

There is more than one Turing machine U that satisfies the conditions of Theorem 9.1, but the existence of even a single such machine is already extremely fundamental to both the theory and practice of computer science. Theorem 9.1's impact reaches beyond the particular model of Turing machines. Because we can simulate every Turing Machine by a NAND-TM program and vice versa, Theorem 9.1 immediately implies there exists a universal NAND-TM program P_U such that P_U(P, x) = P(x) for every NAND-TM program P. We can also "mix and match" models. For example, since we can simulate every NAND-RAM program by a Turing machine, and every Turing Machine by the λ calculus, Theorem 9.1 implies that there exists a λ expression e such that for every NAND-RAM program P and input x on which P(x) = y, if we encode (P, x) as a λ-expression f (using the

Figure 9.3: a) A particularly elegant example of a "meta-circular evaluator" comes from John McCarthy's 1960 paper, where he defined the Lisp programming language and gave a Lisp function that evaluates an arbitrary Lisp program (see above). Lisp was not initially intended as a practical programming language and this example was merely meant as an illustration that the Lisp universal function is more elegant than the universal Turing machine. It was McCarthy's graduate student Steve Russell who suggested that it can be implemented. As McCarthy later recalled, "I said to him, ho, ho, you're confusing theory with practice, this eval is intended for reading, not for computing. But he went ahead and did it. That is, he compiled the eval in my paper into IBM 704 machine code, fixing a bug, and then advertised this as a Lisp interpreter, which it certainly was". b) A self-replicating C program from the classic essay of Thompson [Tho84].

λ-calculus encoding of strings as lists of 0's and 1's) then (e f) evaluates to an encoding of y. More generally we can say that for every 𝒳 and 𝒴 in the set {Turing Machines, RAM Machines, NAND-TM, NAND-RAM, λ-calculus, JavaScript, Python, …} of Turing equivalent models, there exists a program/machine in 𝒳 that computes the map (P, x) ↦ P(x) for every program/machine P ∈ 𝒴.

The idea of a "universal program" is of course not limited to theory. For example, compilers for programming languages are often used to compile themselves, as well as programs more complicated than the compiler. (An extreme example of this is Fabrice Bellard's Obfuscated Tiny C Compiler, which is a C program of 2048 bytes that can compile a large subset of the C programming language, and in particular can compile itself.) This is also related to the fact that it is possible to write a program that can print its own source code (a small Python example is sketched below), see Fig. 9.3. There are universal Turing machines known that require a very small number of states or alphabet symbols, and in particular there is a universal Turing machine (with respect to a particular choice of representing Turing machines as strings) whose tape alphabet is {▷, ∅, 0, 1} and has fewer than 25 states (see Section 9.7).
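As an aside, here is one standard way to write such a self-printing program in Python (this particular two-line "quine" is folklore, not taken from the text):

s = 's = %r\nprint(s %% s)'
print(s % s)   # the output is exactly the program's own source code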

9.2 IS EVERY FUNCTION COMPUTABLE?

In Theorem 4.12, we saw that NAND-CIRC programs can compute every finite function f: {0,1}^n → {0,1}. Therefore a natural guess is that NAND-TM programs (or equivalently, Turing Machines) could compute every infinite function F: {0,1}* → {0,1}. However, this turns out to be false. That is, there exists a function F: {0,1}* → {0,1} that is uncomputable!

The existence of uncomputable functions is quite surprising. Our intuitive notion of a "function" (and the notion most mathematicians had until the 20th century) is that a function f defines some implicit or explicit way of computing the output f(x) from the input x. The notion of an "uncomputable function" thus seems to be a contradiction in terms, but yet the following theorem shows that such creatures do exist:

Theorem 9.5 — Uncomputable functions. There exists a function F*: {0,1}* → {0,1} that is not computable by any Turing machine.

Proof Idea: The idea behind the proof follows quite closely Cantor's proof that the reals are uncountable (Theorem 2.5), and in fact the theorem can also be obtained fairly directly from that result (see Exercise 7.11). However, it is instructive to see the direct proof. The idea is to construct F* in a way that will ensure that every possible machine M will in fact fail to compute F*. We do so by defining F*(x) to equal 0 if x describes a Turing machine M which satisfies M(x) = 1, and defining F*(x) = 1 otherwise. By construction, if M is any Turing machine and x is the string describing it, then F*(x) ≠ M(x) and therefore M does not compute F*. ⋆

Proof of Theorem 9.5. The proof is illustrated in Fig. 9.4. We start by defining the following function G: {0,1}* → {0,1}: For every string x ∈ {0,1}*, if x satisfies (1) x is a valid representation of some Turing machine M (per the representation scheme above) and (2) when the machine M is executed on the input x it halts and produces an output, then we define G(x) as the first bit of this output. Otherwise (i.e., if x is not a valid representation of a Turing machine, or the machine M never halts on x) we define G(x) = 0. We define F*(x) = 1 − G(x).

We claim that there is no Turing machine that computes F*. Indeed, suppose, towards the sake of contradiction, there exists a machine M that computes F*, and let x be the binary string that represents the machine M. On one hand, since by our assumption M computes F*, on input x the machine M halts and outputs F*(x). On the other hand, by the definition of F*, since x is the representation of the machine M, F*(x) = 1 − G(x) = 1 − M(x), hence yielding a contradiction. ■

Figure 9.4: We construct an uncomputable function by defining for every two strings x, y the value 1 − M_y(x), which equals 0 if the machine M_y described by y outputs 1 on x, and 1 otherwise. We then define F*(x) to be the "diagonal" of this table, namely F*(x) = 1 − M_x(x) for every x. The function F* is uncomputable, because if it was computable by some machine whose string description is x* then we would get that M_{x*}(x*) = F*(x*) = 1 − M_{x*}(x*).

Big Idea 12: There are some functions that cannot be computed by any algorithm.

P The proof of Theorem 9.5 is short but subtle. I suggest that you pause here and go back to read it again and think about it - this is a proof that is worth reading at least twice if not three or four times. It is not often the case that a few lines of mathematical reasoning establish a deeply profound fact - that there are problems we simply cannot solve.

The type of argument used to prove Theorem 9.5 is known as diagonalization since it can be described as defining a function based on the diagonal entries of a table as in Fig. 9.4. The proof can be thought of as an infinite version of the counting argument we used for showing the lower bound for NAND-CIRC programs in Theorem 5.3. Namely, we show that it's not possible to compute all functions from {0,1}* to {0,1} by Turing machines simply because there are more such functions than there are Turing machines.

As mentioned in Remark 7.4, many texts use the "language" terminology and so will call a set L ⊆ {0,1}* an undecidable or non-recursive language if the function F: {0,1}* → {0,1} such that F(x) = 1 ↔ x ∈ L is uncomputable.

9.3 THE HALTING PROBLEM

Theorem 9.5 shows that there is some function that cannot be computed. But is this function the equivalent of the "tree that falls in the forest with no one hearing it"? That is, perhaps it is a function that no one actually wants to compute. It turns out that there are natural uncomputable functions:

Theorem 9.6 — Uncomputability of Halting function. Let HALT: {0,1}* → {0,1} be the function such that for every string (M, x) ∈ {0,1}*, HALT(M, x) = 1 if Turing machine M halts on the input x and HALT(M, x) = 0 otherwise. Then HALT is not computable.

Before turning to prove Theorem 9.6, we note that HALT is a very natural function to want to compute. For example, one can think of HALT as a special case of the task of managing an "App store". That is, given the code of some application, the gatekeeper for the store needs to decide if this code is safe enough to allow in the store or not. At a minimum, it seems that we should verify that the code would not go into an infinite loop.

Proof Idea: One way to think about this proof is as follows:

    Uncomputability of F* + Universality = Uncomputability of HALT    (9.2)

That is, we will use the universal Turing machine that computes EVAL to derive the uncomputability of HALT from the uncomputability of F* shown in Theorem 9.5. Specifically, the proof will be by contradiction. That is, we will assume towards a contradiction that HALT is computable, and use that assumption, together with the universal Turing machine of Theorem 9.1, to derive that F* is computable, which will contradict Theorem 9.5.

⋆

Big Idea 13: If a function F is uncomputable we can show that another function H is uncomputable by giving a way to reduce the task of computing F to computing H.

Proof of Theorem 9.6. The proof will use the previously established result Theorem 9.5. Recall that Theorem 9.5 shows that the following function F*: {0,1}* → {0,1} is uncomputable:

    F*(x) = { 1   if x(x) = 0
            { 0   otherwise        (9.3)

where x(x) denotes the output of the Turing machine described by the string x on the input x (with the usual convention that x(x) = ⊥ if this computation does not halt).

We will show that the uncomputability of F* implies the uncomputability of HALT. Specifically, we will assume, towards a contradiction, that there exists a Turing machine M that can compute the HALT function, and use that to obtain a Turing machine M′ that computes the function F*. (This is known as a proof by reduction, since we reduce the task of computing F* to the task of computing HALT. By the contrapositive, this means the uncomputability of F* implies the uncomputability of HALT.)

Indeed, suppose that M is a Turing machine that computes HALT. Algorithm 9.7 describes a Turing Machine M′ that computes F*. (We use "high level" descriptions of Turing machines, appealing to the "have your cake and eat it too" paradigm, see Big Idea 10.)

Algorithm 9.7 — F* to HALT reduction.

Input: x ∈ {0,1}*
Output: F*(x)

1: # Assume T.M. M_HALT computes HALT
2: Let z ← M_HALT(x, x). # Assume z = HALT(x, x).
3: if z = 0 then
4:   return 0
5: end if
6: Let y ← U(x, x) # U universal TM, i.e., y = x(x)
7: if y = 0 then
8:   return 1
9: end if
10: return 0

We claim that Algorithm 9.7 computes the function F*. Indeed, suppose that x(x) = 0 (and hence F*(x) = 1). In this case, HALT(x, x) = 1 and hence, under our assumption that M(x, x) = HALT(x, x), the value z will equal 1, and hence Algorithm 9.7 will set y = x(x) = 0, and output the correct value 1.

Suppose otherwise that x(x) ≠ 0 (and hence F*(x) = 0). In this case there are two possibilities:

• Case 1: The machine described by x does not halt on the input x. In this case, HALT(x, x) = 0. Since we assume that M computes HALT it means that on input x, x, the machine M must halt and output the value 0. This means that Algorithm 9.7 will set z = 0 and output 0.

• Case 2: The machine described by x halts on the input x and outputs some y′ ≠ 0. In this case, since HALT(x, x) = 1, under our assumptions, Algorithm 9.7 will set y = y′ ≠ 0 and so output 0.

We see that in all cases, M′(x) = F*(x), which contradicts the fact that F* is uncomputable. Hence we reach a contradiction to our original assumption that M computes HALT. ■
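In Python, the reduction of Algorithm 9.7 can be sketched as follows. This is purely illustrative: halt_oracle stands for the assumed machine M_HALT and univ for the universal evaluator U of Theorem 9.1, neither of which exists as an actual subroutine we could call (that is the point of the proof):

def F_star(x, halt_oracle, univ):
    """Computes F*(x) under the (false) assumption that
    halt_oracle(x, x) = HALT(x, x)."""
    if halt_oracle(x, x) == 0:
        return 0              # x(x) does not halt, so x(x) is not 0
    y = univ(x, x)            # safe to evaluate: we know x(x) halts
    return 1 if y == 0 else 0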

P Once again, this is a proof that's worth reading more than once. The uncomputability of the halting problem is one of the fundamental theorems of computer science, and is the starting point for many of the investigations we will see later. An excellent way to get a better understanding of Theorem 9.6 is to go over Section 9.3.2, which presents an alternative proof of the same result.

9.3.1 Is the Halting problem really hard? (discussion)

Many people's first instinct when they see the proof of Theorem 9.6 is to not believe it. That is, most people do believe the mathematical statement, but intuitively it doesn't seem that the Halting problem is really that hard. After all, being uncomputable only means that HALT cannot be computed by a Turing machine. But programmers seem to solve HALT all the time by informally or formally arguing that their programs halt. It's true that their programs are written in C or Python, as opposed to Turing machines, but that makes no difference: we can easily translate back and forth between this model and any other programming language.

While every programmer encounters at some point an infinite loop, is there really no way to solve the halting problem? Some people argue that they personally can, if they think hard enough, determine whether any concrete program that they are given will halt or not. Some have even argued that humans in general have the ability to do that, and hence humans have inherently superior intelligence to computers or anything else modeled by Turing machines.¹

¹ This argument has also been connected to the issues of consciousness and free will. I am personally skeptical of its relevance to these issues. Perhaps the reasoning is that humans have the ability to solve the halting problem but they exercise their free will and consciousness by choosing not to do so.

The best answer we have so far is that there truly is no way to solve HALT, whether using Macs, PCs, quantum computers, humans, or any other combination of electronic, mechanical, and biological devices. Indeed this assertion is the content of the Church-Turing Thesis. This of course does not mean that for every possible program P, it is hard to decide if P enters an infinite loop. Some programs don't even have loops at all (and hence trivially halt), and there are many other far less trivial examples of programs that we can certify to never

enter an infinite loop (or programs that we know for sure will enter such a loop). However, there is no general procedure that would determine for an arbitrary program P whether it halts or not. Moreover, there are some very simple programs for which no one knows whether they halt or not. For example, the following Python program will halt if and only if Goldbach's conjecture is false:

def isprime(p):
    return all(p % i for i in range(2,p-1))

def Goldbach(n):
    return any( (isprime(p) and isprime(n-p))
                for p in range(2,n-1))

n = 4
while True:
    if not Goldbach(n): break
    n += 2

Given that Goldbach’s Conjecture has been open since 1742, it is unclear that humans have any magical ability to say whether this (or other similar programs) will halt or not.

9.3.2 A direct proof of the uncomputability of HALT (optional)

It turns out that we can combine the ideas of the proofs of Theorem 9.5 and Theorem 9.6 to obtain a short proof of the latter theorem, that does not appeal to the uncomputability of F*. This short proof appeared in print in a 1965 letter to the editor written by Christopher Strachey:

To the Editor, The Computer Journal.

Figure 9.5: SMBC's take on solving the Halting problem.

An Impossible Program

Sir,
A well-known piece of folk-lore among programmers holds that it is impossible to write a program which can examine any other program and tell, in every case, if it will terminate or get into a closed loop when it is run. I have never actually seen a proof of this in print, and though Alan Turing once gave me a verbal proof (in a railway carriage on the way to a Conference at the NPL in 1953), I unfortunately and promptly forgot the details. This left me with an uneasy feeling that the proof must be long or complicated, but in fact it is so short and simple that it may be of interest to casual readers. The version below uses CPL, but not in any essential way.

Suppose T[R] is a Boolean function taking a routine (or program) R with no formal or free variables as its argument and that for all R, T[R] = True if R terminates if run and that T[R] = False if R does not terminate. Consider the routine P defined as follows

rec routine P
§L: if T[P] go to L
Return §

If T[P] = True the routine P will loop, and it will only terminate if T[P] = False. In each case T[P] has exactly the wrong value, and this contradiction shows that the function T cannot exist.

Yours faithfully,
C. Strachey
Churchill College, Cambridge

P Try to stop and extract the argument for proving Theorem 9.6 from the letter above.

Since CPL is not as common today, let us reproduce this proof. The idea is the following: suppose for the sake of contradiction that there exists a program T such that T(f,x) equals True iff f halts on input x. (Strachey's letter considers the no-input variant of HALT, but as we'll see, this is an immaterial distinction.) Then we can construct a program P and an input x such that T(P,x) gives the wrong answer. The idea is that on input x, the program P will do the following: run T(x,x), and if the answer is True then go into an infinite loop, and otherwise halt.

Now you can see that T(P,P) will give the wrong answer: if P halts when it gets its own code as input, then T(P,P) is supposed to be True, but then P(P) will go into an infinite loop. And if P does not halt, then T(P,P) is supposed to be False but then P(P) will halt. We can also code this up in Python:

def CantSolveMe(T):
    """
    Gets function T that claims to solve HALT.
    Returns a pair (P,x) of code and input on which
    T(P,x) ≠ HALT(x)
    """
    def fool(x):
        if T(x,x):
            while True: pass
        return "I halted"

    return (fool,fool)

For example, consider the following naive Python program T that guesses that a given function does not halt if its source code contains while or for:

def T(f,x):
    """Crude halting tester - decides f doesn't halt
    if its source contains a loop."""
    import inspect
    source = inspect.getsource(f)
    if "while" in source: return False
    if "for" in source: return False
    return True

If we now set (f,x) = CantSolveMe(T), then T(f,x)=False but f(x) does in fact halt. This is of course not specific to this particular T: for every program T, if we run (f,x) = CantSolveMe(T) then we’ll get an input on which T gives the wrong answer to HALT.

9.4 REDUCTIONS

The Halting problem turns out to be a linchpin of uncomputability, in the sense that Theorem 9.6 has been used to show the uncomputability of a great many interesting functions. We will see several examples of such results in this chapter and the exercises, but there are many more such results (see Fig. 9.6).

Figure 9.6: Some uncomputability results. An arrow from problem X to problem Y means that we use the uncomputability of X to prove the uncomputability of Y by reducing computing X to computing Y. All of these results except for the MRDP Theorem appear in either the text or exercises. The Halting Problem HALT serves as our starting point for all these uncomputability results as well as many others.

The idea behind such uncomputability results is conceptually simple but can at first be quite confusing. If we know that HALT is uncomputable, and we want to show that some other function BLAH is uncomputable, then we can do so via a contrapositive argument (i.e., proof by contradiction). That is, we show that if there exists a Turing machine that computes BLAH then there exists a Turing machine that computes HALT. (Indeed, this is exactly how we showed that HALT itself is uncomputable, by reducing this fact to the uncomputability of the function F* from Theorem 9.5.)

For example, to prove that BLAH is uncomputable, we could show that there is a computable function R: {0,1}* → {0,1}* such that for every pair M and x, HALT(M, x) = BLAH(R(M, x)). The existence of such a function R implies that if BLAH was computable then HALT would be computable as well, hence leading to a contradiction! The confusing part about reductions is that we are assuming something we believe is false (that BLAH has an algorithm) to derive something that we know is false (that HALT has an algorithm). Michael Sipser describes such results as having the form "If pigs could whistle then horses could fly".

A reduction-based proof has two components. For starters, since we need R to be computable, we should describe the algorithm to compute it. The algorithm to compute R is known as a reduction since the transformation R modifies an input to HALT to an input to BLAH, and hence reduces the task of computing HALT to the task of computing BLAH. The second component of a reduction-based proof is the analysis of the algorithm R: namely a proof that R does indeed satisfy the desired properties.

Reduction-based proofs are just like other proofs by contradiction, but the fact that they involve hypothetical algorithms that don't really exist tends to make reductions quite confusing. The one silver lining is that at the end of the day the notion of reductions is mathematically quite simple, and so it's not that bad even if you have to go back to first principles every time you need to remember what is the direction that a reduction should go in.

R Remark 9.8 — Reductions are algorithms. A reduction is an algorithm, which means that, as discussed in Remark 0.3, a reduction has three components:

• Specification (what): In the case of a reduction from HALT to BLAH, the specification is that the function R: {0,1}* → {0,1}* should satisfy that HALT(M, x) = BLAH(R(M, x)) for every Turing machine M and input x. In general, to reduce a function F to G, the reduction R should satisfy F(w) = G(R(w)) for every input w to F.

• Implementation (how): The algorithm's description: the precise instructions how to transform an input w to the output R(w).

• Analysis (why): A proof that the algorithm meets the specification. In particular, in a reduction from F to G this is a proof that for every input w, the output y of the algorithm satisfies that F(w) = G(y).
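The overall shape of such a proof can be captured in a few lines of purely illustrative Python, where R is the computable reduction and G_solver is the hypothetical algorithm for the target function (none of these names come from the text):

def F_solver(w, R, G_solver):
    """If F(w) = G(R(w)) for every w, then any algorithm for G yields
    an algorithm for F; so if F is uncomputable, G_solver cannot exist."""
    return G_solver(R(w))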

9.4.1 Example: Halting on the zero problem

Here is a concrete example for a proof by reduction. We define the function HALTONZERO: {0,1}* → {0,1} as follows. Given any string M, HALTONZERO(M) = 1 if and only if M describes a Turing machine that halts when it is given the string 0 as input. A priori HALTONZERO seems like a potentially easier function to compute than the full-fledged HALT function, and so we could perhaps hope that it is not uncomputable. Alas, the following theorem shows that this is not the case:

Theorem 9.9 — Halting without input. HALTONZERO is uncomputable.

P The proof of Theorem 9.9 is below, but before reading it you might want to pause for a couple of minutes and think how you would prove it yourself. In particular, try to think of what a reduction from HALT to HALTONZERO would look like. Doing so is an excellent way to get some initial comfort with the notion of proofs by reduction, which is a technique we will be using time and again in this book.

Figure 9.7: To prove Theorem 9.9, we show that HALTONZERO is uncomputable by giving a reduction from the task of computing HALT to the task of computing HALTONZERO. This shows that if there was a hypothetical algorithm A computing HALTONZERO, then there would be an algorithm B computing HALT, contradicting Theorem 9.6. Since neither A nor B actually exists, this is an example of an implication of the form "if pigs could whistle then horses could fly".

Proof of Theorem 9.9. The proof is by reduction from HALT, see Fig. 9.7. We will assume, towards the sake of contradiction, that HALTONZERO is computable by some algorithm A, and use this hypothetical algorithm A to construct an algorithm B to compute HALT, hence obtaining a contradiction to Theorem 9.6.

(As discussed in Big Idea 10, following our "have your cake and eat it too" paradigm, we just use the generic name "algorithm" rather than worrying whether we model them as Turing machines, NAND-TM programs, NAND-RAM, etc.; this makes no difference since all these models are equivalent to one another.)

Since this is our first proof by reduction from the Halting problem, we will spell it out in more detail than usual. Such a proof by reduction consists of two steps:

1. Description of the reduction: We will describe the operation of our algorithm B, and how it makes "function calls" to the hypothetical algorithm A.

2. Analysis of the reduction: We will then prove that under the hypothesis that Algorithm A computes HALTONZERO, Algorithm B will compute HALT.

Algorithm 9.10 — HALT to HALTONZERO reduction.

Input: Turing machine M and string x.
Output: Turing machine N_{M,x} such that N_{M,x} halts on 0 iff M halts on x.

1: procedure N_{M,x}(w)  # Description of the T.M. N_{M,x}
2:   return EVAL(M, x)  # Ignore the input w, evaluate M on x.
3: end procedure
4: return N_{M,x}  # We do not execute N_{M,x}: only return its description

Our Algorithm B works as follows: on input M, x, it runs Algorithm 9.10 to obtain a Turing Machine N_{M,x}, and then returns A(N_{M,x}). The machine N_{M,x} ignores its input z and simply runs M on x.

In pseudocode, the program N_{M,x} will look something like the following:

def N(z):
    M = r'......'  # a string constant containing desc. of M
    x = r'......'  # a string constant containing x
    return eval(M,x)  # note that we ignore the input z

That is, if we think of N_{M,x} as a program, then it is a program that contains M and x as "hardwired constants", and given any input z, it simply ignores the input and always returns the result of evaluating M on x. The algorithm B does not actually execute the machine N_{M,x}. B merely writes down the description of N_{M,x} as a string (just as we did above) and feeds this string as input to A.

The above completes the description of the reduction. The analysis is obtained by proving the following claim:

Claim: For every strings M, x, z, the machine N_{M,x} constructed by Algorithm B in Step 1 satisfies that N_{M,x} halts on z if and only if the program described by M halts on the input x.

Proof of Claim: Since N_{M,x} ignores its input and evaluates M on x using the universal Turing machine, it will halt on z if and only if M halts on x.

In particular if we instantiate this claim with the input z = 0 to N_{M,x}, we see that HALTONZERO(N_{M,x}) = HALT(M, x). Thus if the hypothetical algorithm A satisfies A(M) = HALTONZERO(M) for every M, then the algorithm B we construct satisfies B(M, x) = HALT(M, x) for every M, x, contradicting the uncomputability of HALT. ■

R Remark 9.11 — The hardwiring technique. In the proof of Theorem 9.9 we used the technique of "hardwiring" an input x to a program/machine P. That is, modifying a program P so that it uses "hardwired constants" for some or all of its input. This technique is quite common in reductions and elsewhere, and we will often use it again in this course.
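For instance, hardwiring can be implemented by generating source code with the constants spliced in, along the lines of this sketch (the helper name hardwire is ours; EVAL refers to the evaluator from the proof of Theorem 9.1):

def hardwire(M, x):
    """Return the source code of a program that ignores its input z
    and always evaluates the (hardwired) machine M on the
    (hardwired) input x."""
    return (
        "def N(z):\n"
        f"    M = {M!r}  # hardwired description of M\n"
        f"    x = {x!r}  # hardwired input\n"
        "    return EVAL(M, x)  # the input z is ignored\n"
    )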

9.5 RICE’S THEOREM AND THE IMPOSSIBILITY OF GENERAL SOFTWARE VERIFICATION

The uncomputability of the Halting problem turns out to be a special case of a much more general phenomenon. Namely, that we cannot certify semantic properties of general purpose programs. "Semantic properties" mean properties of the function that the program computes, as opposed to properties that depend on the particular syntax used by the program.

An example for a semantic property of a program P is the property that whenever P is given an input string with an even number of 1's, it outputs 0. Another example is the property that P will always halt whenever the input ends with a 1. In contrast, the property that a C program contains a comment before every function declaration is not a semantic property, since it depends on the actual source code as opposed to the input/output relation.

Checking semantic properties of programs is of great interest, as it corresponds to checking whether a program conforms to a specification. Alas it turns out that such properties are in general uncomputable. We have already seen some examples of uncomputable semantic functions, namely HALT and HALTONZERO, but these are just the "tip of the iceberg". We start by observing one more such example:

Theorem 9.12 — Computing all zero function. Let ZEROFUNC: {0,1}* → {0,1} be the function such that for every M ∈ {0,1}*, ZEROFUNC(M) = 1 if and only if M represents a Turing machine such that M outputs 0 on every input x ∈ {0,1}*. Then ZEROFUNC is uncomputable.

P Despite the similarity in their names, ZEROFUNC and HALTONZERO are two different functions. For example, if M is a Turing machine that on input x ∈ {0,1}*, halts and outputs the OR of all of x's coordinates, then HALTONZERO(M) = 1 (since M does halt on the input 0) but ZEROFUNC(M) = 0 (since M does not compute the constant zero function).

Proof of Theorem 9.12. The proof is by reduction to HALTONZERO. Suppose, towards the sake of contradiction, that there was an algorithm A such that A(M) = ZEROFUNC(M) for every M ∈ {0,1}*. Then we will construct an algorithm B that solves HALTONZERO, contradicting Theorem 9.9.

Given a Turing machine N (which is the input to HALTONZERO), our Algorithm B does the following:

1. Construct a Turing Machine M which on input x ∈ {0,1}*, first runs N(0) and then outputs 0.

2. Return A(M).

Now if N halts on the input 0 then the Turing machine M computes the constant zero function, and hence under our assumption that A computes ZEROFUNC, A(M) = 1. If N does not halt on the input 0, then the Turing machine M will not halt on any input, and so in particular will not compute the constant zero function. Hence under our assumption that A computes ZEROFUNC, A(M) = 0. We see that in both cases, ZEROFUNC(M) = HALTONZERO(N), and hence the value A(M) that Algorithm B returns in step 2 is equal to HALTONZERO(N), which is what we needed to prove. ■
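Step 1 is another instance of hardwiring. Treating N as a callable for simplicity, the construction can be sketched in Python as follows (the helper name make_M is ours, not from the text):

def make_M(N):
    """Return a machine M that computes the constant-zero function
    if N halts on the input 0, and never halts otherwise."""
    def M(x):
        N(0)        # never returns if N does not halt on zero
        return 0    # reached only if N(0) halted
    return M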

Another result along similar lines is the following:

Theorem 9.13 — Uncomputability of verifying parity. The following function is uncomputable:

    COMPUTES-PARITY(P) = { 1   if P computes the parity function
                         { 0   otherwise                            (9.4)

P We leave the proof of Theorem 9.13 as an exercise (Exercise 9.6). I strongly encourage you to stop here and try to solve this exercise.

9.5.1 Rice's Theorem

Theorem 9.13 can be generalized far beyond the parity function. In fact, this generalization rules out verifying any type of semantic specification on programs. We define a semantic specification on programs to be some property that does not depend on the code of the program but just on the function that the program computes.

For example, consider the following two C programs

int First(int n) {
    if (n<0) return 0;
    return 2*n;
}

int Second(int n) {
    int i = 0;
    int j = 0;
    if (n<0) return 0;
    while (j<n) {
        i = i + 2;
        j = j + 1;
    }
    return i;
}

First and Second are two distinct C programs, but they compute the same function. A semantic property would be either true for both programs or false for both programs, since it depends on the function the programs compute and not on their code. An example for a semantic property that both First and Second satisfy is the following: "The program P computes a function f mapping integers to integers satisfying that f(n) ≥ n for every input n".

A property is not semantic if it depends on the source code rather than the input/output behavior. For example, properties such as "the program contains the variable k" or "the program uses the while operation" are not semantic. Such properties can be true for one of the programs and false for others. Formally, we define semantic properties as follows:

Definition 9.14 — Semantic properties. A pair of Turing machines M and M′ are functionally equivalent if for every x ∈ {0,1}*, M(x) = M′(x). (In particular, M(x) = ⊥ iff M′(x) = ⊥ for all x.)

A function F: {0,1}* → {0,1} is semantic if for every pair of strings M, M′ that represent functionally equivalent Turing machines, F(M) = F(M′). (Recall that we assume that every string represents some Turing machine, see Remark 9.3.)

There are two trivial examples of semantic functions: the constant one function and the constant zero function. For example, if Z is the constant zero function (i.e., Z(M) = 0 for every M) then clearly Z(M) = Z(M′) for every pair of functionally equivalent Turing machines M and M′. Here is a non-trivial example:

Solved Exercise 9.1 — ZEROFUNC is semantic. Prove that the function ZEROFUNC is semantic. ■

Solution: Recall that ZEROFUNC(M) = 1 if and only if M(x) = 0 for every x ∈ {0,1}*. If M and M′ are functionally equivalent, then for every x, M(x) = M′(x). Hence ZEROFUNC(M) = 1 if and only if ZEROFUNC(M′) = 1. ■

Often the properties of programs that we are most interested in computing are the semantic ones, since we want to understand the programs' functionality. Unfortunately, Rice's Theorem tells us that these properties are all uncomputable:

Theorem 9.15 — Rice's Theorem. Let F: {0,1}* → {0,1}. If F is semantic and non-trivial then it is uncomputable.

Proof Idea: The idea behind the proof is to show that every semantic non-trivial function F is at least as hard to compute as HALTONZERO. This will conclude the proof since by Theorem 9.9, HALTONZERO is uncomputable. If a function F is non-trivial then there are two machines M_0 and M_1 such that F(M_0) = 0 and F(M_1) = 1. So, the goal would be to take a machine N and find a way to map it into a machine M = R(N), such that (i) if N halts on zero then M is functionally equivalent to M_1 and (ii) if N does not halt on zero then M is functionally equivalent to M_0.

Because F is semantic, if we achieved this, then we would be guaranteed that HALTONZERO(N) = F(R(N)), and hence would show that if F was computable, then HALTONZERO would be computable as well, contradicting Theorem 9.9. ⋆

Proof of Theorem 9.15. We will not give the proof in full formality, but rather illustrate the proof idea by restricting our attention to a particular semantic function F. However, the same techniques generalize to all possible semantic functions. Define MONOTONE: {0,1}* → {0,1} as follows: MONOTONE(M) = 1 if there does not exist n ∈ ℕ and two inputs x, x′ ∈ {0,1}^n such that x_i ≤ x′_i for every i ∈ [n] but M(x) outputs 1 and M(x′) = 0. That is, MONOTONE(M) = 1 if it's not possible to find an input x such that flipping some bits of x from 0 to 1 will change M's output in the other direction from 1 to 0. We will prove that MONOTONE is uncomputable, but the proof will easily generalize to any semantic function.

We start by noting that MONOTONE is neither the constant zero nor the constant one function:

• The machine INF that simply goes into an infinite loop on every input satisfies MONOTONE(INF) = 1, since INF is not defined anywhere and so in particular there are no two inputs x, x′ where x_i ≤ x′_i for every i but INF(x) = 0 and INF(x′) = 1.

• The machine PAR that computes the XOR or parity of its input is not monotone (e.g., PAR(1,1,0,0,…,0) = 0 but PAR(1,0,0,…,0) = 1) and hence MONOTONE(PAR) = 0.

(Note that INF and PAR are machines and not functions.)

We will now give a reduction from HALTONZERO to MONOTONE. That is, we assume towards a contradiction that there exists an algorithm A that computes MONOTONE and we will build an algorithm B that computes HALTONZERO. Our algorithm B will work as follows:

Algorithm B:
Input: String N describing a Turing machine. (Goal: Compute HALTONZERO(N))
Assumption: Access to Algorithm A to compute MONOTONE.
Operation:

1. Construct the following machine M: "On input z ∈ {0,1}* do: (a) Run N(0), (b) Return PAR(z)".

2. Return 1 − A(M).

To complete the proof we need to show that B outputs the correct answer, under our assumption that A computes MONOTONE. In other words, we need to show that HALTONZERO(N) = 1 − MONOTONE(M).

Suppose that N does not halt on zero. In this case the program M constructed by Algorithm B enters into an infinite loop in step (a) and will never reach step (b). Hence in this case M is functionally equivalent to INF. (The machine M is not the same machine as INF: its description or code is different. But it does have the same input/output behavior (in this case) of never halting on any input. Also, while the program M will go into an infinite loop on every input, Algorithm B never actually runs M: it only produces its code and feeds it to A. Hence Algorithm B will not enter into an infinite loop even in this case.) Thus in this case, MONOTONE(M) = MONOTONE(INF) = 1.

If N does halt on zero, then step (a) in M will eventually conclude and M's output will be determined by step (b), where it simply outputs the parity of its input. Hence in this case, M computes the non-monotone parity function (i.e., is functionally equivalent to PAR), and so we get that MONOTONE(M) = MONOTONE(PAR) = 0. In both cases, MONOTONE(M) = 1 − HALTONZERO(N), which is what we wanted to prove.

An examination of this proof shows that we did not use anything about MONOTONE beyond the fact that it is semantic and non-trivial. For every semantic non-trivial F, we can use the same proof, replacing PAR and INF with two machines M_0 and M_1 such that F(M_0) = 0 and F(M_1) = 1. Such machines must exist if F is non-trivial. ■
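The machine M constructed in step 1 can again be sketched in Python (illustrative only, with N treated as a callable):

def make_M(N):
    """If N halts on zero, M is functionally equivalent to PAR;
    otherwise M never halts on any input, just like INF."""
    def M(z):
        N(0)                                # step (a): run N on zero
        return sum(int(b) for b in z) % 2   # step (b): parity of z
    return M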

R Remark 9.16 — Semantic is not the same as uncomputable. Rice's Theorem is so powerful and such a popular way of proving uncomputability that people sometimes get confused and think that it is the only way to prove uncomputability. In particular, a common misconception is that if a function F is not semantic then it is computable. This is not at all the case.

For example, consider the following function HALTNOYALE: {0,1}* → {0,1}. This is a function that on input a string that represents a NAND-TM program P, outputs 1 if and only if both (i) P halts on the input 0, and (ii) the program P does not contain a variable with the identifier Yale. The function HALTNOYALE is clearly not semantic, as it will output two different values when given as input one of the following two functionally equivalent programs:

Yale[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Yale[0])

and

Harvard[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Harvard[0])

However, HALTNOYALE is uncomputable, since every program P can be transformed into an equivalent (and in fact improved :)) program P′ that does not contain the variable Yale. Hence if we could compute HALTNOYALE then we could determine halting on zero for NAND-TM programs (and hence for Turing machines as well).

Moreover, as we will see in Chapter 11, there are uncomputable functions whose inputs are not programs, and hence for which the adjective "semantic" is not applicable. Properties such as "the program contains the variable Yale" are sometimes known as syntactic properties. The terms "semantic" and "syntactic" are used beyond the realm of programming languages: a famous example of a syntactically correct but semantically meaningless sentence in English is Chomsky's "Colorless green ideas sleep furiously." However, formally defining "syntactic properties" is rather subtle and we will not use this terminology in this book, sticking to the terms "semantic" and "non-semantic" only.

9.5.2 Halting and Rice's Theorem for other Turing-complete models

As we saw before, many natural computational models turn out to be equivalent to one another, in the sense that we can transform a "program" of one model (such as a λ expression, or a game-of-life configuration) into another model (such as a NAND-TM program). This equivalence implies that we can translate the uncomputability of the Halting problem for NAND-TM programs into uncomputability for Halting in other models. For example:

Theorem 9.17 — NAND-TM Machine Halting. Let NANDTMHALT: {0,1}* → {0,1} be the function that on input strings P ∈ {0,1}* and x ∈ {0,1}* outputs 1 if the NAND-TM program described by P halts on the input x, and outputs 0 otherwise. Then NANDTMHALT is uncomputable.

P Once again, this is a good point for you to stop and try to prove the result yourself before reading the proof below.

Proof. We have seen in Theorem 7.11 that for every Turing machine M, there is an equivalent NAND-TM program P_M such that for every x, P_M(x) = M(x). In particular this means that HALT(M, x) = NANDTMHALT(P_M, x).

The transformation M ↦ P_M that is obtained from the proof of Theorem 7.11 is constructive. That is, the proof yields a way to compute the map M ↦ P_M. This means that this proof yields a reduction from the task of computing HALT to the task of computing NANDTMHALT, which means that since HALT is uncomputable, neither is NANDTMHALT. ■

The same proof carries over to other computational models such as the λ calculus, two dimensional (or even one-dimensional) cellular automata, etc. Hence for example, there is no algorithm to decide if a λ expression evaluates the identity function, and no algorithm to decide whether an initial configuration of the game of life will result in eventually coloring the cell (0, 0) black or not. Indeed, we can generalize Rice's Theorem to all these models. For example, if F: {0,1}* → {0,1} is a non-trivial function such that F(P) = F(P′) for every functionally equivalent NAND-TM programs P, P′ then F is uncomputable, and the same holds for NAND-RAM programs, λ-expressions, and all other Turing complete models (as defined in Definition 8.5), see also Exercise 9.12.

9.5.3 Is software verification doomed? (discussion)

Programs are increasingly being used for mission critical purposes, whether it's running our banking system, flying planes, or monitoring nuclear reactors. If we can't even give a certification algorithm that a program correctly computes the parity function, how can we ever be assured that a program does what it is supposed to do? The key insight is that while it is impossible to certify that a general program conforms with a specification, it is possible to write a program in the first place in a way that will make it easier to certify. As a trivial example, if you write a program without loops, then you can certify that it halts. Also, while it might not be possible to certify that an arbitrary program computes the parity function, it is quite possible to write a particular program P for which we can mathematically prove that P computes the parity. In fact, writing programs or algorithms

and providing proofs for their correctness is what we do all the time in algorithms research.

The field of software verification is concerned with verifying that given programs satisfy certain conditions. These conditions can be that the program computes a certain function, that it never writes into a dangerous memory location, that it respects certain invariants, and others. While the general tasks of verifying this may be uncomputable, researchers have managed to do so for many interesting cases, especially if the program is written in the first place in a formalism or programming language that makes verification easier. That said, verification, especially of large and complex programs, remains a highly challenging task in practice as well, and the number of programs that have been formally proven correct is still quite small. Moreover, even phrasing the right theorem to prove (i.e., the specification) is often a highly non-trivial endeavor.

Figure 9.8: The set R of computable Boolean functions (Definition 7.3) is a proper subset of the set of all functions mapping {0,1}* to {0,1}. In this chapter we saw a few examples of elements in the latter set that are not in the former.

Chapter Recap

• There is a universal Turing machine (or NAND-TM program) U such that on input a description of a Turing machine M and some input x, U(M, x) halts and outputs M(x) if (and only if) M halts on input x. Unlike in the case of finite computation (i.e., NAND-CIRC programs / circuits), the input to the program U can be a machine M that has more states than U itself.

• Unlike the finite case, there are actually functions that are inherently uncomputable in the sense that they cannot be computed by any Turing machine.

• These include not only some "degenerate" or "esoteric" functions but also functions that people deeply care about and conjectured could be computed.

• If the Church-Turing thesis holds then a function F that is uncomputable according to our definition cannot be computed by any means in our physical world.

9.6 EXERCISES

Exercise 9.1 — NAND-RAM Halt. Let NANDRAMHALT: {0,1}* → {0,1} be the function such that on input (P, x) where P represents a NAND-RAM program, NANDRAMHALT(P, x) = 1 iff P halts on the input x. Prove that NANDRAMHALT is uncomputable. ■

Exercise 9.2 — Timed halting. Let TIMEDHALT: {0,1}* → {0,1} be the function that on input (a string representing) a triple (M, x, T), TIMEDHALT(M, x, T) = 1 iff the Turing machine M, on input x, halts within at most T steps (where a step is defined as one sequence of reading a symbol from the tape, updating the state, writing a new symbol and (potentially) moving the head). Prove that TIMEDHALT is computable. ■

Exercise 9.3 — Space halting (challenging). Let SPACEHALT: {0,1}* → {0,1} be the function that on input (a string representing) a triple (M, x, T), SPACEHALT(M, x, T) = 1 iff the Turing machine M, on input x, halts before its head reached the T-th location of its tape. (We don't care how many steps M makes, as long as the head stays inside locations {0, …, T−1}.) Prove that SPACEHALT is computable. See footnote for hint.² ■

² A machine with alphabet Σ can have at most |Σ|^T choices for the contents of the first T locations of its tape. What happens if the machine repeats a previously seen configuration, in the sense that the tape contents, the head location, and the current state, are all identical to what they were in some previous state of the execution?

Exercise 9.4 — Computable compositions. Suppose that F: {0,1}* → {0,1} and G: {0,1}* → {0,1} are computable functions. For each one of the following functions H, either prove that H is necessarily computable or give an example of a pair F and G of computable functions such that H will not be computable. Prove your assertions.

1. H(x) = 1 iff F(x) = 1 OR G(x) = 1.

2. H(x) = 1 iff there exist two nonempty strings u, v ∈ {0,1}* such that x = uv (i.e., x is the concatenation of u and v), F(u) = 1 and G(v) = 1.

3. H(x) = 1 iff there exists a list u_0, …, u_{t−1} of nonempty strings such that F(u_i) = 1 for every i ∈ [t] and x = u_0 u_1 ⋯ u_{t−1}.

4. H(x) = 1 iff x is a valid string representation of a NAND++ program P such that for every z ∈ {0,1}*, on input z the program P outputs F(z).

5. H(x) = 1 iff x is a valid string representation of a NAND++ program P such that on input x the program P outputs F(x).

6. H(x) = 1 iff x is a valid string representation of a NAND++ program P such that on input x, P outputs F(x) after executing at most 100·|x|² lines. ■

Exercise 9.5 Prove that the following function FINITE: {0,1}* → {0,1} is uncomputable. On input P ∈ {0,1}*, we define FINITE(P) = 1 if and only if P is a string that represents a NAND++ program such that there is only a finite number of inputs x ∈ {0,1}* s.t. P(x) = 1.³ ■

³ Hint: You can use Rice's Theorem.

Exercise 9.6 — Computing parity. Prove Theorem 9.13 without using Rice's Theorem. ■

Exercise 9.7 — TM Equivalence. Let EQ: {0,1}* → {0,1} be the function defined as follows: given a string representing a pair (M, M′) of Turing machines, EQ(M, M′) = 1 iff M and M′ are functionally equivalent as per Definition 9.14. Prove that EQ is uncomputable. Note that you cannot use Rice's Theorem directly, as this theorem only deals with functions that take a single Turing machine as input, and EQ takes two machines. ■

Exercise 9.8 For each of the following two functions, say whether it is computable or not:

1. Given a NAND-TM program P, an input x, and a number k, when we run P on x, does the index variable i ever reach k?

2. Given a NAND-TM program P, an input x, and a number k, when we run P on x, does P ever write to an array at index k? ■

Exercise 9.9 Let F: {0,1}* → {0,1} be the function that is defined as follows. On input a string P that represents a NAND-RAM program and a string M that represents a Turing machine, F(P, M) = 1 if and only if there exists some input x such that P halts on x but M does not halt on x. Prove that F is uncomputable. See footnote for hint.⁴ ■

⁴ Hint: While it cannot be applied directly, with a little "massaging" you can prove this using Rice's Theorem.

Exercise 9.10 — Recursively enumerable. Define a function F: {0,1}* → {0,1} to be recursively enumerable if there exists a Turing machine M such that for every x ∈ {0,1}*, if F(x) = 1 then M(x) = 1, and if F(x) = 0 then M(x) = ⊥. (i.e., if F(x) = 0 then M does not halt on x.)

1. Prove that every computable F is also recursively enumerable.

2. Prove that there exists F that is not computable but is recursively enumerable. See footnote for hint.⁵

3. Prove that there exists a function F: {0,1}* → {0,1} such that F is not recursively enumerable. See footnote for hint.⁶

4. Prove that there exists a function F: {0,1}* → {0,1} such that F is recursively enumerable but the function F̄ defined as F̄(x) = 1 − F(x) is not recursively enumerable. See footnote for hint.⁷ ■

⁵ HALT has this property.

⁶ You can either use the diagonalization method to prove this directly or show that the set of all recursively enumerable functions is countable.

⁷ HALT has this property: show that if both HALT and F̄ = 1 − HALT were recursively enumerable then HALT would in fact be computable.

Exercise 9.11 — Rice's Theorem: standard form. In this exercise we will prove Rice's Theorem in the form that it is typically stated in the literature. For a Turing machine M, define L(M) ⊆ {0,1}* to be the set of all x ∈ {0,1}* such that M halts on the input x and outputs 1. (The set L(M) is known in the literature as the language recognized by M. Note that M might either output a value other than 1 or not halt at all on inputs x ∉ L(M).)

1. Prove that for every Turing Machine M, if we define F_M: {0,1}* → {0,1} to be the function such that F_M(x) = 1 iff x ∈ L(M), then F_M is recursively enumerable as defined in Exercise 9.10.

2. Use Theorem 9.15 to prove that for every G: {0,1}* → {0,1}, if (a) G is neither the constant zero nor the constant one function, and (b) for every M, M′ such that L(M) = L(M′), G(M) = G(M′), then G is uncomputable. See footnote for hint.⁸ ■

⁸ Show that any G satisfying (b) must be semantic.

Exercise 9.12 — Rice's Theorem for general Turing-equivalent models (optional). Let ℱ be the set of all partial functions from {0,1}* to {0,1} and let ℳ: {0,1}* → ℱ be a Turing-equivalent model as defined in Definition 8.5. We define a function F: {0,1}* → {0,1} to be ℳ-semantic if there exists some 𝒢: ℱ → {0,1} such that F(P) = 𝒢(ℳ(P)) for every P ∈ {0,1}*.

Prove that for every ℳ-semantic F: {0,1}* → {0,1} that is neither the constant one nor the constant zero function, F is uncomputable. ■

Exercise 9.13 — Busy Beaver. In this question we define the NAND-TM variant of the busy beaver function (see Aaronson's 1999 essay, 2017 blog post and 2020 survey [Aar20]; see also Tao's highly recommended presentation on how civilization's scientific progress can be measured by the quantities we can grasp).

1. Let T_BB: {0,1}* → ℕ be defined as follows. For every string P ∈ {0,1}*, if P represents a NAND-TM program such that when P is executed on the input 0 it halts within M steps, then T_BB(P) = M. Otherwise (if P does not represent a NAND-TM program, or it is a program that does not halt on 0), T_BB(P) = 0. Prove that T_BB is uncomputable.

2. Let TOWER(n) denote the number 2^(2^(⋯^2)) (that is, a "tower of powers of two" of height n). To get a sense of how fast this function grows, TOWER(1) = 2, TOWER(2) = 2² = 4, TOWER(3) = 2⁴ = 16, TOWER(4) = 2¹⁶ = 65536 and TOWER(5) = 2⁶⁵⁵³⁶, which is about 10^20000. TOWER(6) is already a number that is too big to write even in scientific notation. Define NBB: ℕ → ℕ (for "NAND-TM Busy Beaver") to be the function NBB(n) = max_{P ∈ {0,1}^n} T_BB(P), where T_BB is as defined in Item 1 above. Prove that NBB grows faster than TOWER, in the sense that TOWER(n) = o(NBB(n)). See footnote for hint.⁹ ■

⁹ You will not need to use very specific properties of the TOWER function in this exercise. For example, NBB(n) also grows faster than the Ackerman function.

9.7 BIBLIOGRAPHICAL NOTES

The cartoon of the Halting problem in Fig. 9.1 is taken from Charles Cooper's website. Section 7.2 in [MM11] gives a highly recommended overview of uncomputability. Gödel, Escher, Bach [Hof99] is a classic popular science book that touches on uncomputability and unprovability, and specifically Gödel's Theorem that we will see in Chapter 11. See also the recent book by Holt [Hol18].

The history of the definition of a function is intertwined with the development of mathematics as a field. For many years, a function was identified (as per Euler's quote above) with the means to calculate the output from the input. In the 1800's, with the invention of the Fourier series and with the systematic study of continuity and differentiability, people started looking at more general kinds of functions, but the modern definition of a function as an arbitrary mapping was not yet universally accepted. For example, in 1899 Poincare wrote "we have seen a mass of bizarre functions which appear to be forced to resemble as little as possible honest functions which serve some purpose.

… they are invented on purpose to show that our ancestor's reasoning was at fault, and we shall never get anything more than that out of them". Some of this fascinating history is discussed in [Gra83; Kle91; Lüt02; Gra05].

The existence of a universal Turing machine, and the uncomputability of HALT, was first shown by Turing in his seminal paper [Tur37], though closely related results were shown by Church a year before. These works built on Gödel's 1931 incompleteness theorem that we will discuss in Chapter 11.

Some universal Turing machines with a small alphabet and number of states are given in [Rog96], including a single-tape universal Turing machine with the binary alphabet and with less than 25 states; see also the survey [WN09]. Adam Yedidia has written software to help in producing Turing machines with a small number of states. This is related to the recreational pastime of "Code Golfing", which is about solving a certain computational task using as short a program as possible. Finding "highly complex" small Turing machines is also related to the "Busy Beaver" problem, see Exercise 9.13 and the survey [Aar20].

The diagonalization argument used to prove uncomputability of $F^*$ is derived from Cantor's argument for the uncountability of the reals discussed in Chapter 2.

Christopher Strachey was an English computer scientist and the inventor of the CPL programming language. He was also an early artificial intelligence visionary, programming a computer to play Checkers and even write love letters in the early 1950's, see this New Yorker article and this website.

Rice's Theorem was proven in [Ric53]. It is typically stated in a form somewhat different than what we used, see Exercise 9.11.

We do not discuss in the chapter the concept of recursively enumerable languages, but it is covered briefly in Exercise 9.10. As usual, we use function, as opposed to language, notation.

The cartoon of the Halting problem in Fig. 9.1 is copyright 2019 Charles F. Cooper.

10 Restricted computational models

Learning Objectives:
• See that Turing completeness is not always a good thing.
• Another example of an always-halting formalism: context-free grammars and the simply typed $\lambda$ calculus.
• The pumping lemma for non context-free functions.
• Examples of computable and uncomputable semantic properties of regular expressions and context-free grammars.

“Happy families are all alike; every unhappy family is unhappy in its own way”, Leo Tolstoy (opening of the book “Anna Karenina”).

We have seen that many models of computation are Turing equivalent, including Turing machines, NAND-TM/NAND-RAM programs, standard programming languages such as C/Python/Javascript, as well as other models such as the $\lambda$ calculus and even the game of life. The flip side of this is that for all these models, Rice's theorem (Theorem 9.15) holds as well, which means that every non-trivial semantic property of programs in such a model is uncomputable.

The uncomputability of halting and other semantic specification problems for Turing equivalent models motivates restricted computational models that are (a) powerful enough to capture a set of functions useful for certain applications but (b) weak enough that we can still solve semantic specification problems on them. In this chapter we discuss several such examples.

Big Idea 14: We can use restricted computational models to bypass limitations such as the uncomputability of the Halting problem and Rice's Theorem. Such models can compute only a restricted subclass of functions, but allow us to answer at least some semantic questions on programs.

10.1 TURING COMPLETENESS AS A BUG

We have seen that seemingly simple computational models or systems can turn out to be Turing complete. The following webpage lists several examples of formalisms that "accidentally" turned out to be Turing complete, including supposedly limited languages such as the C preprocessor, CSS, (certain variants of) SQL, sendmail configuration, as well as games such as Minecraft, Super Mario, and the card game


Figure 10.1: Some restricted computational models. We have already seen two equivalent restricted models of computation: regular expressions and deterministic finite automata. We show a more powerful model: context-free grammars. We also present tools to demonstrate that some functions cannot be computed in these models.

"Magic: The Gathering". Turing completeness is not always a good thing, as it means that such formalisms can give rise to arbitrarily complex behavior. For example, the postscript format (a precursor of PDF) is a Turing-complete programming language meant to describe documents for printing. The expressive power of postscript can allow for short descriptions of very complex images, but it also gave rise to some nasty surprises, such as the attacks described in this page, ranging from using infinite loops as a denial of service attack, to accessing the printer's file system.

■ Example 10.1 — The DAO Hack. An interesting recent example of the pitfalls of Turing-completeness arose in the context of the cryptocurrency Ethereum. The distinguishing feature of this currency is the ability to design "smart contracts" using an expressive (and in particular Turing-complete) programming language. In our current "human operated" economy, Alice and Bob might sign a contract to agree that if condition X happens then they will jointly invest in Charlie's company. Ethereum allows Alice and Bob to create a joint venture where Alice and Bob pool their funds together into an account that will be governed by some program $P$ that decides under what conditions it disburses funds from it. For example, one could imagine a piece of code that interacts between Alice, Bob, and some program running on Bob's car that allows Alice to rent out Bob's car without any human intervention or overhead. Specifically Ethereum uses the Turing-complete programming language Solidity, which has a syntax similar to JavaScript.

The flagship of Ethereum was an experiment known as The "Decentralized Autonomous Organization" or The DAO. The idea was to create a smart contract that would create an autonomously run decentralized venture capital fund, without human managers, where shareholders could decide on investment opportunities. The

DAO was at the time the biggest crowdfunding success in history. At its height the DAO was worth 150 million dollars, which was more than ten percent of the total Ethereum market. Investing in the DAO (or entering any other "smart contract") amounts to providing your funds to be run by a computer program, i.e., "code is law", or to use the words with which the DAO described itself: "The DAO is borne from immutable, unstoppable, and irrefutable computer code".

Unfortunately, it turns out that (as we saw in Chapter 9) understanding the behavior of computer programs is quite a hard thing to do. A hacker (or perhaps, some would say, a savvy investor) was able to fashion an input that caused the DAO code to enter into an infinite recursive loop in which it continuously transferred funds into the hacker's account, thereby cleaning out about 60 million dollars out of the DAO. While this transaction was "legal" in the sense that it complied with the code of the smart contract, it was obviously not what the humans who wrote this code had in mind.

The Ethereum community struggled with the response to this attack. Some tried the "Robin Hood" approach of using the same loophole to drain the DAO funds into a secure account, but it only had limited success. Eventually, the Ethereum community decided that the code can be mutable, stoppable, and refutable. Specifically, the Ethereum maintainers and miners agreed on a "hard fork" (also known as a "bailout") to revert history to before the hacker's transaction occurred. Some community members strongly opposed this decision, and so an alternative currency called Ethereum Classic was created that preserved the original history.

10.2 CONTEXT FREE GRAMMARS

If you have ever written a program, you've experienced a syntax error. You probably also had the experience of your program entering into an infinite loop. What is less likely is that the compiler or interpreter entered an infinite loop while trying to figure out if your program has a syntax error.

When a person designs a programming language, they need to determine its syntax. That is, the designer decides which strings correspond to valid programs, and which ones do not (i.e., which strings contain a syntax error). To ensure that a compiler or interpreter always halts when checking for syntax errors, language designers typically do not use a general Turing-complete mechanism to express their syntax. Rather they use a restricted computational model. One of the most popular choices for such models is context free grammars.

To explain context free grammars, let us begin with a canonical example. Consider the function ARITH $: \Sigma^* \to \{0,1\}$ that takes as input a string $x$ over the alphabet $\Sigma = \{(,),+,-,\times,\div,0,1,2,3,4,5,6,7,8,9\}$ and returns $1$ if and only if the string $x$ represents a valid arithmetic expression. Intuitively, we build expressions by applying an operation such as $+$, $-$, $\times$, or $\div$ to smaller expressions, or enclosing them in parenthesis, where the "base case" corresponds to expressions that are simply numbers. More precisely, we can make the following definitions:

• A digit is one of the symbols $0, 1, 2, 3, 4, 5, 6, 7, 8, 9$.

• A number is a sequence of digits. (For simplicity we drop the condition that the sequence does not have a leading zero, though it is not hard to encode it in a context-free grammar as well.)

• An operation is one of $+, -, \times, \div$.

• An expression has either the form "number", the form "sub-expression1 operation sub-expression2", or the form "(sub-expression1)", where "sub-expression1" and "sub-expression2" are themselves expressions. (Note that this is a recursive definition.)

A context free grammar (CFG) is a formal way of specifying such conditions. A CFG consists of a set of rules that tell us how to generate strings from smaller components. In the above example, one of the rules is "if $exp1$ and $exp2$ are valid expressions, then $exp1 \times exp2$ is also a valid expression"; we can also write this rule using the shorthand $expression \Rightarrow expression \times expression$. As in the above example, the rules of a context-free grammar are often recursive: the rule $expression \Rightarrow expression \times expression$ defines valid expressions in terms of itself. We now formally define context-free grammars:

Definition 10.2 — Context Free Grammar. Let $\Sigma$ be some finite set. A context free grammar (CFG) over $\Sigma$ is a triple $(V, R, s)$ such that:

• $V$, known as the variables, is a set disjoint from $\Sigma$.

• $s \in V$ is known as the initial variable.

• $R$ is a set of rules. Each rule is a pair $(v, z)$ with $v \in V$ and $z \in (\Sigma \cup V)^*$. We often write the rule $(v, z)$ as $v \Rightarrow z$ and say that the string $z$ can be derived from the variable $v$.

■ Example 10.3 — Context free grammar for arithmetic expressions. The example above of well-formed arithmetic expressions can be captured formally by the following context free grammar:

• The alphabet is $\Sigma = \{(,),+,-,\times,\div,0,1,2,3,4,5,6,7,8,9\}$.

• The variables are $V = \{expression, number, digit, operation\}$.

• The rules are the set $R$ containing the following rules:

– The four rules $operation \Rightarrow +$, $operation \Rightarrow -$, $operation \Rightarrow \times$, and $operation \Rightarrow \div$.

– The ten rules $digit \Rightarrow 0$, …, $digit \Rightarrow 9$.

– The rule $number \Rightarrow digit$.

– The rule $number \Rightarrow digit\ number$.

– The rule $expression \Rightarrow number$.

– The rule $expression \Rightarrow expression\ operation\ expression$.

– The rule $expression \Rightarrow (expression)$.

• The starting variable is $expression$.

People use many different notations to write context free grammars. One of the most common notations is the Backus–Naur form. In this notation we write a rule of the form $v \Rightarrow a$ (where $v$ is a variable and $a$ is a string) in the form v := a. If we have several rules of the form $v \Rightarrow a$, $v \Rightarrow b$, and $v \Rightarrow c$ then we can combine them as v := a|b|c. (In words we say that $v$ can derive either $a$, $b$, or $c$.) For example, the Backus-Naur description for the context free grammar of Example 10.3 is the following (using ASCII equivalents for operations):

operation := +|-|*|/
digit := 0|1|2|3|4|5|6|7|8|9
number := digit|digit number
expression := number|expression operation expression|(expression)
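To make the formal definition concrete, one (hypothetical) way to encode this grammar as Python data is a dictionary mapping each variable to its list of rules, where each rule is a tuple of symbols and a symbol is either a terminal character or a variable name; this encoding is our own sketch, not part of the text:

# The arithmetic-expression grammar of Example 10.3 as Python data.
GRAMMAR = {
    "operation":  [("+",), ("-",), ("*",), ("/",)],
    "digit":      [(d,) for d in "0123456789"],
    "number":     [("digit",), ("digit", "number")],
    "expression": [("number",),
                   ("expression", "operation", "expression"),
                   ("(", "expression", ")")],
}
START = "expression"  # the starting variable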

Another example of a context free grammar is the "matching parenthesis" grammar, which can be represented in Backus-Naur form as follows:

match := ""|match match|(match)

A string over the alphabet $\{(,)\}$ can be generated from this grammar (where match is the starting expression and "" corresponds to the empty string) if and only if it consists of a matching set of parentheses.
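As an illustration, here is a short memoized recognizer that decides membership by directly following the grammar's three rules (the function name and overall structure are our own sketch, not the book's):

from functools import lru_cache

@lru_cache(maxsize=None)
def matches(s: str) -> bool:
    # Decides derivability from `match` in the grammar
    # match := "" | match match | (match)
    if s == "":                                    # rule: match := ""
        return True
    if s[0] == "(" and s[-1] == ")" and matches(s[1:-1]):
        return True                                # rule: match := (match)
    # rule: match := match match, over all nontrivial splits of s
    return any(matches(s[:i]) and matches(s[i:]) for i in range(1, len(s)))

assert matches("(()())") and not matches("())(")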

In contrast, by Lemma 6.19 there is no regular expression that matches a string $x$ if and only if $x$ contains a valid sequence of matching parentheses.

10.2.1 Context-free grammars as a computational model

We can think of a context-free grammar over the alphabet $\Sigma$ as defining a function that maps every string $x$ in $\Sigma^*$ to $1$ or $0$ depending on whether $x$ can be generated by the rules of the grammar. We now make this definition formal.

Definition 10.4 — Deriving a string from a grammar. If $G = (V, R, s)$ is a context-free grammar over $\Sigma$, then for two strings $\alpha, \beta \in (\Sigma \cup V)^*$ we say that $\beta$ can be derived in one step from $\alpha$, denoted by $\alpha \Rightarrow_G \beta$, if we can obtain $\beta$ from $\alpha$ by applying one of the rules of $G$. That is, we obtain $\beta$ by replacing in $\alpha$ one occurrence of the variable $v$ with the string $z$, where $v \Rightarrow z$ is a rule of $G$.

We say that $\beta$ can be derived from $\alpha$, denoted by $\alpha \Rightarrow_G^* \beta$, if it can be derived by some finite number of steps. That is, if there are $\alpha_1, \ldots, \alpha_{k-1} \in (\Sigma \cup V)^*$ so that $\alpha \Rightarrow_G \alpha_1 \Rightarrow_G \alpha_2 \Rightarrow_G \cdots \Rightarrow_G \alpha_{k-1} \Rightarrow_G \beta$.

We say that $x \in \Sigma^*$ is matched by $G = (V, R, s)$ if $x$ can be derived from the starting variable $s$ (i.e., if $s \Rightarrow_G^* x$). We define the function computed by $(V, R, s)$ to be the map $\Phi_{V,R,s} : \Sigma^* \to \{0,1\}$ such that $\Phi_{V,R,s}(x) = 1$ iff $x$ is matched by $(V, R, s)$. A function $F : \Sigma^* \to \{0,1\}$ is context free if $F = \Phi_{V,R,s}$ for some CFG $(V, R, s)$. (As in the case of Definition 6.7 we can also use language rather than function notation and say that a language $L \subseteq \Sigma^*$ is context free if the function $F$ such that $F(x) = 1$ iff $x \in L$ is context free.)

A priori it might not be clear that the map $\Phi_{V,R,s}$ is computable, but it turns out that this is the case.

Theorem 10.5 — Context-free grammars always halt. For every CFG $(V, R, s)$ over $\{0,1\}$, the function $\Phi_{V,R,s} : \{0,1\}^* \to \{0,1\}$ is computable.

As usual we restrict attention to grammars over $\{0,1\}$ although the proof extends to any finite alphabet $\Sigma$.

Proof. We only sketch the proof. We start with the observation that we can convert every CFG to an equivalent version in Chomsky normal form, where all rules either have the form $u \to vw$ for variables $u, v, w$, or the form $u \to \sigma$ for a variable $u$ and symbol $\sigma \in \Sigma$, plus potentially the rule $s \to ""$ where $s$ is the starting variable.

The idea behind such a transformation is to simply add new variables as needed, and so for example we can translate a rule such as $v \to u \sigma w$ into the three rules $v \to ur$, $r \to tw$, and $t \to \sigma$.

Using the Chomsky normal form we get a natural recursive algorithm for computing whether $s \Rightarrow_G^* x$ for a given grammar $G$ and string $x$. We simply try all possible guesses for the first rule $s \to uv$ that is used in such a derivation, and then all possible ways to partition $x$ as a concatenation $x = x'x''$. If we guessed the rule and the partition correctly, then this reduces our task to checking whether $u \Rightarrow_G^* x'$ and $v \Rightarrow_G^* x''$, which (as it involves shorter strings) can be done recursively. The base cases are when $x$ is empty or a single symbol, and can be easily handled. ■
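Here is a minimal Python sketch of this recursive algorithm, assuming the grammar is already in Chomsky normal form and encoded as a dictionary from variables to rules (a hypothetical representation of ours); with memoization this is essentially the classic CYK algorithm:

from functools import lru_cache

# Hypothetical CNF encoding: each rule is a 1-tuple (terminal,)
# or a 2-tuple (variable, variable). This example grammar derives
# strings such as "0", "01", and "0011".
RULES = {
    "S": [("A", "B"), ("0",)],
    "A": [("0",)],
    "B": [("S", "C"), ("1",)],
    "C": [("1",)],
}

@lru_cache(maxsize=None)
def derives(var: str, x: str) -> bool:
    # Can `var` derive the nonempty string x? (The empty-string
    # base case is handled separately, as in the proof sketch.)
    for rule in RULES[var]:
        if len(rule) == 1 and rule[0] == x:      # base case: var -> sigma
            return True
        if len(rule) == 2:                       # rule var -> u v:
            u, v = rule                          # guess the split of x
            if any(derives(u, x[:i]) and derives(v, x[i:])
                   for i in range(1, len(x))):
                return True
    return False

assert derives("S", "0011") and not derives("S", "10")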

Remark 10.6 — Parse trees. While we focus on the task of deciding whether a CFG matches a string, the algorithm to compute $\Phi_{V,R,s}$ actually gives more information than that. That is, on input a string $x$, if $\Phi_{V,R,s}(x) = 1$ then the algorithm yields the sequence of rules that one can apply from the starting vertex $s$ to obtain the final string $x$. We can think of these rules as determining a tree with $s$ being the root vertex and the sinks (or leaves) corresponding to the substrings of $x$ that are obtained by the rules that do not have a variable in their second element. This tree is known as the parse tree of $x$, and often yields very useful information about the structure of $x$. Often the first step in a compiler or interpreter for a programming language is a parser that transforms the source into the parse tree (also known as the abstract syntax tree). There are also tools that can automatically convert a description of a context-free grammar into a parser algorithm that computes the parse tree of a given string. (Indeed, the above recursive algorithm can be used to achieve this, but there are much more efficient versions, especially for grammars that have particular forms, and programming language designers often try to ensure their languages have these more efficient grammars.)

10.2.2 The power of context free grammars

Context free grammars can capture every regular expression:

Theorem 10.7 — Context free grammars and regular expressions. Let $e$ be a regular expression over $\{0,1\}$. Then there is a CFG $(V, R, s)$ over $\{0,1\}$ such that $\Phi_{V,R,s} = \Phi_e$.

Proof. We prove the theorem by induction on the length of $e$. If $e$ is an expression of one bit length, then $e = 0$ or $e = 1$, in which case we leave it to the reader to verify that there is a (trivial) CFG that

computes it. Otherwise, we fall into one of the following cases: case 1: $e = e'e''$, case 2: $e = e'|e''$, or case 3: $e = (e')^*$, where in all cases $e', e''$ are shorter regular expressions. By the induction hypothesis, we have grammars $(V', R', s')$ and $(V'', R'', s'')$ that compute $\Phi_{e'}$ and $\Phi_{e''}$ respectively. By renaming of variables, we can also assume without loss of generality that $V'$ and $V''$ are disjoint.

In case 1, we can define the new grammar as follows: we add a new starting variable $s \notin V' \cup V''$ and the rule $s \Rightarrow s's''$. In case 2, we can define the new grammar as follows: we add a new starting variable $s \notin V' \cup V''$ and the rules $s \Rightarrow s'$ and $s \Rightarrow s''$. Case 3 will be the only one that uses recursion. As before we add a new starting variable $s \notin V' \cup V''$, but now add the rule $s \Rightarrow ""$ (i.e., the empty string) and also add, for every rule of the form $(s', \alpha) \in R'$, the rule $s \Rightarrow s\alpha$ to $R$.

We leave it to the reader as (a very good!) exercise to verify that in all three cases the grammars we produce capture the same function as the original expression. ■

It turns out that CFG's are strictly more powerful than regular expressions. In particular, as we've seen, the "matching parenthesis" function MATCHPAREN can be computed by a context free grammar, whereas, as shown in Lemma 6.19, it cannot be computed by regular expressions. Here is another example:

Solved Exercise 10.1 — Context free grammar for palindromes. Let PAL $: \{0,1,;\}^* \to \{0,1\}$ be the function defined in Solved Exercise 6.4 where PAL$(w) = 1$ iff $w$ has the form $u;u^R$. Then PAL can be computed by a context-free grammar. ■

Solution: A simple grammar computing PAL can be described using Backus–Naur notation:

start := ; | 0 start 0 | 1 start 1

One can prove by induction that this grammar generates exactly the strings $w$ such that PAL$(w) = 1$. ■

A more interesting example is computing the strings of the form $u;v$ that are not palindromes:

Solved Exercise 10.2 — Non palindromes. Prove that there is a context free grammar that computes NPAL $: \{0,1,;\}^* \to \{0,1\}$ where NPAL$(w) = 1$ if $w = u;v$ but $v \neq u^R$. ■

Solution: Using Backus–Naur notation we can describe such a grammar as follows

palindrome := ; | 0 palindrome 0 | 1 palindrome 1
different := 0 palindrome 1 | 1 palindrome 0
start := different | 0 start | 1 start | start 0 | start 1

In words, this means that we can characterize a string $w$ such that NPAL$(w) = 1$ as having the following form

$$w = \alpha b u ; u^R b' \beta \tag{10.1}$$

where $\alpha, \beta, u$ are arbitrary strings and $b \neq b'$. Hence we can generate such a string by first generating a palindrome $u;u^R$ (palindrome variable), then adding $0$ on either the left or right and $1$ on the opposite side to get something that is not a palindrome (different variable), and then we can add an arbitrary number of $0$'s and $1$'s on either end (the start variable). ■

10.2.3 Limitations of context-free grammars (optional)

Even though context-free grammars are more powerful than regular expressions, there are some simple languages that are not captured by context free grammars. One tool to show this is the context-free grammar analog of the "pumping lemma" (Theorem 6.20):

Theorem 10.8 — Context-free pumping lemma. Let $(V, R, s)$ be a CFG over $\Sigma$. Then there are some numbers $n_0, n_1 \in \mathbb{N}$ such that for every $x \in \Sigma^*$ with $|x| > n_0$, if $\Phi_{V,R,s}(x) = 1$ then $x = abcde$ such that $|b| + |c| + |d| \leq n_1$, $|b| + |d| \geq 1$, and $\Phi_{V,R,s}(ab^k c d^k e) = 1$ for every $k \in \mathbb{N}$.

P The context-free pumping lemma is even more cumbersome to state than its regular analog, but you can remember it as saying the following: "If a long enough string is matched by a grammar, there must be a variable that is repeated in the derivation."

Proof of Theorem 10.8. We only sketch the proof. The idea is that if the total number of symbols in the rules of the grammar is $k_0$, then the only way to get $|x| > n_0$ with $\Phi_{V,R,s}(x) = 1$ is to use recursion. That is, there must be some variable $v \in V$ such that we are able to derive from $v$ the value $bvd$ for some strings $b, d \in \Sigma^*$, and then further on derive from $v$ some string $c \in \Sigma^*$ such that $bcd$ is a substring of $x$ (in other words, $x = abcde$ for some $a, e \in \{0,1\}^*$). If we take the variable $v$ satisfying this requirement with a minimum number of derivation steps, then we can ensure that $|bcd|$ is at most some constant depending on $n_0$ and we can set $n_1$ to be that constant ($n_1 = 10 \cdot |R| \cdot n_0$ will do, since we will not need more than $|R|$ applications of rules, and each such application can grow the string by at most $n_0$ symbols).

Thus by the definition of the grammar, we can repeat the derivation to replace the substring $bcd$ in $x$ with $b^k c d^k$ for every $k \in \mathbb{N}$ while retaining the property that the output of $\Phi_{V,R,s}$ is still one. Since $bcd$ is a substring of $x$, we can write $x = abcde$ and are guaranteed that $ab^k c d^k e$ is matched by the grammar for every $k$. ■

Using Theorem 10.8 one can show that even the simple function $F : \{0,1\}^* \to \{0,1\}$ defined as follows:

$$F(x) = \begin{cases} 1 & x = ww \text{ for some } w \in \{0,1\}^* \\ 0 & \text{otherwise} \end{cases} \tag{10.2}$$

is not context free. (In contrast, the function $G : \{0,1\}^* \to \{0,1\}$ defined as $G(x) = 1$ iff $x = w_0 w_1 \cdots w_{n-1} w_{n-1} w_{n-2} \cdots w_0$ for some $w \in \{0,1\}^*$ and $n = |w|$ is context free, can you see why?)

Solved Exercise 10.3 — Equality is not context-free. Let EQ $: \{0,1,;\}^* \to \{0,1\}$ be the function such that EQ$(x) = 1$ if and only if $x = u;u$ for some $u \in \{0,1\}^*$. Then EQ is not context free. ■

Solution: We use the context-free pumping lemma. Suppose towards the sake of contradiction that there is a grammar $G$ that computes EQ, let $n_0, n_1$ be the constants obtained from Theorem 10.8, and let $n = \max\{n_0, n_1\}$. Consider the string $x = 1^n 0^n ; 1^n 0^n$, and write it as $x = abcde$ as per Theorem 10.8, with $|bcd| \leq n$ and with $|b| + |d| \geq 1$. By Theorem 10.8, it should hold that EQ$(ace) = 1$. However, by case analysis this can be shown to be a contradiction.

Firstly, unless $b$ is on the left side of the $;$ separator and $d$ is on the right side, dropping $b$ and $d$ will definitely make the two parts different. But if it is the case that $b$ is on the left side and $d$ is on the right side, then by the condition that $|bcd| \leq n$ we know that $b$ is a string of only zeros and $d$ is a string of only ones. If we drop $b$ and $d$ then since one of them is non empty, we get that there are either

fewer zeroes on the left side than on the right side, or fewer ones on the right side than on the left side. In either case, we get that EQ$(ace) = 0$, obtaining the desired contradiction. ■

10.3 SEMANTIC PROPERTIES OF CONTEXT FREE LANGUAGES

As in the case of regular expressions, the limitations of context free grammars do provide some advantages. For example, emptiness of context free grammars is decidable:

Theorem 10.9 — Emptiness for CFG's is decidable. There is an algorithm that on input a context-free grammar $G$, outputs $1$ if and only if $\Phi_G$ is the constant zero function.

Proof Idea: The proof is easier to see if we transform the grammar to Chomsky Normal Form as in Theorem 10.5. Given a grammar $G$, we can recursively define a non-terminal variable $v$ to be non-empty if there is either a rule of the form $v \Rightarrow \sigma$, or there is a rule of the form $v \Rightarrow uw$ where both $u$ and $w$ are non-empty. Then the grammar is non-empty if and only if the starting variable $s$ is non-empty.

⋆

Proof of Theorem 10.9. We assume that the grammar $G$ is in Chomsky Normal Form as in Theorem 10.5. We consider the following procedure for marking variables as "non-empty":

1. We start by marking all variables $v$ that are involved in a rule of the form $v \Rightarrow \sigma$ as non-empty.

2. We then continue to mark $v$ as non-empty if it is involved in a rule of the form $v \Rightarrow uw$ where $u, w$ have been marked before.

We continue this way until we cannot mark any more variables. We then declare that the grammar is empty if and only if $s$ has not been marked. To see why this is a valid algorithm, note that if a variable $v$ has been marked as "non-empty" then there is some string $\alpha \in \Sigma^*$ that can be derived from $v$. On the other hand, if $v$ has not been marked, then every sequence of derivations from $v$ will always have a variable that has not been replaced by alphabet symbols. Hence in particular $\Phi_G$ is the all zero function if and only if the starting variable $s$ is not marked "non-empty". ■
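The marking procedure is easy to implement; here is a Python sketch using the same hypothetical CNF dictionary encoding as above (an illustration of the proof, not code from the text):

def is_empty(rules: dict, start: str) -> bool:
    # Returns True iff no string can be derived from `start`.
    marked = set()          # variables known to derive some string
    changed = True
    while changed:          # iterate the marking rules to a fixed point
        changed = False
        for v, v_rules in rules.items():
            if v in marked:
                continue
            for rule in v_rules:
                if len(rule) == 1 or (rule[0] in marked and rule[1] in marked):
                    marked.add(v)    # v can derive some string
                    changed = True
                    break
    return start not in marked

assert is_empty({"S": [("S", "S")]}, "S")            # S never terminates
assert not is_empty({"S": [("S", "S"), ("0",)]}, "S")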

10.3.1 Uncomputability of context-free grammar equivalence (optional)

By analogy to regular expressions, one might have hoped to get an algorithm for deciding whether two given context free grammars are equivalent. Alas, no such luck. It turns out that the equivalence problem for context free grammars is uncomputable. This is a direct corollary of the following theorem:

Theorem 10.10 — Fullness of CFG's is uncomputable. For every set $\Sigma$, let CFGFULL$_\Sigma$ be the function that on input a context-free grammar $G$ over $\Sigma$, outputs $1$ if and only if $G$ computes the constant $1$ function. Then there is some finite $\Sigma$ such that CFGFULL$_\Sigma$ is uncomputable.

Theorem 10.10 immediately implies that equivalence for context-free grammars is uncomputable, since computing "fullness" of a grammar $G$ over some alphabet $\Sigma = \{\sigma_0, \ldots, \sigma_{k-1}\}$ corresponds to checking whether $G$ is equivalent to the grammar $s \Rightarrow ""|s\sigma_0|\cdots|s\sigma_{k-1}$.

Note that Theorem 10.10 and Theorem 10.9 together imply that context-free grammars, unlike regular expressions, are not closed under complement. (Can you see why?) Since we can encode every element of $\Sigma$ using $\lceil \log |\Sigma| \rceil$ bits (and this finite encoding can be easily carried out within a grammar), Theorem 10.10 implies that fullness is also uncomputable for grammars over the binary alphabet.

Proof Idea: We prove the theorem by reducing from the Halting problem. To do that we use the notion of configurations of NAND-TM programs, as defined in Definition 8.8. Recall that a configuration of a program $P$ is a binary string $s$ that encodes all the information about the program in the current iteration.

We define $\Sigma$ to be $\{0,1\}$ plus some separator characters, and define INVALID$_P : \Sigma^* \to \{0,1\}$ to be the function that maps every string $L \in \Sigma^*$ to $1$ if and only if $L$ does not encode a sequence of configurations that correspond to a valid halting history of the computation of $P$ on the empty input.

The heart of the proof is to show that INVALID$_P$ is context-free. Once we do that, we see that $P$ does not halt on the empty input if and only if INVALID$_P(L) = 1$ for every $L$. To show that, we will encode the list in a special way that makes it amenable to deciding via a context-free grammar. Specifically we will reverse all the odd-numbered strings.

⋆

Proof of Theorem 10.10. We only sketch the proof. We will show that if we can compute CFGFULL then we can solve HALTONZERO, which has been proven uncomputable in Theorem 9.9. Let $M$ be an input

Turing machine for HALTONZERO. We will use the notion of configurations of a Turing machine, as defined in Definition 8.8. Recall that a configuration of Turing machine $M$ and input $x$ captures the full state of $M$ at some point of the computation. The particular details of configurations are not so important, but what you need to remember is that:

• A configuration can be encoded by a binary string $\sigma \in \{0,1\}^*$.

• The initial configuration of $M$ on the input $0$ is some fixed string.

• A halting configuration will have the value of a certain state (which can be easily "read off" from it) set to $1$.

• If $\sigma$ is a configuration at some step $i$ of the computation, we denote by NEXT$_M(\sigma)$ the configuration at the next step. NEXT$_M(\sigma)$ is a string that agrees with $\sigma$ on all but a constant number of coordinates (those encoding the position corresponding to the head position and the two adjacent ones). On those coordinates, the value of NEXT$_M(\sigma)$ can be computed by some finite function.

We will let the alphabet $\Sigma = \{0,1\} \cup \{\|, \#\}$. A computation history of $M$ on the input $0$ is a string $L \in \Sigma^*$ that corresponds to a list $\|\sigma_0\#\sigma_1\|\sigma_2\#\sigma_3 \cdots \sigma_{t-2}\|\sigma_{t-1}\#$ (i.e., $\|$ comes before an even numbered block, and $\#$ comes before an odd numbered one) such that if $i$ is even then $\sigma_i$ is the string encoding the configuration of $M$ on input $0$ at the beginning of its $i$-th iteration, and if $i$ is odd then it is the same except the string is reversed. (That is, for odd $i$, $rev(\sigma_i)$ encodes the configuration of $M$ on input $0$ at the beginning of its $i$-th iteration.) Reversing the odd-numbered blocks is a technical trick to ensure that the function INVALID$_M$ we define below is context free.

We now define INVALID$_M : \Sigma^* \to \{0,1\}$ as follows:

$$\text{INVALID}_M(L) = \begin{cases} 0 & L \text{ is a valid computation history of } M \text{ on } 0 \\ 1 & \text{otherwise} \end{cases} \tag{10.3}$$

We will show the following claim:

CLAIM: INVALID$_M$ is context-free.

The claim implies the theorem. Since $M$ halts on $0$ if and only if there exists a valid computation history, INVALID$_M$ is the constant one function if and only if $M$ does not halt on $0$. In particular, this allows us to reduce determining whether $M$ halts on $0$ to determining whether the grammar $G_M$ corresponding to INVALID$_M$ is full.

We now turn to the proof of the claim. We will not show all the details, but the main point is that INVALID$_M(L) = 1$ if at least one of the following three conditions hold:

1. $L$ is not of the right format, i.e., not of the form $\langle$binary-string$\rangle\#\langle$binary-string$\rangle\|\langle$binary-string$\rangle\# \cdots$.

2. $L$ contains a substring of the form $\|\sigma\#\sigma'\|$ such that $\sigma' \neq rev(\text{NEXT}_M(\sigma))$.

3. $L$ contains a substring of the form $\#\sigma\|\sigma'\#$ such that $\sigma' \neq \text{NEXT}_M(rev(\sigma))$.

Since context-free functions are closed under the OR operation, the claim will follow if we show that we can verify conditions 1, 2 and 3 via a context-free grammar.

For condition 1 this is very simple: checking that $L$ is of the correct format can be done using a regular expression. Since regular expressions are closed under negation, this means that checking that $L$ is not of this format can also be done by a regular expression and hence by a context-free grammar.

For conditions 2 and 3, this follows via very similar reasoning to that showing that the function $F$ such that $F(u\#v) = 1$ iff $u \neq rev(v)$ is context-free, see Solved Exercise 10.2. After all, the NEXT$_M$ function only modifies its input in a constant number of places. We leave filling out the details as an exercise to the reader. Since INVALID$_M(L) = 1$ if and only if $L$ satisfies one of the conditions 1., 2. or 3., and all three conditions can be tested for via a context-free grammar, this completes the proof of the claim and hence the theorem. ■

10.4 SUMMARY OF SEMANTIC PROPERTIES FOR REGULAR EXPRESSIONS AND CONTEXT-FREE GRAMMARS

To summarize, we can often trade expressiveness of the model for amenability to analysis. If we consider computational models that are not Turing complete, then we are sometimes able to bypass Rice’s The- orem and answer certain semantic questions about programs in such models. Here is a summary of some of what is known about semantic questions for the different models we have seen.

Table 10.1: Computability of semantic properties

Model                     Halting        Emptiness      Equivalence
Regular expressions       Computable     Computable     Computable
Context free grammars     Computable     Computable     Uncomputable
Turing-complete models    Uncomputable   Uncomputable   Uncomputable

Chapter Recap

• The uncomputability of the Halting problem for general models motivates the definition of restricted computational models.
• In some restricted models we can answer semantic questions such as: does a given program terminate, or do two programs compute the same function?
• Regular expressions are a restricted model of computation that is often useful to capture tasks of string matching. We can test efficiently whether an expression matches a string, as well as answer questions such as Halting and Equivalence.
• Context free grammars are a stronger, yet still not Turing complete, model of computation. The halting problem for context free grammars is computable, but equivalence is not computable.

10.5 EXERCISES

Exercise 10.1 — Closure properties of context-free functions. Suppose that $F, G : \{0,1\}^* \to \{0,1\}$ are context free. For each one of the following definitions of the function $H$, either prove that $H$ is always context free or give a counterexample for regular $F, G$ that would make $H$ not context free.

1. $H(x) = F(x) \vee G(x)$.

2. $H(x) = F(x) \wedge G(x)$.

3. $H(x) = \text{NAND}(F(x), G(x))$.

4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1}x_{n-2} \cdots x_0$ for $n = |x|$.

5. $H(x) = \begin{cases} 1 & x = uv \text{ s.t. } F(u) = G(v) = 1 \\ 0 & \text{otherwise} \end{cases}$

6. $H(x) = \begin{cases} 1 & x = uu \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$

7. $H(x) = \begin{cases} 1 & x = uu^R \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$

■

Exercise 10.2 Prove that the function $F : \{0,1\}^* \to \{0,1\}$ such that $F(x) = 1$ if and only if $|x|$ is a power of two is not context free. ■

Exercise 10.3 — Syntax for programming languages. Consider the following syntax of a “programming language” whose source can be written using the ASCII character set:

• Variables are obtained by a sequence of letters, numbers and underscores, but can't start with a number.

• A statement has either the form foo = bar; where foo and bar are variables, or the form IF (foo) BEGIN ... END where ... is a list of one or more statements, potentially separated by newlines.

A program in our language is simply a sequence of statements (possibly separated by newlines or spaces).

1. Let VAR $: \{0,1\}^* \to \{0,1\}$ be the function that given a string $x \in \{0,1\}^*$, outputs $1$ if and only if $x$ corresponds to an ASCII encoding of a valid variable identifier. Prove that VAR is regular.

2. Let SYN $: \{0,1\}^* \to \{0,1\}$ be the function that given a string $s \in \{0,1\}^*$, outputs $1$ if and only if $s$ is an ASCII encoding of a valid program in our language. Prove that SYN is context free. (You do not have to specify the full formal grammar for SYN, but you need to show that such a grammar exists.)

3. Prove that SYN is not regular. (Hint: Try to see if you can "embed" in some way a function that looks similar to MATCHPAREN in SYN, so you can use a similar proof. Of course for a function to be non-regular, it does not need to utilize literal parentheses symbols.)

■

10.6 BIBLIOGRAPHICAL NOTES

As in the case of regular expressions, there are many resources available that cover context-free grammars in great detail. Chapter 2 of [Sip97] contains many examples of context-free grammars and their properties. There are also websites such as Grammophone where you can input grammars, and see what strings they generate, as well as some of the properties that they satisfy.

The adjective "context free" is used for CFG's because a rule of the form $v \Rightarrow a$ means that we can always replace the variable $v$ with the string $a$, no matter what is the context in which $v$ appears. More generally, we might want to consider cases where the replacement rules depend on the context. This gives rise to the notion of general (aka "Type 0") grammars that allow rules of the form $a \Rightarrow b$ where both $a$ and $b$ are strings over $(V \cup \Sigma)^*$. The idea is that if, for example, we wanted to enforce the condition that we only apply some rule such as $v \Rightarrow 0w1$ when $v$ is surrounded by three zeroes on both sides, then we could do so by adding a rule of the form $000v000 \Rightarrow 0000w1000$ (and of course we can add much more general conditions). Alas, this generality

comes at a cost - general grammars are Turing complete and hence their halting problem is uncomputable. That is, there is no algorithm $A$ that can determine for every general grammar $G$ and a string $x$, whether or not the grammar $G$ generates $x$.

The Chomsky Hierarchy is a hierarchy of grammars from the least restrictive (most powerful) Type 0 grammars, which correspond to recursively enumerable languages (see Exercise 9.10), to the most restrictive Type 3 grammars, which correspond to regular languages. Context-free languages correspond to Type 2 grammars. Type 1 grammars are context sensitive grammars. These are more powerful than context-free grammars but still less powerful than Turing machines. In particular functions/languages corresponding to context-sensitive grammars are always computable, and in fact can be computed by a linear bounded automaton, which is a non-deterministic algorithm that takes $O(n)$ space. For this reason, the class of functions/languages corresponding to context-sensitive grammars is also known as the complexity class NSPACE$(O(n))$; we discuss space-bounded complexity in Chapter 17. While Rice's Theorem implies that we cannot compute any non-trivial semantic property of Type 0 grammars, the situation is more complex for other types of grammars: some semantic properties can be determined and some cannot, depending on the grammar's place in the hierarchy.

11 Is every theorem provable?

Learning Objectives:
• More examples of uncomputable functions that are not as tied to computation.
• Gödel's incompleteness theorem - a result that shook the world of mathematics in the early 20th century.

"Take any definite unsolved problem, such as … the existence of an infinite number of prime numbers of the form $2^n + 1$. However unapproachable these problems may seem to us and however helpless we stand before them, we have, nevertheless, the firm conviction that their solution must follow by a finite number of purely logical processes…"

"…This conviction of the solvability of every mathematical problem is a powerful incentive to the worker. We hear within us the perpetual call: There is the problem. Seek its solution. You can find it by pure reason, for in mathematics there is no ignorabimus.", David Hilbert, 1900.

“The meaning of a statement is its method of verification.”, Moritz Schlick, 1938 (aka “The verification principle” of logical positivism)

The problems shown uncomputable in Chapter 9, while natural and important, still intimately involved NAND-TM programs or other computing mechanisms in their definitions. One could perhaps hope that as long as we steer clear of functions whose inputs are themselves programs, we can avoid the "curse of uncomputability". Alas, we have no such luck.

In this chapter we will see an example of a natural and seemingly "computation free" problem that nevertheless turns out to be uncomputable: solving Diophantine equations. As a corollary, we will see one of the most striking results of 20th century mathematics: Gödel's Incompleteness Theorem, which showed that there are some mathematical statements (in fact, in number theory) that are inherently unprovable. We will actually start with the latter result, and then show the former.

11.1 HILBERT'S PROGRAM AND GÖDEL'S INCOMPLETENESS THEOREM

"And what are these …vanishing increments? They are neither finite quantities, nor quantities infinitely small, nor yet nothing. May we not call them the ghosts of departed quantities?", George Berkeley, Bishop of Cloyne, 1734.


Figure 11.1: Outline of the results of this chapter. One version of Gödel’s Incompleteness Theorem is an immediate consequence of the uncomputability of the Halting problem. To obtain the theorem as originally stated (for statements about the integers) we first prove that the QMS problem of determining truth of quantified statements involving both integers and strings is uncomputable. We do so using the notion of Turing Machine configurations but there are alternative approaches to do so as well, see Remark 11.14.

The 1700's and 1800's were a time of great discoveries in mathematics but also of several crises. The discovery of calculus by Newton and Leibniz in the late 1600's ushered in a golden age of problem solving. Many longstanding challenges succumbed to the new tools that were discovered, and mathematicians got ever better at doing some truly impressive calculations. However, the rigorous foundations behind these calculations left much to be desired. Mathematicians manipulated infinitesimal quantities and infinite series cavalierly, and while most of the time they ended up with the correct results, there were a few strange examples (such as trying to calculate the value of the infinite series $1 - 1 + 1 - 1 + 1 + \ldots$) which seemed to give out different answers depending on the method of calculation. This led to a growing sense of unease in the foundations of the subject which was addressed in the works of mathematicians such as Cauchy, Weierstrass, and Riemann, who eventually placed analysis on firmer foundations, giving rise to the $\epsilon$'s and $\delta$'s that students taking honors calculus grapple with to this day.

In the beginning of the 20th century, there was an effort to replicate this effort, in greater rigor, to all parts of mathematics. The hope was to show that all the true results of mathematics can be obtained by starting with a number of axioms, and deriving theorems from them using logical rules of inference. This effort was known as the Hilbert program, named after the influential mathematician David Hilbert.

Alas, it turns out the results we've seen dealt a devastating blow to this program, as was shown by Kurt Gödel in 1931:

Theorem 11.1 — Gödel's Incompleteness Theorem: informal version. For every sound proof system $V$ for sufficiently rich mathematical statements, there is a mathematical statement that is true but is not provable in $V$.

11.1.1 Defining "Proof Systems"

Before proving Theorem 11.1, we need to define "proof systems" and even formally define the notion of a "mathematical statement". Proof systems are often defined by starting with some basic assumptions or axioms and then deriving more statements by using inference rules such as the famous Modus Ponens, but what axioms shall we use? What rules? We will use an extremely general notion of proof systems, not even restricting ourselves to ones that have the form of axioms and inference.

Mathematical statements. At the highest level, a mathematical statement is simply a piece of text, which we can think of as a string $x \in \{0,1\}^*$. Mathematical statements contain assertions whose truth does not depend on any empirical fact, but rather only on properties of abstract objects. For example, the following is a mathematical statement (which happens to be a false statement):

"The number 2,696,635,869,504,783,333,238,805,675,613,588,278,597,832,162,617,892,474,670,798,113 is prime".

Mathematical statements do not have to involve numbers. They can assert properties of any other mathematical object including sets, strings, functions, graphs and yes, even programs. Thus, another example of a mathematical statement is the following (it is unknown whether this statement is true or false):

The following Python function halts on every positive integer n

def f(n):
    if n == 1:
        return 1
    return f(3*n + 1) if n % 2 else f(n // 2)

Proof systems. A proof for a statement $x \in \{0,1\}^*$ is another piece of text $w \in \{0,1\}^*$ that certifies the truth of the statement asserted in $x$. The conditions for a valid proof system are:

1. (Effectiveness) Given a statement $x$ and a proof $w$, there is an algorithm to verify whether or not $w$ is a valid proof for $x$. (For example, by going line by line and checking that each line follows from the preceding ones using one of the allowed inference rules.)

2. (Soundness) If there is a valid proof $w$ for $x$ then $x$ is true.

These are quite minimal requirements for a proof system. Requirement 2 (soundness) is the very definition of a proof system: you shouldn't be able to prove things that are not true. Requirement 1 is also essential. If there is no set of rules (i.e., an algorithm) to check that a proof is valid then in what sense is it a proof system? We could

replace it with a system where the "proof" for a statement $x$ is "trust me: it's true". We formally define proof systems as an algorithm $V$ where $V(x, w) = 1$ holds if the string $w$ is a valid proof for the statement $x$. Even if $x$ is true, the string $w$ does not have to be a valid proof for it (there are plenty of wrong proofs for true statements such as 4=2+2) but if $w$ is a valid proof for $x$ then $x$ must be true.

Definition 11.2 — Proof systems. Let $\mathcal{T} \subseteq \{0,1\}^*$ be some set (which we consider the "true" statements). A proof system for $\mathcal{T}$ is an algorithm $V$ that satisfies:

1. (Effectiveness) For every $x, w \in \{0,1\}^*$, $V(x, w)$ halts with an output of either $0$ or $1$.

2. (Soundness) For every $x \notin \mathcal{T}$ and $w \in \{0,1\}^*$, $V(x, w) = 0$.

A true statement $x \in \mathcal{T}$ is unprovable (with respect to $V$) if for every $w \in \{0,1\}^*$, $V(x, w) = 0$. We say that $V$ is complete if there does not exist a true statement $x$ that is unprovable with respect to $V$.

Big Idea 15: A proof is just a string of text whose meaning is given by a verification algorithm.
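To make Definition 11.2 concrete, here is a toy proof system sketched in Python (our own illustration, not from the text): the "true" statements are strings of the form "n is composite", and a proof is a nontrivial divisor of n. The verifier is sound by construction, and for this particularly simple set of statements it also happens to be complete.

def V(x: str, w: str) -> int:
    # Statements have the form "<n> is composite"; a valid proof w
    # is a nontrivial divisor of n. Soundness: V(x, w) = 1 only if
    # n really is composite, so false statements have no proofs.
    if not x.endswith(" is composite"):
        return 0                          # not a well-formed statement
    try:
        n = int(x[: -len(" is composite")])
        d = int(w)
    except ValueError:
        return 0                          # malformed number or proof
    return 1 if 1 < d < n and n % d == 0 else 0

assert V("91 is composite", "7") == 1
assert V("97 is composite", "5") == 0     # 97 is prime; no proof exists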

11.2 GÖDEL’S INCOMPLETENESS THEOREM: COMPUTATIONAL VARIANT

Our first formalization of Theorem 11.1 involves statements about Turing machines. We let $\mathcal{H}$ be the set of strings $x \in \{0,1\}^*$ that have the form "Turing machine $M$ halts on the zero input".

Theorem 11.3 — Gödel's Incompleteness Theorem: computational variant. There does not exist a complete proof system for $\mathcal{H}$.

Proof Idea: If we had such a complete and sound proof system then we could solve the HALTONZERO problem. On input a Turing machine $M$, we would search all purported proofs $w$ and halt as soon as we find a proof of either "$M$ halts on zero" or "$M$ does not halt on zero". If the system is sound and complete then we will eventually find such a proof, and it will provide us with the correct output.

⋆


Proof of Theorem 11.3. Assume for the sake of contradiction that there was such a proof system $V$. We will use $V$ to build an algorithm $A$ that computes HALTONZERO, hence contradicting Theorem 9.9. Our algorithm $A$ will work as follows:

Algorithm 11.4 — Halting from proofs.

Input: Turing machine $M$
Output: $1$ if $M$ halts on the input $0$; $0$ otherwise.

1: for $n = 1, 2, 3, \ldots$ do
2:   for $w \in \{0,1\}^n$ do
3:     if $V($"$M$ halts on $0$"$, w) = 1$ then
4:       return $1$
5:     end if
6:     if $V($"$M$ does not halt on $0$"$, w) = 1$ then
7:       return $0$
8:     end if
9:   end for
10: end for

If $M$ halts on $0$ then under our assumption there exists $w$ that proves this fact, and so when Algorithm $A$ reaches $n = |w|$ we will eventually find this $w$ and output $1$, unless we already halted before. But we cannot halt before and output a wrong answer because it would contradict the soundness of the proof system. Similarly, this shows that if $M$ does not halt on $0$ then (since we assume there is a proof of this fact too) our algorithm $A$ will eventually halt and output $0$. ■
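For concreteness, here is the search of Algorithm 11.4 as a Python sketch; it assumes some verifier V(x, w) as in Definition 11.2 and a string representation of statements about the machine M (both hypothetical names of ours). If V is sound and complete for such statements the search terminates with the correct answer; with an incomplete V it may loop forever.

from itertools import count, product

def halts_on_zero(M: str, V) -> int:
    # Enumerate all candidate proofs w in order of increasing length,
    # checking each as a proof of halting or of non-halting.
    for n in count(1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            if V(f"Turing machine {M} halts on the zero input", w) == 1:
                return 1
            if V(f"Turing machine {M} does not halt on the zero input", w) == 1:
                return 0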

Remark 11.5 — The Gödel statement (optional). One can extract from the proof of Theorem 11.3 a procedure that for every proof system $V$, yields a true statement $x^*$ that cannot be proven in $V$. But Gödel's proof gave a very explicit description of such a statement $x^*$ which is closely related to the "Liar's paradox". That is, Gödel's statement $x^*$ was designed to be true if and only if $\forall_{w \in \{0,1\}^*} V(x^*, w) = 0$. In other words, it satisfied the following property:

$$x^* \text{ is true} \;\Leftrightarrow\; x^* \text{ does not have a proof in } V \tag{11.1}$$

One can see that if $x^*$ is true, then it does not have a proof, but if it is false then (assuming the proof system is sound) it cannot have a proof, and hence $x^*$ must be both true and unprovable. One might wonder how it is possible to come up with an $x^*$ that satisfies a condition such as (11.1) where the same string $x^*$ appears on both the righthand side and the lefthand side of the equation. The idea is that the proof of Theorem 11.3 yields a way to transform every statement $x$ into a statement $F(x)$ that is true if and only if $x$ does not have a proof in $V$. Thus $x^*$ needs to be a fixed point of $F$: a sentence such that $x^* = F(x^*)$. It turns out that we can always find such a fixed point of $F$. We've already seen this phenomenon in the $\lambda$ calculus, where the $Y$ combinator maps every $F$ into a fixed point $YF$ of $F$. This is very related to the idea of programs that can print their own code. Indeed, Scott Aaronson likes to describe Gödel's statement as follows:

The following sentence repeated twice, the second time in quotes, is not provable in the formal system $V$. "The following sentence repeated twice, the second time in quotes, is not provable in the formal system $V$."

In the argument above we actually showed that $x^*$ is true, under the assumption that $V$ is sound. Since $x^*$ is true and does not have a proof in $V$, this means that we cannot carry the above argument in the system $V$, which means that $V$ cannot prove its own soundness (or even consistency: that there is no proof of both a statement and its negation). Using this idea, it's not hard to get Gödel's second incompleteness theorem, which says that every sufficiently rich $V$ cannot prove its own consistency. That is, if we formalize the statement $c^*$ that is true if and only if $V$ is consistent (i.e., $V$ cannot prove both a statement and the statement's negation), then $c^*$ cannot be proven in $V$.

11.3 QUANTIFIED INTEGER STATEMENTS

There is something "unsatisfying" about Theorem 11.3. Sure, it shows there are statements that are unprovable, but they don't feel like "real" statements about math. After all, they talk about programs rather than numbers, matrices, or derivatives, or whatever it is they teach in math courses. It turns out that we can get an analogous result for statements that only talk about natural numbers, such as statements asserting the existence or non-existence of positive integer solutions to concrete polynomial equations. It doesn't get much more "real math" than this. Indeed, the 19th century mathematician Leopold Kronecker famously said that "God made the integers, all else is the work of man." (By the way, for some simple concrete statements of this form, their truth is unknown to this day.)

To make this more precise, let us define the notion of quantified integer statements:

Definition 11.6 — Quantified integer statements. A quantified integer statement is a well-formed statement with no unbound variables involving integers, variables, the operators $>, <, \times, +, -, =$, the logical operations $\neg$ (NOT), $\wedge$ (AND), and $\vee$ (OR), as well as quantifiers of the form $\exists_{x \in \mathbb{N}}$ and $\forall_{y \in \mathbb{N}}$ where $x, y$ are variable names.

We often care deeply about determining the truth of quantified integer statements. For example, the statement that Fermat's Last Theorem is true for $n = 3$ can be phrased as the quantified integer statement

$$\neg \exists_{a \in \mathbb{N}} \exists_{b \in \mathbb{N}} \exists_{c \in \mathbb{N}} (a > 0) \wedge (b > 0) \wedge (c > 0) \wedge (a \times a \times a + b \times b \times b = c \times c \times c) \;. \tag{11.2}$$

The twin prime conjecture, that states that there is an infinite number of numbers $p$ such that both $p$ and $p + 2$ are primes, can be phrased as the quantified integer statement

$$\forall_{n \in \mathbb{N}} \exists_{p \in \mathbb{N}} (p > n) \wedge \text{PRIME}(p) \wedge \text{PRIME}(p + 2) \tag{11.3}$$

where we replace an instance of PRIME$(q)$ with the statement $(q > 1) \wedge \forall_{a \in \mathbb{N}} \forall_{b \in \mathbb{N}} (a = 1) \vee (a = q) \vee \neg(a \times b = q)$.

The claim (mentioned in Hilbert's quote above) that there are infinitely many primes of the form $p = 2^n + 1$ can be phrased as follows:

$$\forall_{n \in \mathbb{N}} \exists_{p \in \mathbb{N}} (p > n) \wedge \text{PRIME}(p) \wedge \left(\forall_{k \in \mathbb{N}} (k \neq 2 \wedge \text{PRIME}(k)) \Rightarrow \neg\text{DIVIDES}(k, p - 1)\right) \tag{11.4}$$

where DIVIDES$(a, b)$ is the statement $\exists_{c \in \mathbb{N}}\; a \times c = b$. In English, this corresponds to the claim that for every $n$ there is some $p > n$ such that all of $p - 1$'s prime factors are equal to $2$.

Remark 11.7 — Syntactic sugar for quantified integer statements. To make our statements more readable, we often use syntactic sugar and so write $x \neq y$ as shorthand for $\neg(x = y)$, and so on. Similarly, the "implication operator" $a \Rightarrow b$ is "syntactic sugar" or shorthand for $\neg a \vee b$, and the "if and only if operator" $a \Leftrightarrow b$ is shorthand for $(a \Rightarrow b) \wedge (b \Rightarrow a)$. We will also allow ourselves the use of "macros": plugging in one quantified integer statement in another, as we did with DIVIDES and PRIME above.
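The quantifier-free part of such statements is easy to evaluate mechanically; all the difficulty lies in the unbounded quantifiers. For instance, here is a sketch that searches for a counterexample to (11.2) up to a finite bound; finding none (as Euler already knew for this equation) of course proves nothing about the full statement:

def fermat3_counterexample(bound: int):
    # Bounded search for positive a, b, c with a^3 + b^3 = c^3.
    # Such a search can refute (11.2) but can never confirm it.
    for a in range(1, bound + 1):
        for b in range(1, bound + 1):
            for c in range(1, bound + 1):
                if a**3 + b**3 == c**3:
                    return (a, b, c)
    return None

assert fermat3_counterexample(50) is None   # no small counterexamples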

Much of number theory is concerned with determining the truth of quantified integer statements. Since our experience has been that, given enough time (which could sometimes be several centuries) humanity has managed to do so for the statements that it cared enough about, one could (as Hilbert did) hope that eventually we would be able to prove or disprove all such statements. Alas, this turns out to be impossible:

Theorem 11.8 — Gödel's Incompleteness Theorem for quantified integer statements. Let $V : \{0,1\}^* \to \{0,1\}$ be a computable purported verification procedure for quantified integer statements. Then either:

• $V$ is not sound: There exists a false statement $x$ and a string $w \in \{0,1\}^*$ such that $V(x, w) = 1$,

or

• $V$ is not complete: There exists a true statement $x$ such that for every $w \in \{0,1\}^*$, $V(x, w) = 0$.

Theorem 11.8 is a direct corollary of the following result, just as Theorem 11.3 was a direct corollary of the uncomputability of HALTONZERO:

Theorem 11.9 — Uncomputability of quantified integer statements. Let QIS $: \{0,1\}^* \to \{0,1\}$ be the function that given a (string representation of) a quantified integer statement outputs $1$ if it is true and $0$ if it is false. Then QIS is uncomputable.

Since a quantified integer statement is simply a sequence of symbols, we can easily represent it as a string. For simplicity we will assume that every string represents some quantified integer statement, by mapping strings that do not correspond to such a statement to an arbitrary statement such as $\exists_{x \in \mathbb{N}}\; x = 1$.

푥∈ℕ ∃ 푥 = 1 P Please stop here and make sure you understand why the uncomputability of QIS (i.e., Theorem 11.9) means that there is no sound and complete proof system for proving quantified integer statements (i.e., Theorem 11.8). This follows in the same way that Theorem 11.3 followed from the uncomputability of HALTONZERO, but working out the details is a great exercise (see Exercise 11.1)

In the rest of this chapter, we will show the proof of Theorem 11.8, following the outline illustrated in Fig. 11.1.

11.4 DIOPHANTINE EQUATIONS AND THE MRDP THEOREM

Many of the functions people wanted to compute over the years involved solving equations. These have a much longer history than mechanical computers. The Babylonians already knew how to solve some quadratic equations in 2000BC, and the formula for all quadratics appears in the Bakhshali Manuscript that was composed in India around the 3rd century. During the Renaissance, Italian mathematicians discovered generalizations of these formulas for cubic and quartic (degrees $3$ and $4$) equations. Many of the greatest minds of the 17th and 18th century, including Euler, Lagrange, Leibniz and Gauss, worked on the problem of finding such a formula for quintic equations to no avail, until in the 19th century Ruffini, Abel and Galois showed that no such formula exists, along the way giving birth to group theory.

However, the fact that there is no closed-form formula does not mean we cannot solve such equations. People have been solving higher degree equations numerically for ages. The Chinese manuscript Jiuzhang Suanshu from the first century mentions such approaches. Solving polynomial equations is by no means restricted to ancient history or to students' homework. The gradient descent method is the workhorse powering many of the machine learning tools that have revolutionized Computer Science over the last several years.

But there are some equations that we simply do not know how to solve by any means. For example, it took more than 200 years until people succeeded in proving that the equation $a^{11} + b^{11} = c^{11}$ has no solution in integers.³ The notorious difficulty of so-called Diophantine equations (i.e., finding integer roots of a polynomial) motivated the mathematician David Hilbert in 1900 to include the question of finding a general procedure for solving such equations in his famous list of twenty-three open problems for mathematics of the 20th century. I don't think Hilbert doubted that such a procedure exists. After all, the whole history of mathematics up to this point involved the discovery of ever more powerful methods, and even impossibility results such as the inability to trisect an angle with a straightedge and compass, or the non-existence of an algebraic formula for quintic equations, merely pointed out the need to use more general methods.

Figure 11.2: Diophantine equations such as finding a positive integer solution to the equation $a(a+b)(a+c) + b(b+a)(b+c) + c(c+a)(c+b) = 4(a+b)(a+c)(b+c)$ (depicted more compactly and whimsically above) can be surprisingly difficult. There are many equations for which we do not know if they have a solution, and there is no algorithm to solve them in general. The smallest solution for this equation has 80 digits! See this Quora post for more information, including the credits for this image.

³ This is a special case of what's known as "Fermat's Last Theorem", which states that $a^n + b^n = c^n$ has no solution in integers for $n > 2$. This was conjectured in 1637 by Pierre de Fermat but only proven by Andrew Wiles in 1995. The case $n = 11$ (along with all other so-called "regular prime exponents") was established by Kummer in 1850.

Alas, this turned out not to be the case for Diophantine equations. In 1970, Yuri Matiyasevich, building on a decades-long line of work by Martin Davis, Hilary Putnam and Julia Robinson, showed that there is simply no method to solve such equations in general:

Theorem 11.10 — MRDP Theorem. Let DIO $: \{0,1\}^* \to \{0,1\}$ be the function that takes as input a string describing a $100$-variable polynomial with integer coefficients $P(x_0, \ldots, x_{99})$ and outputs $1$ if and only if there exists $z_0, \ldots, z_{99} \in \mathbb{N}$ s.t. $P(z_0, \ldots, z_{99}) = 0$. Then DIO is uncomputable.

As usual, we assume some standard way to express numbers and text as binary strings. The constant $100$ is of course arbitrary; the problem is known to be uncomputable even for polynomials of degree four and at most 58 variables. In fact the number of variables can be reduced to nine, at the expense of the polynomial having a larger (but still constant) degree. See Jones's paper for more about this issue.
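To see concretely what is and is not possible here, the following Python sketch (not from the book) semi-decides DIO by brute-force search: it halts whenever a solution exists, but by the MRDP Theorem no variant of it can also detect that no solution exists. Representing the polynomial as a Python function is an illustrative choice.

from itertools import count, product

def dio_semi_decide(P, k):
    """Search for a zero of the k-variable polynomial P over the naturals
    by enumerating assignments in ever-growing boxes. Halts iff a solution
    exists -- DIO is semi-decidable even though no algorithm decides it."""
    for bound in count(1):
        for z in product(range(bound), repeat=k):
            if P(*z) == 0:
                return z

print(dio_semi_decide(lambda a, b: a - 2 * b - 1, 2))  # finds (1, 0)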

R Remark 11.11 — Active code vs static data. The difficulty in finding a way to distinguish between "code" such as NAND-TM programs, and "static content" such as polynomials is just another manifestation of the phenomenon that code is the same as data. While a fool-proof solution for distinguishing between the two is inherently impossible, finding heuristics that do a reasonable job keeps many firewall and anti-virus manufacturers very busy (and finding ways to bypass these tools keeps many hackers busy as well).

11.5 HARDNESS OF QUANTIFIED INTEGER STATEMENTS

We will not prove the MRDP Theorem (Theorem 11.10). However, as we mentioned, we will prove the uncomputability of QIS (i.e., Theorem 11.9), which is a special case of the MRDP Theorem. The reason is that a Diophantine equation is a special case of a quantified integer statement where the only quantifier is $\exists$. This means that deciding the truth of quantified integer statements is a potentially harder problem than solving Diophantine equations, and so it is potentially easier to prove that QIS is uncomputable.

P If you find the last sentence confusing, it is worthwhile to reread it until you are sure you follow its logic. We are so accustomed to trying to find solutions for problems that it can sometimes be hard to follow the arguments for showing that problems are uncomputable.

Our proof of the uncomputability of QIS (i.e., Theorem 11.9) will, as usual, go by reduction from the Halting problem, but we will do so in two steps:

1. We will first use a reduction from the Halting problem to show that deciding the truth of quantified mixed statements is uncomputable. Quantified mixed statements involve both strings and integers. Since quantified mixed statements are a more general concept than quantified integer statements, it is easier to prove the uncomputabil- ity of deciding their truth.

2. We will then reduce the problem of quantified mixed statements to quantified integer statements.

11.5.1 Step 1: Quantified mixed statements and computation histories
We define quantified mixed statements as statements involving not just integers and the usual arithmetic operators, but also string variables.

Definition 11.12 — Quantified mixed statements. A quantified mixed statement is a well-formed statement with no unbound variables involving integers, variables, the operators $>, <, \times, +, -, =$, the logical operations $\neg$ (NOT), $\wedge$ (AND), and $\vee$ (OR), as well as quantifiers of the form $\exists_{x \in \mathbb{N}}$, $\exists_{a \in \{0,1\}^*}$, $\forall_{y \in \mathbb{N}}$, $\forall_{b \in \{0,1\}^*}$ where $x, y, a, b$ are variable names. These also include the operator $|a|$ which returns the length of a string-valued variable $a$, as well as the operator $a_i$ where $a$ is a string-valued variable and $i$ is an integer-valued expression, which is true if $i$ is smaller than the length of $a$ and the $i^{th}$ coordinate of $a$ is $1$, and is false otherwise.

For example, the true statement that for every string $a$ there is a string $b$ that corresponds to $a$ in reverse order can be phrased as the following quantified mixed statement

$\forall_{a \in \{0,1\}^*} \exists_{b \in \{0,1\}^*} (|a| = |b|) \wedge (\forall_{i \in \mathbb{N}} \; i < |a| \Rightarrow (a_i \Leftrightarrow b_{|a|-i})).$   (11.5)

Quantified mixed statements are more general than quantified integer statements, and so the following theorem is potentially easier to prove than Theorem 11.9:

Theorem 11.13 — Uncomputability of quantified mixed statements. Let QMS $: \{0,1\}^* \to \{0,1\}$ be the function that given a (string representation of) a quantified mixed statement outputs $1$ if it is true and $0$ if it is false. Then QMS is uncomputable.

Proof Idea: The idea behind the proof is similar to that used in showing that one-dimensional cellular automata are Turing complete (Theorem 8.7) as well as showing that equivalence (or even "fullness") of context free grammars is uncomputable (Theorem 10.10). We use the notion of a configuration of a NAND-TM program as in Definition 8.8. Such a configuration can be thought of as a string $\alpha$ over some large-but-finite alphabet $\Sigma$ describing its current state, including the values of all arrays, scalars, and the index variable i. It can be shown that if $\alpha$ is the configuration at a certain step of the execution and $\beta$ is the configuration at the next step, then $\beta_j = \alpha_j$ for all $j$ outside of $\{i-1, i, i+1\}$ where $i$ is the value of i. In particular, every value $\beta_j$ is simply a function of $\alpha_{j-1}, \alpha_j, \alpha_{j+1}$. Using these observations we can write a quantified mixed statement NEXT$(\alpha, \beta)$ that will be true if and only if $\beta$ is the configuration encoding the next step after $\alpha$. Since a program $P$ halts on input $x$ if and only if there is a sequence of configurations $\alpha^0, \ldots, \alpha^{t-1}$ (known as a computation history) starting with the initial configuration with input $x$ and ending in a halting configuration, we can define a quantified mixed statement to determine if there is such a sequence by taking an existential quantifier over all strings $H$ (for history) that encode a tuple $(\alpha^0, \alpha^1, \ldots, \alpha^{t-1})$ and then checking that $\alpha^0$ and $\alpha^{t-1}$ are valid starting and halting configurations, and that NEXT$(\alpha^j, \alpha^{j+1})$ is true for every $j \in \{0, \ldots, t-2\}$. ⋆

Proof of Theorem 11.13. The proof is obtained by a reduction from the Halting problem. Specifically, we will use the notion of a configuration of a Turing Machine (Definition 8.8) that we have seen in the context of proving that one-dimensional cellular automata are Turing complete. We need the following facts about configurations:

• For every Turing Machine $M$, there is a finite alphabet $\Sigma$, and a configuration of $M$ is a string $\alpha \in \Sigma^*$.

• A configuration $\alpha$ encodes all the state of the program at a particular iteration, including the array, scalar, and index variables.

• If $\alpha$ is a configuration, then $\beta = \text{NEXT}_M(\alpha)$ denotes the configuration of the computation after one more iteration. $\beta$ is a string over $\Sigma$ of length either $|\alpha|$ or $|\alpha| + 1$, and every coordinate of $\beta$ is a function of just three coordinates in $\alpha$. That is, for every $j \in \{0, \ldots, |\beta| - 1\}$, $\beta_j = \text{MAP}_M(\alpha_{j-1}, \alpha_j, \alpha_{j+1})$ where $\text{MAP}_M : \Sigma^3 \to \Sigma$ is some function depending on $M$.

• There are simple conditions to check whether a string $\alpha$ is a valid starting configuration corresponding to an input $x$, as well as to check whether a string $\alpha$ is a halting configuration. In particular these conditions can be phrased as quantified mixed statements.

• A program $M$ halts on input $x$ if and only if there exists a sequence of configurations $H = (\alpha^0, \alpha^1, \ldots, \alpha^{T-1})$ such that (i) $\alpha^0$ is a valid starting configuration of $M$ with input $x$, (ii) $\alpha^{T-1}$ is a valid halting configuration of $M$, and (iii) $\alpha^{i+1} = \text{NEXT}_M(\alpha^i)$ for every $i \in \{0, \ldots, T-2\}$.

We can encode such a sequence of configurations as a binary string. For concreteness, we let $\ell = \lceil \log(|\Sigma| + 1) \rceil$ and encode each symbol $\sigma$ in $\Sigma \cup \{;\}$ by a string in $\{0,1\}^\ell$. We use ";" as a "separator" symbol, and so encode $H = (\alpha^0, \alpha^1, \ldots, \alpha^{T-1})$ as the concatenation of the encodings of each configuration, using ";" to separate the encoding of $\alpha^i$ and $\alpha^{i+1}$ for every $i \in [T]$. In particular, for every Turing Machine $M$, $M$ halts on the input $0$ if and only if the following statement $\varphi_M$ is true:

$\exists_{H \in \{0,1\}^*}$ "$H$ encodes a halting configuration sequence starting with input $0$".   (11.6)

If we can encode the statement $\varphi_M$ as a quantified mixed statement then, since $\varphi_M$ is true if and only if HALTONZERO$(M) = 1$, this would reduce the task of computing HALTONZERO to computing QMS, and hence imply (using Theorem 9.9) that QMS is uncomputable, completing the proof. Indeed, $\varphi_M$ can be encoded as a quantified mixed statement for the following reasons:

1. Let $\alpha, \beta \in \{0,1\}^*$ be two strings that encode configurations of $M$. We can define a quantified mixed predicate NEXT$(\alpha, \beta)$ that is true if and only if $\beta = \text{NEXT}_M(\alpha)$ (i.e., $\beta$ encodes the configuration obtained by proceeding from $\alpha$ in one computational step). Indeed NEXT$(\alpha, \beta)$ is true if for every $i \in \{0, \ldots, |\beta|\}$ which is a multiple of $\ell$, $\beta_{i,\ldots,i+\ell-1} = \text{MAP}_M(\alpha_{i-\ell,\cdots,i+2\ell-1})$ where $\text{MAP}_M : \{0,1\}^{3\ell} \to \{0,1\}^\ell$ is the finite function above (identifying elements of $\Sigma$ with their encoding in $\{0,1\}^\ell$). Since $\text{MAP}_M$ is a finite function, we can express it using the logical operations AND, OR, NOT (for example by computing $\text{MAP}_M$ with NANDs).

2. Using the above we can now write the condition that for every substring of $H$ that has the form $\alpha \, \text{ENC}(;) \, \beta$ with $\alpha, \beta \in \{0,1\}^*$ and $\text{ENC}(;)$ being the encoding of the separator ";", it holds that NEXT$(\alpha, \beta)$ is true.

3. Finally, if $\alpha^0$ is a binary string encoding the initial configuration of $M$ on input $0$, checking that the first $|\alpha^0|$ bits of $H$ equal $\alpha^0$ can be expressed using AND, OR, and NOTs. Similarly checking that the last configuration encoded by $H$ corresponds to a state in which $M$ will halt can also be expressed as a quantified statement.

Together the above yields a computable procedure that maps every Turing Machine $M$ into a quantified mixed statement $\varphi_M$ such that HALTONZERO$(M) = 1$ if and only if QMS$(\varphi_M) = 1$. This reduces computing HALTONZERO to computing QMS, and hence the uncomputability of HALTONZERO implies the uncomputability of QMS. ■

R Remark 11.14 — Alternative proofs. There are several other ways to show that QMS is uncomputable. For example, we can express the condition that a 1-dimensional cellular automaton eventually writes a "1" to a given cell from a given initial configuration as a quantified mixed statement over a string encoding the history of all configurations. We can then use the fact that cellular automata can simulate Turing machines (Theorem 8.7) to reduce the halting problem to QMS. We can also use other well known uncomputable problems such as tiling or the Post Correspondence Problem. Exercise 11.5 and Exercise 11.6 explore two alternative proofs of Theorem 11.13.

11.5.2 Step 2: Reducing mixed statements to integer statements
We now show how to prove Theorem 11.9 using Theorem 11.13. The idea is again a proof by reduction. We will show a transformation of every quantified mixed statement $\varphi$ into a quantified integer statement $\xi$ that does not use string-valued variables, such that $\varphi$ is true if and only if $\xi$ is true.

To remove string-valued variables from a statement, we encode every string by a pair of integers. We will show that we can encode a string $x \in \{0,1\}^*$ by a pair of numbers $(X, n) \in \mathbb{N}$ s.t.

• $n = |x|$

• There is a quantified integer statement COORD$(X, i)$ that for every $i < n$, will be true if $x_i = 1$ and will be false otherwise.

This will mean that we can replace a "for all" quantifier over strings such as $\forall_{x \in \{0,1\}^*}$ with a pair of quantifiers over integers of the form $\forall_{X \in \mathbb{N}} \forall_{n \in \mathbb{N}}$ (and similarly replace an existential quantifier of the form $\exists_{x \in \{0,1\}^*}$ with a pair of quantifiers $\exists_{X \in \mathbb{N}} \exists_{n \in \mathbb{N}}$). We can then replace all calls to $|x|$ by $n$ and all calls to $x_i$ by COORD$(X, i)$. This means that if we are able to define COORD via a quantified integer statement, then we obtain a proof of Theorem 11.9, since we can use it to map every quantified mixed statement $\varphi$ to an equivalent quantified integer statement $\xi$ such that $\xi$ is true if and only if $\varphi$ is true, and hence QMS$(\varphi) = $ QIS$(\xi)$. Such a procedure implies that the task of computing QMS reduces to the task of computing QIS, which means that the uncomputability of QMS implies the uncomputability of QIS.

The above shows that the proof of Theorem 11.9 all boils down to finding the right encoding of strings as integers, and the right way to implement COORD as a quantified integer statement. To achieve this we use the following technical result:

Lemma 11.15 — Constructible prime sequence. There is a sequence of prime numbers $p_0 < p_1 < p_2 < \cdots$ such that there is a quantified integer statement PSEQ$(p, i)$ that is true if and only if $p = p_i$.

Using Lemma 11.15 we can encode a string $x \in \{0,1\}^*$ by the numbers $(X, n)$ where $X = \prod_{x_i = 1} p_i$ and $n = |x|$. We can then define the statement COORD$(X, i)$ as

$\text{COORD}(X, i) = \exists_{p \in \mathbb{N}} \text{PSEQ}(p, i) \wedge \text{DIVIDES}(p, X)$   (11.7)

where DIVIDES$(a, b)$, as before, is defined as $\exists_{c \in \mathbb{N}} \; a \times c = b$. Note that indeed if $X, n$ encodes the string $x \in \{0,1\}^*$, then for every $i < n$, COORD$(X, i) = x_i$, since $p_i$ divides $X$ if and only if $x_i = 1$. Thus all that is left to conclude the proof of Theorem 11.9 is to prove Lemma 11.15, which we now proceed to do.

Proof. The sequence of prime numbers we consider is the following: We fix $C$ to be a sufficiently large constant ($C = 2^{34}$ will do) and define $p_i$ to be the smallest prime number that is in the interval $[(i+C)^3 + 1, (i+C+1)^3 - 1]$. It is known that there exists such a prime number for every $i \in \mathbb{N}$. Given this, the definition of PSEQ$(p, i)$ is simple:

$\text{PRIME}(p) \wedge (p > (i+C) \times (i+C) \times (i+C)) \wedge (p < (i+C+1) \times (i+C+1) \times (i+C+1)) \wedge (\forall_{p'} \neg\text{PRIME}(p') \vee (p' \leq (i+C) \times (i+C) \times (i+C)) \vee (p' \geq p)),$   (11.8)

that is, $p$ is a prime in the interval and every smaller prime is at most $(i+C)^3$, so $p$ is the smallest prime in the interval. We leave it to the reader to verify that PSEQ$(p, i)$ is true iff $p = p_i$. ■

To sum up, we have shown that for every quantified mixed statement $\varphi$, we can compute a quantified integer statement $\xi$ such that QMS$(\varphi) = 1$ if and only if QIS$(\xi) = 1$. Hence the uncomputability of QMS (Theorem 11.13) implies the uncomputability of QIS, completing the proof of Theorem 11.9, and so also the proof of Gödel's Incompleteness Theorem for quantified integer statements (Theorem 11.8).
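To make the string-to-integer encoding above concrete, here is a small Python sketch (not part of the book's text) mapping a string $x$ to the pair $(X, n)$ and recovering coordinates via divisibility. For simplicity it uses the ordinary sequence of primes rather than the constructible sequence of Lemma 11.15; all function names are illustrative.

def primes(k):
    """Return the first k prime numbers (naive trial division)."""
    found = []
    candidate = 2
    while len(found) < k:
        if all(candidate % p != 0 for p in found):
            found.append(candidate)
        candidate += 1
    return found

def encode(x):
    """Encode a bit-string x as (X, n): X is the product of the i-th
    prime over all i with x_i = 1, and n = |x|."""
    ps = primes(len(x))
    X = 1
    for i, bit in enumerate(x):
        if bit == "1":
            X *= ps[i]
    return X, len(x)

def coord(X, i):
    """COORD(X, i): True iff the i-th bit of the encoded string is 1,
    i.e. iff the i-th prime divides X."""
    p = primes(i + 1)[-1]
    return X % p == 0

x = "10110"
X, n = encode(x)
assert all(coord(X, i) == (x[i] == "1") for i in range(n))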

Chapter Recap

• Uncomputable functions also include functions that seem to have nothing to do with NAND-TM programs or other computational models, such as determining the satisfiability of Diophantine equations.

• This also implies that for any sound proof system $S$ (and in particular every finite axiomatic system), there are interesting statements $X$ (namely of the form "$F(x) = 0$" for an uncomputable function $F$) such that $S$ is not able to prove either $X$ or its negation.

11.6 EXERCISES

Exercise 11.1 — Gödel's Theorem from uncomputability of QIS. Prove Theorem 11.8 using Theorem 11.9.

■

Exercise 11.2 — Proof systems and uncomputability. Let FINDPROOF $: \{0,1\}^* \to \{0,1\}$ be the following function. On input a Turing machine $V$ (which we think of as the verifying algorithm for a proof system) and a string $x \in \{0,1\}^*$, FINDPROOF$(V, x) = 1$ if and only if there exists $w \in \{0,1\}^*$ such that $V(x, w) = 1$.

1. Prove that FINDPROOF is uncomputable.

2. Prove that there exists a Turing machine $V$ such that $V$ halts on every input but the function FINDPROOF$_V$ defined as FINDPROOF$_V(x) = $ FINDPROOF$(V, x)$ is uncomputable. See footnote for hint.⁴

⁴ Hint: think of $x$ as saying "Turing Machine $M$ halts on input $u$" and $w$ being a proof that is the number of steps that it will take for this to happen. Can you find an always-halting $V$ that will verify such statements?

■

Exercise 11.3 — Expression for floor. Let FSQRT$(n, m) = \forall_{j \in \mathbb{N}} ((j \times j) > m) \vee (j \leq n)$. Prove that FSQRT$(n, m)$ is true if and only if $n = \lfloor \sqrt{m} \rfloor$.

■

Exercise 11.4 — Axiomatic proof systems. For every representation of logical statements as strings, we can define an axiomatic proof system to consist of a finite set of strings $A$ and a finite set of rules $I_0, \ldots, I_{m-1}$ with $I_j : (\{0,1\}^*)^{k_j} \to \{0,1\}^*$ such that a proof $(s_1, \ldots, s_n)$ that $s_n$ is true is valid if for every $i$, either $s_i \in A$ or there are some $j \in [m]$ and $i_1, \ldots, i_{k_j} < i$ such that $s_i = I_j(s_{i_1}, \ldots, s_{i_{k_j}})$. A system is sound if there is no false $s$ such that there is a proof that $s$ is true. Prove that for every uncomputable function $F : \{0,1\}^* \to \{0,1\}$ and every sound axiomatic proof system $S$ (that is characterized by a finite number of axioms and inference rules), there is some input $x$ for which the proof system $S$ is not able to prove either that $F(x) = 0$ or that $F(x) \neq 0$.

■

Figure 11.3: In the puzzle problem, the input can be thought of as a finite collection $\Sigma$ of types of puzzle pieces and the goal is to find out whether or not there is a way to arrange pieces from these types in a rectangle. Formally, we model the input as a pair of functions $match_\leftrightarrow, match_\updownarrow : \Sigma^2 \to \{0,1\}$ such that $match_\leftrightarrow(left, right) = 1$ (respectively $match_\updownarrow(up, down) = 1$) if the pair of pieces are compatible when placed in their respective positions. We assume $\Sigma$ contains a special symbol $\emptyset$ corresponding to having no piece, and an arrangement of puzzle pieces in an $(m-2) \times (n-2)$ rectangle is modeled by a string $x \in \Sigma^{m \cdot n}$ whose "outer coordinates" are $\emptyset$ and such that for every $i \in [n-1]$, $j \in [m-1]$, $match_\updownarrow(x_{i,j}, x_{i+1,j}) = 1$ and $match_\leftrightarrow(x_{i,j}, x_{i,j+1}) = 1$.

Exercise 11.5 — Post Correspondence Problem. In the Post Correspondence Problem the input is a set $S = \{(\alpha^0, \beta^0), \ldots, (\alpha^{c-1}, \beta^{c-1})\}$ where each $\alpha^i$

and $\beta^i$ is a string in $\{0,1\}^*$. We say that PCP$(S) = 1$ if and only if there exists a list $(\alpha^0, \beta^0), \ldots, (\alpha^{m-1}, \beta^{m-1})$ of pairs in $S$ such that

$\alpha^0 \alpha^1 \cdots \alpha^{m-1} = \beta^0 \beta^1 \cdots \beta^{m-1}.$   (11.9)

(We can think of each pair $(\alpha, \beta) \in S$ as a "domino tile" and the question is whether we can stack a list of such tiles so that the top and the bottom yield the same string.) It can be shown that PCP is uncomputable by a fairly straightforward though somewhat tedious proof (see for example the Wikipedia page for the Post Correspondence Problem or Section 5.2 in [Sip97]). Use this fact to provide a direct proof that QMS is uncomputable by showing that there exists a computable map $R : \{0,1\}^* \to \{0,1\}^*$ such that PCP$(S) = $ QMS$(R(S))$ for every string $S$ encoding an instance of the Post Correspondence Problem.

■

Exercise 11.6 — Uncomputability of puzzle. Let PUZZLE $: \{0,1\}^* \to \{0,1\}$ be the problem of determining, given a finite collection of types of "puzzle pieces", whether it is possible to put them together in a rectangle, see Fig. 11.3. Formally, we think of such a collection as a finite set $\Sigma$ (see Fig. 11.3). We model the criteria as to which pieces "fit together" by a pair of finite functions $match_\updownarrow, match_\leftrightarrow : \Sigma^2 \to \{0,1\}$ such that a piece $a$ fits above a piece $b$ if and only if $match_\updownarrow(a, b) = 1$ and a piece $c$ fits to the left of a piece $d$ if and only if $match_\leftrightarrow(c, d) = 1$. To model the "straight edge" pieces that can be placed next to a "blank spot" we assume that $\Sigma$ contains the symbol $\emptyset$ and the matching functions are defined accordingly. A square tiling of $\Sigma$ is an $mn$-long string $x \in \Sigma^{mn}$, such that for every $i \in \{1, \ldots, m-2\}$ and $j \in \{1, \ldots, n-2\}$, $match(x_{i,j}, x_{i-1,j}, x_{i+1,j}, x_{i,j-1}, x_{i,j+1}) = 1$ (i.e., every "internal piece" fits in with the pieces adjacent to it). We also require that all of the "outer pieces" (i.e., $x_{i,j}$ where $i \in \{0, m-1\}$ or $j \in \{0, n-1\}$) are "blank" or equal to $\emptyset$. The function PUZZLE takes as input a string describing the set $\Sigma$ and the function $match$ and outputs $1$ if and only if there is some square tiling of $\Sigma$: some not all blank string $x \in \Sigma^{mn}$ satisfying the above condition.

1. Prove that PUZZLE is uncomputable.

2. Give a reduction from PUZZLE to QMS.

■

Exercise 11.7 — MRDP exercise. The MRDP theorem states that the problem of determining, given a $k$-variable polynomial $p$ with integer coefficients, whether there exist integers $x_0, \ldots, x_{k-1}$ such that $p(x_0, \ldots, x_{k-1}) = 0$, is uncomputable. Consider the following quadratic integer equation problem: the input is a list of polynomials $p_0, \ldots, p_{m-1}$ over $k$ variables with integer coefficients, where each of the polynomials is of degree at most two (i.e., it is a quadratic function). The goal is to determine whether there exist integers $x_0, \ldots, x_{k-1}$ that solve the equations $p_0(x) = \cdots = p_{m-1}(x) = 0$.

Use the MRDP Theorem to prove that this problem is uncomputable. That is, show that the function QUADINTEQ $: \{0,1\}^* \to \{0,1\}$ is uncomputable, where this function gets as input a string describing the polynomials $p_0, \ldots, p_{m-1}$ (each with integer coefficients and degree at most two), and outputs $1$ if and only if there exist $x_0, \ldots, x_{k-1} \in \mathbb{Z}$ such that for every $i \in [m]$, $p_i(x_0, \ldots, x_{k-1}) = 0$. See footnote for hint.⁵

⁵ You can replace the equation $y = x^4$ with the pair of equations $y = z^2$ and $z = x^2$. Also, you can replace the equation $w = x^6$ with the three equations $w = yu$, $y = x^2$ and $u = x^4$.

■

Exercise 11.8 — The Busy Beaver problem. In this question we define the NAND-TM variant of the busy beaver function.

1. We define the function $T : \{0,1\}^* \to \mathbb{N}$ as follows: for every string $P \in \{0,1\}^*$, if $P$ represents a NAND-TM program such that when $P$ is executed on the input $0$ (i.e., the string of length 1 that is simply $0$), a total of $M$ lines are executed before the program halts, then $T(P) = M$. Otherwise (if $P$ does not represent a NAND-TM program, or it is a program that does not halt on $0$), $T(P) = 0$. Prove that $T$ is uncomputable.

2. Let TOWER$(n)$ denote the number $2^{2^{\cdot^{\cdot^{\cdot^2}}}}$ ($n$ times), that is, a "tower of powers of two" of height $n$. To get a sense of how fast this function grows, TOWER$(1) = 2$, TOWER$(2) = 2^2 = 4$, TOWER$(3) = 2^{2^2} = 16$, TOWER$(4) = 2^{16} = 65536$ and TOWER$(5) = 2^{65536}$ which is about $10^{20000}$. TOWER$(6)$ is already a number that is too big to write even in scientific notation. Define NBB $: \mathbb{N} \to \mathbb{N}$ (for "NAND-TM Busy Beaver") to be the function NBB$(n) = \max_{P \in \{0,1\}^n} T(P)$ where $T$ is the function defined in Item 1. Prove that NBB grows faster than TOWER, in the sense that TOWER$(n) = o(\text{NBB}(n))$ (i.e., for every $\epsilon > 0$, there exists $n_0$ such that for every $n > n_0$, TOWER$(n) < \epsilon \cdot \text{NBB}(n)$).⁶

⁶ You will not need to use very specific properties of the TOWER function in this exercise. For example, NBB$(n)$ also grows faster than the Ackermann function. You might find Aaronson's blog post on the same topic to be quite interesting, and relevant to this book at large. If you like it then you might also enjoy this piece by Terence Tao.

■

11.7 BIBLIOGRAPHICAL NOTES

As mentioned before, Gödel, Escher, Bach [Hof99] is a highly recommended book covering Gödel's Theorem. A classic popular science book about Fermat's Last Theorem is [Sin97]. Cantor's diagonalization arguments are used for both Turing's and Gödel's theorems. In a twist of fate, using techniques originating from the works of Gödel and Tur-

ing, Paul Cohen showed in 1963 that Cantor's Continuum Hypothesis is independent of the axioms of set theory, which means that neither it nor its negation is provable from these axioms and hence in some sense can be considered as "neither true nor false" (see [Coh08]). The Continuum Hypothesis is the conjecture that for every subset $S$ of $\mathbb{R}$, either there is a one-to-one and onto map between $S$ and $\mathbb{N}$ or there is a one-to-one and onto map between $S$ and $\mathbb{R}$. It was conjectured by Cantor and listed by Hilbert in 1900 as one of the most important problems in mathematics. See also the non-conventional survey of Shelah [She03]. See here for recent progress on a related question.

Thanks to Alex Lombardi for pointing out an embarrassing mistake in the description of Fermat's Last Theorem. (I said that it was open for exponent 11 before Wiles' work.)

III EFFICIENT ALGORITHMS

Learning Objectives:
• Describe at a high level some interesting computational problems.
• The difference between polynomial and exponential time.
• Examples of techniques for obtaining efficient algorithms.
• Examples of how seemingly small differences in problems can potentially make huge differences in their computational complexity.

12 Efficient computation: An informal introduction

“The problem of distinguishing prime numbers from composite and of resolving the latter into their prime factors is … one of the most important and useful in arithmetic … Nevertheless we must confess that all methods … are either restricted to very special cases or are so laborious … they try the patience of even the practiced … and do not apply at all to larger numbers.”, Carl Friedrich Gauss, 1798

"For practical purposes, the difference between algebraic and exponential order is often more crucial than the difference between finite and non-finite.", Jack Edmonds, "Paths, Trees, and Flowers", 1963

“What is the most efficient way to sort a million 32-bit integers?”, Eric Schmidt to Barack Obama, 2008 “I think the bubble sort would be the wrong way to go.”, Barack Obama.

So far we have been concerned with which functions are computable and which ones are not. In this chapter we look at the finer question of the time that it takes to compute functions, as a function of their input length. Time complexity is extremely important to both the theory and practice of computing, but in introductory courses and coding interviews, terms such as "$O(n)$ running time" are often used in an informal way. People don't have a precise definition of what a linear-time algorithm is, but rather assume that "they'll know it when they see it". In this book we will define running time precisely, using the mathematical models of computation we developed in the previous chapters. This will allow us to ask (and sometimes answer) questions such as:

• "Is there a function that can be computed in $O(n^2)$ time but not in $O(n)$ time?"

• "Are there natural problems for which the best algorithm (and not just the best known) requires $2^{\Omega(n)}$ time?"


 Big Idea 16 The running time of an algorithm is not a number, it is a function of the length of the input.

This chapter: A non-mathy overview
In this chapter, we informally survey examples of computational problems. For some of these problems we know efficient (i.e., $O(n^c)$-time for a small constant $c$) algorithms, and for others the best known algorithms are exponential. We present these examples to get a feel as to the kinds of problems that lie on each side of this divide and also see how sometimes seemingly minor changes in problem formulation can make the (known) complexity of a problem "jump" from polynomial to exponential. We do not formally define the notion of running time in this chapter, but use the same "I know it when I see it" notion of $O(n)$ or $O(n^2)$ time algorithms as the one you've seen in introduction to computer science courses. We will see the precise definition of running time (using Turing machines and RAM machines / NAND-RAM) in Chapter 13.

While the difference between $O(n)$ and $O(n^2)$ time can be crucial in practice, in this book we focus on the even bigger difference between polynomial and exponential running time. As we will see, the difference between polynomial versus exponential time is typically insensitive to the choice of the particular computational model: a polynomial-time algorithm is still polynomial whether you use Turing machines, RAM machines, or a parallel cluster as your model of computation, and similarly an exponential-time algorithm will remain exponential in all of these platforms. One of the interesting phenomena of computing is that there is often a kind of a "threshold phenomenon" or "zero-one law" for running time. Many natural problems can either be solved in polynomial running time with a not-too-large exponent (e.g., something like $O(n^2)$ or $O(n^3)$), or require exponential (e.g., at least $2^{\Omega(n)}$ or $2^{\Omega(\sqrt{n})}$) time to solve. The reasons for this phenomenon are still not fully understood, but some light on it is shed by the concept of NP completeness, which we will see in Chapter 15. This chapter is merely a tiny sample of the landscape of computational problems and efficient algorithms. If you want to explore the field of algorithms and data structures more deeply (which I very much hope you do!), the bibliographical notes contain references to some excellent texts, some of which are available freely on the web.

R Remark 12.1 — Relations between parts of this book. Part I of this book contained a quantitative study of computation of finite functions. We asked what are the resources (in terms of gates of Boolean circuits or lines in straight-line programs) required to compute various finite functions. Part II of the book contained a qualitative study of computation of infinite functions (i.e., functions of unbounded input length). In that part we asked the qualitative question of whether or not a function is com- putable at all, regardless of the number of operations. Part III of the book, beginning with this chapter, merges the two approaches and contains a quantitative study of computation of infinite functions. In this part we ask how do resources for computing a function scale with the length of the input. In Chapter 13 we define the notion of running time, and the class P of functions that can be computed using a number of steps that scales polynomially with the input length. In Section 13.6 we will relate this class to the models of Boolean circuits and straightline programs that we studied in Part I.

12.1 PROBLEMS ON GRAPHS

In this chapter we discuss several examples of important computational problems. Many of the problems will involve graphs. We have already encountered graphs before (see Section 1.4.4) but now quickly recall the basic notation. A graph $G$ consists of a set of vertices $V$ and edges $E$ where each edge is a pair of vertices. We typically denote by $n$ the number of vertices (and in fact often consider graphs where the set $V$ of vertices equals the set $[n]$ of the integers between $0$ and $n-1$). In a directed graph, an edge is an ordered pair $(u, v)$, which we sometimes denote as $\overrightarrow{uv}$. In an undirected graph, an edge is an unordered pair (or simply a set) $\{u, v\}$ which we sometimes denote as $\overline{uv}$ or $u \sim v$. An equivalent viewpoint is that an undirected graph corresponds to a directed graph satisfying the property that whenever the edge $\overrightarrow{uv}$ is present then so is the edge $\overrightarrow{vu}$. In this chapter we restrict our attention to graphs that are undirected and simple (i.e., containing no parallel edges or self-loops). Graphs can be represented either in the adjacency list or adjacency matrix representation. We can transform between these two representations using $O(n^2)$ operations, and hence for our purposes we will mostly consider them as equivalent. Graphs are so ubiquitous in computer science and other sciences because they can be used to model a great many of the data that we encounter. These are not just the "obvious" data such as the road

network (which can be thought of as a graph whose vertices are locations with edges corresponding to road segments), or the web (which can be thought of as a graph whose vertices are web pages with edges corresponding to links), or social networks (which can be thought of as a graph whose vertices are people and the edges correspond to the friendship relation). Graphs can also denote correlations in data (e.g., graph of observations of features with edges corresponding to features that tend to appear together), causal relations (e.g., gene regulatory networks, where a gene is connected to gene products it derives), or the state space of a system (e.g., graph of configurations of a physical system, with edges corresponding to states that can be reached from one another in one step).

12.1.1 Finding the shortest path in a graph
The shortest path problem is the task of finding, given a graph $G = (V, E)$ and two vertices $s, t \in V$, the length of the shortest path between $s$ and $t$ (if such a path exists). That is, we want to find the smallest number $k$ such that there are vertices $v_0, v_1, \ldots, v_k$ with $v_0 = s$, $v_k = t$ and for every $i \in \{0, \ldots, k-1\}$ an edge between $v_i$ and $v_{i+1}$. Formally, we define MINPATH $: \{0,1\}^* \to \{0,1\}^*$ to be the function that on input a triple $(G, s, t)$ (represented as a string) outputs the number $k$ which is the length of the shortest path in $G$ between $s$ and $t$ or a string representing no path if no such path exists. (In practice people often want to also find the actual path and not just its length; it turns out that the algorithms to compute the length of the path often yield the actual path itself as a byproduct, and so everything we say about the task of computing the length also applies to the task of finding the path.)

Figure 12.1: Some examples of graphs found on the Internet.

If each vertex has at least two neighbors then there can be an exponential number of paths from $s$ to $t$, but fortunately we do not have to enumerate them all to find the shortest path. We can find the shortest path using a breadth first search (BFS), enumerating $s$'s neighbors, and then neighbors' neighbors, etc., in order. If we maintain the neighbors in a list we can perform a BFS in $O(n^2)$ time, while using a queue we can do this in $O(m)$ time.¹ Dijkstra's algorithm is a well-known generalization of BFS to weighted graphs. More formally, the algorithm for computing the function MINPATH is described in Algorithm 12.2.

¹ A queue is a data structure for storing a list of elements in "First In First Out (FIFO)" order. Each "pop" operation removes an element from the queue in the order that they were "pushed" into it; see the Wikipedia page.

Algorithm 12.2 — Shortest path via BFS.
Input: Graph $G = (V, E)$ and vertices $s, t \in V$. Assume $V = [n]$.
Output: Length $k$ of shortest path from $s$ to $t$, or $\infty$ if no such path exists.

1: Let $D$ be a length-$n$ array.
2: Set $D[s] = 0$ and $D[i] = \infty$ for all $i \in [n] \setminus \{s\}$.
3: Initialize queue $Q$ to contain $s$.
4: while $Q$ non empty do
5:   Pop $v$ from $Q$
6:   if $v = t$ then
7:     return $D[v]$
8:   end if
9:   for neighbor $u$ of $v$ with $D[u] = \infty$ do
10:    Set $D[u] \leftarrow D[v] + 1$
11:    Add $u$ to $Q$.
12:  end for
13: end while
14: return $\infty$
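The following Python sketch is a direct rendering of Algorithm 12.2, assuming the graph is given as an adjacency list (a dictionary mapping each vertex to the list of its neighbors); it is illustrative, not the book's code.

from collections import deque
import math

def min_path(adj, s, t):
    """Length of the shortest path from s to t in the graph given by the
    adjacency list `adj` (dict: vertex -> list of neighbors), or math.inf
    if t is unreachable. Mirrors Algorithm 12.2."""
    D = {v: math.inf for v in adj}   # distance estimates
    D[s] = 0
    Q = deque([s])
    while Q:
        v = Q.popleft()              # "pop" in FIFO order
        if v == t:
            return D[v]
        for u in adj[v]:
            if D[u] == math.inf:     # u not yet visited
                D[u] = D[v] + 1
                Q.append(u)
    return math.inf

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
assert min_path(adj, 0, 3) == 2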

Since we only add to the queue vertices $w$ with $D[w] = \infty$ (and then immediately set $D[w]$ to an actual number), we never push to the queue a vertex more than once, and hence the algorithm makes at most $n$ "push" and "pop" operations. For each vertex $v$, the number of times we run the inner loop is equal to the degree of $v$ and hence the total running time is proportional to the sum of all degrees which equals twice the number $m$ of edges. Algorithm 12.2 returns the correct answer since the vertices are added to the queue in the order of their distance from $s$, and hence we will reach $t$ after we have explored all the vertices that are closer to $s$ than $t$.

R Remark 12.3 — On data structures. If you've ever taken an algorithms course, you have probably encountered many data structures such as lists, arrays, queues, stacks, heaps, search trees, hash tables and many more. Data structures are extremely important in computer science, and each one of those offers different tradeoffs between overhead in storage, operations supported, cost in time for each operation, and more. For example, if we store $n$ items in a list, we will need a linear (i.e., $O(n)$ time) scan to retrieve an element, while we achieve the same operation in $O(1)$ time if we used a hash table. However, when we only care about polynomial-time algorithms, such factors of $O(n)$ in the running time will not make much difference.


Similarly, if we don't care about the difference between $O(n)$ and $O(n^2)$, then it doesn't matter if we represent graphs as adjacency lists or adjacency matrices. Hence we will often describe our algorithms at a very high level, without specifying the particular data structures that are used to implement them. However, it will always be clear that there exists some data structure that is sufficient for our purposes.

12.1.2 Finding the longest path in a graph
The longest path problem is the task of finding the length of the longest simple (i.e., non intersecting) path between a given pair of vertices $s$ and $t$ in a given graph $G$. If the graph is a road network, then the longest path might seem less motivated than the shortest path (unless you are the kind of person that always prefers the "scenic route"). But graphs can and are used to model a variety of phenomena, and in many such cases finding the longest path (and some of its variants) can be very useful. In particular, finding the longest path is a generalization of the famous Hamiltonian path problem which asks for a maximally long simple path (i.e., path that visits all $n$ vertices once) between $s$ and $t$, as well as the notorious traveling salesman problem (TSP) of finding (in a weighted graph) a path visiting all vertices of cost at most $w$. TSP is a classical optimization problem, with applications ranging from planning and logistics to DNA sequencing and astronomy.

Figure 12.2: A knight's tour can be thought of as a maximally long path on the graph corresponding to a chessboard where we put an edge between any two squares that can be reached by one step via a legal knight move.

Surprisingly, while we can find the shortest path in $O(m)$ time, there is no known algorithm for the longest path problem that significantly improves on the trivial "exhaustive search" or "brute force" algorithm that enumerates all the exponentially many possibilities for such paths. Specifically, the best known algorithms for the longest path problem take $O(c^n)$ time for some constant $c > 1$. (At the moment the best record is $c \sim 1.65$ or so; even obtaining an $O(2^n)$ time bound is not that simple, see Exercise 12.1.)
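To see what "exhaustive search" means here, the following hypothetical Python sketch (our own illustration, not the book's code) tries all orderings of the intermediate vertices, taking time roughly $n!$, even worse than the $O(2^n)$ bound mentioned above.

from itertools import permutations

def longest_path(adj, s, t):
    """Brute-force longest simple path from s to t: try every ordering of
    every subset of the intermediate vertices. Fine for tiny graphs only,
    which is exactly the point."""
    others = [v for v in adj if v not in (s, t)]
    best = -1
    for k in range(len(others) + 1):
        for mid in permutations(others, k):
            path = (s,) + mid + (t,)
            if all(path[i + 1] in adj[path[i]] for i in range(len(path) - 1)):
                best = max(best, len(path) - 1)
    return best  # -1 if no s-t path exists

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
assert longest_path(adj, 0, 3) == 3   # the path 0 -> 2 -> 1 -> 3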

12.1.3 Finding the minimum cut in a graph
Given a graph $G = (V, E)$, a cut of $G$ is a subset $S \subseteq V$ such that $S$ is neither empty nor is it all of $V$. The edges cut by $S$ are those edges where one of their endpoints is in $S$ and the other is in $\overline{S} = V \setminus S$. We denote this set of edges by $E(S, \overline{S})$. If $s, t \in V$ are a pair of vertices then an $s,t$ cut is a cut such that $s \in S$ and $t \in \overline{S}$ (see Fig. 12.3). The minimum $s,t$ cut problem is the task of finding, given $s$ and $t$, the minimum number $k$ such that there is an $s,t$ cut cutting $k$ edges (the problem is also sometimes phrased as finding the set $S$ that achieves this minimum; it turns out that algorithms to compute the number often yield the set as well). Formally, we define MINCUT $: \{0,1\}^* \to \{0,1\}^*$ to be the function that on input a string representing a triple $(G = (V, E), s, t)$ of a graph and two vertices, outputs the minimum number $k$ such that there exists a set $S \subseteq V$ with $s \in S$, $t \notin S$ and $|E(S, \overline{S})| = k$.

Computing minimum $s,t$ cuts is useful in many applications since minimum cuts often correspond to bottlenecks. For example, in a communication or railroad network the minimum cut between $s$ and $t$ corresponds to the smallest number of edges that, if dropped, will disconnect $s$ from $t$. (This was actually the original motivation for this problem; see Section 12.6.) Similar applications arise in scheduling and planning. In the setting of image segmentation, one can define a graph whose vertices are pixels and whose edges correspond to neighboring pixels of distinct colors. If we want to separate the foreground from the background then we can pick (or guess) a foreground pixel $s$ and background pixel $t$ and ask for a minimum cut between them.

Figure 12.3: A cut in a graph $G = (V, E)$ is simply a subset $S$ of its vertices. The edges that are cut by $S$ are all those whose one endpoint is in $S$ and the other one is in $\overline{S} = V \setminus S$. The cut edges are colored red in this figure.

The naive algorithm for computing MINCUT will check all $2^n$ possible subsets of an $n$-vertex graph, but it turns out we can do much better than that. As we've seen in this book time and again, there is more than one algorithm to compute the same function, and some of those algorithms might be more efficient than others. Luckily the minimum cut problem is one of those cases. In particular, as we will see in the next section, there are algorithms that compute MINCUT in time which is polynomial in the number of vertices.

12.1.4 Min-Cut Max-Flow and Linear programming
We can obtain a polynomial-time algorithm for computing MINCUT using the Max-Flow Min-Cut Theorem. This theorem says that the minimum cut between $s$ and $t$ equals the maximum amount of flow we can send from $s$ to $t$, if every edge has unit capacity. Specifically, imagine that every edge of the graph corresponded to a pipe that could carry one unit of fluid per one unit of time (say 1 liter of water per second). The maximum $s,t$ flow is the maximum units of water that we could transfer from $s$ to $t$ over these pipes. If there is an $s,t$ cut of $k$ edges, then the maximum flow is at most $k$. The reason is that such a cut acts as a "bottleneck" since at most $k$ units can flow from $S$ to its complement $\overline{S}$ at any given unit of time. This means that the maximum $s,t$ flow is always at most the value of the minimum $s,t$ cut. The surprising and non-trivial content of the Max-Flow Min-Cut Theorem is that the maximum flow is also at least the value of the minimum cut, and hence computing the cut is the same as computing the flow.

The Max-Flow Min-Cut Theorem reduces the task of computing a minimum cut to the task of computing a maximum flow. However, this still does not show how to compute such a flow. The Ford-Fulkerson

Algorithm is a direct way to compute a flow using incremental improvements. But computing flows in polynomial time is also a special case of a much more general tool known as linear programming. A flow on a graph $G$ of $m$ edges can be modeled as a vector $x \in \mathbb{R}^m$ where for every edge $e$, $x_e$ corresponds to the amount of water per time-unit that flows on $e$. We think of an edge as an ordered pair $(u, v)$ (we can choose the order arbitrarily) and let $x_e$ be the amount of flow that goes from $u$ to $v$. (If the flow is in the other direction then we make $x_e$ negative.) Since every edge has capacity one, we know that $-1 \leq x_e \leq 1$ for every edge $e$. A valid flow has the property that the amount of water leaving the source $s$ is the same as the amount entering the sink $t$, and that for every other vertex $v$, the amount of water entering and leaving $v$ is the same. Mathematically, we can write these conditions as follows:

$\sum_{e \ni s} x_e + \sum_{e \ni t} x_e = 0$
$\sum_{e \ni v} x_e = 0 \quad \forall_{v \in V \setminus \{s,t\}}$
$-1 \leq x_e \leq 1 \quad \forall_{e \in E}$   (12.1)

where for every vertex $v$, summing over $e \ni v$ means summing over all the edges that touch $v$.

The maximum flow problem can be thought of as the task of maximizing $\sum_{e \ni s} x_e$ over all the vectors $x \in \mathbb{R}^m$ that satisfy the above conditions (12.1). Maximizing a linear function $\ell(x)$ over the set of $x \in \mathbb{R}^m$ that satisfy certain linear equalities and inequalities is known as linear programming. Luckily, there are polynomial-time algorithms for solving linear programming, and hence we can solve the maximum flow (and so, equivalently, minimum cut) problem in polynomial time. In fact, there are much better algorithms for maximum-flow/minimum-cut, even for weighted directed graphs, with currently the record standing at $O(\min\{m^{10/7}, m\sqrt{n}\})$ time.
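As an illustration, the following Python sketch casts maximum flow as a linear program and solves it with scipy.optimize.linprog (which minimizes, so the objective is negated). This is a toy rendering of (12.1), not an efficient max-flow routine; the function and variable names are our own.

import numpy as np
from scipy.optimize import linprog

def max_flow(n, edges, s, t):
    m = len(edges)
    # Objective: maximize net flow out of s  <=>  minimize its negation.
    c = np.zeros(m)
    for j, (u, v) in enumerate(edges):
        if u == s: c[j] -= 1
        if v == s: c[j] += 1
    # Flow conservation at every vertex other than s and t.
    A_eq, b_eq = [], []
    for w in range(n):
        if w in (s, t):
            continue
        row = np.zeros(m)
        for j, (u, v) in enumerate(edges):
            if u == w: row[j] = -1   # flow leaving w
            if v == w: row[j] = +1   # flow entering w
        A_eq.append(row)
        b_eq.append(0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(-1, 1)] * m)   # unit capacities: -1 <= x_e <= 1
    return -res.fun

# Two disjoint s-t paths, so min cut = max flow = 2:
print(max_flow(4, [(0, 1), (1, 3), (0, 2), (2, 3)], 0, 3))  # ~2.0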

Solved Exercise 12.1 — Global minimum cut. Given a graph $G = (V, E)$, define the global minimum cut of $G$ to be the minimum over all $S \subseteq V$ with $S \neq \emptyset$ and $S \neq V$ of the number of edges cut by $S$. Prove that there is a polynomial-time algorithm to compute the global minimum cut of a graph.

■

Solution: By the above we know that there is a polynomial-time algorithm $A$ that on input $(G, s, t)$ finds the minimum $s,t$ cut in the graph $G$. Using $A$, we can obtain an algorithm $B$ that on input a graph $G$ computes the global minimum cut as follows (see also the sketch after these steps):

1. For every distinct pair $s, t \in V$, Algorithm $B$ sets $k_{s,t} \leftarrow A(G, s, t)$.

2. $B$ returns the minimum of $k_{s,t}$ over all distinct pairs $s, t$.
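In code, the reduction is just a loop. The sketch below assumes a hypothetical helper min_st_cut(G, s, t) playing the role of algorithm $A$, and a graph object exposing its vertex set as G.vertices; both names are our own.

from itertools import combinations

def global_min_cut(G, min_st_cut):
    """Algorithm B: reduce global minimum cut to minimum s,t-cut by
    trying every pair of distinct vertices (for an undirected graph,
    unordered pairs suffice since a cut and its complement cut the
    same edges)."""
    return min(min_st_cut(G, s, t)
               for s, t in combinations(G.vertices, 2))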

The running time of $B$ will be $O(n^2)$ times the running time of $A$ and hence polynomial time. Moreover, if the global minimum cut is $S$, then when $B$ reaches an iteration with $s \in S$ and $t \notin S$ it will obtain the value of this cut, and hence the value output by $B$ will be the value of the global minimum cut.

The above is our first example of a reduction in the context of polynomial-time algorithms. Namely, we reduced the task of computing the global minimum cut to the task of computing minimum $s,t$ cuts.

■

12.1.5 Finding the maximum cut in a graph
The maximum cut problem is the task of finding, given an input graph $G = (V, E)$, the subset $S \subseteq V$ that maximizes the number of edges cut by $S$. (We can also define an $s,t$-cut variant of the maximum cut like we did for minimum cut; the two variants have similar complexity but the global maximum cut is more common in the literature.) Like its cousin the minimum cut problem, the maximum cut problem is also very well motivated. For example, maximum cut arises in VLSI design, and also has some surprising relation to analyzing the Ising model in statistical physics.

Surprisingly, while (as we've seen) there is a polynomial-time algorithm for the minimum cut problem, there is no known algorithm solving maximum cut much faster than the trivial "brute force" algorithm that tries all $2^n$ possibilities for the set $S$.

Figure 12.4: In a convex function $f$ (left figure), for every $x$ and $y$ and $p \in [0,1]$ it holds that $f(px + (1-p)y) \leq p \cdot f(x) + (1-p) \cdot f(y)$. In particular this means that every local minimum of $f$ is also a global minimum. In contrast in a non convex function there can be many local minima.

Figure 12.5: In the high dimensional case, if $f$ is a convex function (left figure) the global minimum is the only local minimum, and we can find it by a local-search algorithm which can be thought of as dropping a marble and letting it "slide down" until it reaches the global minimum. In contrast, a non-convex function (right figure) might have an exponential number of local minima in which any local-search algorithm could get stuck.

12.1.6 A note on convexity
There is an underlying reason for the sometimes radical difference between the difficulty of maximizing and minimizing a function over a domain. If $D \subseteq \mathbb{R}^n$, then a function $f : D \to \mathbb{R}$ is convex if for every $x, y \in D$ and $p \in [0,1]$, $f(px + (1-p)y) \leq pf(x) + (1-p)f(y)$. That is, $f$ applied to the $p$-weighted midpoint between $x$ and $y$ is smaller than the $p$-weighted average value of $f$. If $D$ itself is convex (which means that if $x, y$ are in $D$ then so is the line segment between them), then this means that if $x$ is a local minimum of $f$ then it is also a global minimum. The reason is that if $f(y) < f(x)$ then every point $z = px + (1-p)y$ on the line segment between $x$ and $y$ will satisfy $f(z) \leq pf(x) + (1-p)f(y) < f(x)$ and hence in particular $x$ cannot be a local

minimum. Intuitively, local minima of functions are much easier to find than global ones: after all, any "local search" algorithm that keeps finding a nearby point on which the value is lower, will eventually arrive at a local minimum. One example of such a local search algorithm is gradient descent which takes a sequence of small steps, each one in the direction that would reduce the value by the most amount based on the current derivative.

Indeed, under certain technical conditions, we can often efficiently find the minimum of convex functions over a convex domain, and this is the reason why problems such as minimum cut and shortest path are easy to solve. On the other hand, maximizing a convex function over a convex domain (or equivalently, minimizing a concave function) can often be a hard computational task. A linear function is both convex and concave, which is the reason that both the maximization and minimization problems for linear functions can be done efficiently.

The minimum cut problem is not a priori a convex minimization task, because the set of potential cuts is discrete and not continuous. However, it turns out that we can embed it in a continuous and convex set via the (linear) maximum flow problem. The "max flow min cut" theorem ensures that this embedding is "tight" in the sense that the minimum "fractional cut" that we obtain through the maximum-flow linear program will be the same as the true minimum cut. Unfortunately, we don't know of such a tight embedding in the setting of the maximum cut problem.

Convexity arises time and again in the context of efficient computation. For example, one of the basic tasks in machine learning is empirical risk minimization. This is the task of finding a classifier for a given set of training examples. That is, the input is a list of labeled examples $(x_0, y_0), \ldots, (x_{m-1}, y_{m-1})$, where each $x_i \in \{0,1\}^n$ and $y_i \in \{0,1\}$, and the goal is to find a classifier $h : \{0,1\}^n \to \{0,1\}$ (or sometimes $h : \{0,1\}^n \to \mathbb{R}$) that minimizes the number of errors. More generally, we want to find $h$ that minimizes

$\sum_{i=0}^{m-1} L(y_i, h(x_i))$   (12.2)

where $L$ is some loss function measuring how far is the predicted label $h(x_i)$ from the true label $y_i$. When $L$ is the square loss function $L(y, y') = (y - y')^2$ and $h$ is a linear function, empirical risk minimization corresponds to the well-known convex minimization task of linear regression. In other cases, when the task is non convex, there can be many global or local minima. That said, even if we don't find the global (or even a local) minima, this continuous embedding can still help us. In particular, when running a local improvement algorithm

such as Gradient Descent, we might still find a function $h$ that is "useful" in the sense of having a small error on future examples from the same distribution.
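As a concrete illustration of empirical risk minimization with the square loss, here is a minimal gradient descent sketch (our own example, not the book's); since the loss (12.2) is convex in the weight vector $w$ of a linear predictor $h(x) = w \cdot x$, gradient descent finds the global minimum.

import numpy as np

def gradient_descent_erm(X, y, steps=1000, lr=0.1):
    """Minimize (1/m) * sum_i (y_i - w.x_i)^2 over w by gradient descent."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        grad = -2.0 / m * X.T @ (y - X @ w)   # gradient of the square loss
        w -= lr * grad
    return w

# Recover a planted linear function from noiseless examples:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
w = gradient_descent_erm(X, X @ w_true)
print(np.allclose(w, w_true, atol=1e-3))  # True: the loss is convex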

12.2 BEYOND GRAPHS

Not all computational problems arise from graphs. We now list some other examples of computational problems that are of great interest.

12.2.1 SAT
A Boolean formula $\varphi$ involves $n$ variables $x_1, \ldots, x_n$ and the logical operators AND ($\wedge$), OR ($\vee$), and NOT ($\neg$, also denoted as $\overline{\,\cdot\,}$). We say that such a formula is in conjunctive normal form (CNF for short) if it is an AND of ORs of variables or their negations (we call a term of the form $x_i$ or $\overline{x_i}$ a literal). For example, this is a CNF formula

$(x_7 \vee \overline{x_{22}} \vee x_{15}) \wedge (x_{37} \vee \overline{x_{22}} \vee \overline{x_{55}}) \wedge (\overline{x_{55}} \vee x_7)$   (12.3)

The satisfiability problem is the task of determining, given a CNF formula $\varphi$, whether or not there exists a satisfying assignment for $\varphi$. A satisfying assignment for $\varphi$ is a string $x \in \{0,1\}^n$ such that $\varphi$ evaluates to True if we assign its variables the values of $x$. The SAT problem might seem as an abstract question of interest only in logic but in fact SAT is of huge interest in industrial optimization, with applications including manufacturing planning, circuit synthesis, software verification, air-traffic control, scheduling sports tournaments, and more.
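To make the definition concrete, here is a brute-force Python sketch for SAT; the encoding of literals as signed integers is a common convention we adopt for illustration, not the book's notation.

from itertools import product

def brute_force_sat(formula, n):
    """Check satisfiability of a CNF formula over n variables by trying
    all 2^n assignments. `formula` is a list of clauses; each clause is a
    list of literals, where +i means x_i and -i means NOT x_i."""
    for assignment in product([False, True], repeat=n):
        def lit_value(lit):
            value = assignment[abs(lit) - 1]
            return value if lit > 0 else not value
        if all(any(lit_value(l) for l in clause) for clause in formula):
            return True
    return False

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
print(brute_force_sat([[1, -2], [2, 3], [-1, -3]], 3))  # True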

2SAT. We say that a formula is a $k$-CNF if it is an AND of ORs where each OR involves exactly $k$ literals. The $k$-SAT problem is the restriction of the satisfiability problem for the case that the input formula is a $k$-CNF. In particular, the 2SAT problem is to find out, given a $2$-CNF formula $\varphi$, whether there is an assignment $x \in \{0,1\}^n$ that satisfies $\varphi$, in the sense that it makes it evaluate to $1$ or "True". The trivial, brute-force, algorithm for 2SAT will enumerate all the $2^n$ assignments $x \in \{0,1\}^n$ but fortunately we can do much better. The key is that we can think of every constraint of the form $\ell_i \vee \ell_j$ (where $\ell_i, \ell_j$ are literals, corresponding to variables or their negations) as an implication $\overline{\ell_i} \Rightarrow \ell_j$, since it corresponds to the constraint that if the literal $\ell'_i = \overline{\ell_i}$ is true then it must be the case that $\ell_j$ is true as well. Hence we can think of $\varphi$ as a directed graph between the $2n$ literals, with an edge from $\ell_i$ to $\ell_j$ corresponding to an implication from the former to the latter. It can be shown that $\varphi$ is unsatisfiable if and only if there is a variable $x_i$ such that there is a directed path from $x_i$ to $\overline{x_i}$ as well as a directed path from $\overline{x_i}$ to $x_i$ (see Exercise 12.2). This reduces 2SAT to the (efficiently solvable) problem of determining connectivity in directed graphs.
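The implication-graph argument translates directly into code. The following sketch decides 2SAT by checking, for each variable, mutual reachability between the two literals; a real implementation would use strongly connected components to run in linear time.

def two_sat(clauses, n):
    """Decide 2SAT over n variables: phi is unsatisfiable iff some
    variable has a directed path from x to NOT x and from NOT x to x.
    Literals: +i is x_i, -i is NOT x_i."""
    # Each clause (a OR b) yields implications NOT a => b and NOT b => a.
    edges = {}
    for a, b in clauses:
        edges.setdefault(-a, set()).add(b)
        edges.setdefault(-b, set()).add(a)

    def reaches(src, dst):
        seen, stack = {src}, [src]
        while stack:
            v = stack.pop()
            if v == dst:
                return True
            for u in edges.get(v, ()):
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        return False

    return not any(reaches(i, -i) and reaches(-i, i) for i in range(1, n + 1))

# All four clauses on x1, x2 together force a contradiction:
print(two_sat([(1, 2), (-1, 2), (1, -2), (-1, -2)], 2))  # False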

3SAT. The 3SAT problem is the task of determining satisfiability for 3CNFs. One might think that changing from two to three would not make that much of a difference for complexity. One would be wrong. Despite much effort, we do not know of a significantly better than brute force algorithm for 3SAT (the best known algorithms take roughly $1.3^n$ steps).

Interestingly, a similar issue arises time and again in computation, where the difference between two and three often corresponds to the difference between tractable and intractable. We do not fully understand the reasons for this phenomenon, though the notion of NP completeness we will see later does offer a partial explanation. It may be related to the fact that optimizing a polynomial often amounts to equations on its derivative. The derivative of a quadratic polynomial is linear, while the derivative of a cubic is quadratic, and, as we will see, the difference between solving linear and quadratic equations can be quite profound.

12.2.2 Solving linear equations
One of the most useful problems that people have been solving time and again is solving $n$ linear equations in $n$ variables. That is, solve equations of the form

$a_{0,0}x_0 + a_{0,1}x_1 + \cdots + a_{0,n-1}x_{n-1} = b_0$
$a_{1,0}x_0 + a_{1,1}x_1 + \cdots + a_{1,n-1}x_{n-1} = b_1$
$\vdots$
$a_{n-1,0}x_0 + a_{n-1,1}x_1 + \cdots + a_{n-1,n-1}x_{n-1} = b_{n-1}$   (12.4)

where $\{a_{i,j}\}_{i,j \in [n]}$ and $\{b_i\}_{i \in [n]}$ are real (or rational) numbers. More compactly, we can write this as the equations $Ax = b$ where $A$ is an $n \times n$ matrix, and we think of $x, b$ as column vectors in $\mathbb{R}^n$.

The standard Gaussian elimination algorithm can be used to solve such equations in polynomial time (i.e., determine if they have a solution, and if so, to find it). As we discussed above, if we are willing to allow some loss in precision, we even have algorithms that handle linear inequalities, also known as linear programming. In contrast, if we insist on integer solutions, the task of solving for linear equalities or inequalities is known as integer programming, and the best known algorithms are exponential time in the worst case.
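In practice, solving such a system is a single library call; the sketch below uses numpy's built-in solver (which is based on Gaussian elimination / LU decomposition):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)   # unique solution when A is nonsingular
print(x)                    # [0.8 1.4]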

R Remark 12.4 — Bit complexity of numbers. Whenever we discuss problems whose inputs correspond to numbers, the input length corresponds to how many bits are needed to describe the number (or, as is equivalent up to a constant factor, the number of digits in base 10, 16 or any other constant). The difference between the length of the input and the magnitude of the number itself can be of course quite profound. For example, most people would agree that there is a huge difference between having a billion (i.e., $10^9$) dollars and having nine dollars. Similarly there is a huge difference between an algorithm that takes $n$ steps on an $n$-bit number and an algorithm that takes $2^n$ steps.

One example is the problem (discussed below) of finding the prime factors of a given integer $N$. The natural algorithm is to search for such a factor by trying all numbers from $1$ to $N$, but that would take $N$ steps which is exponential in the input length, which is the number of bits needed to describe $N$. (The running time of this algorithm can be easily improved to roughly $\sqrt{N}$, but this is still exponential (i.e., $2^{n/2}$) in the number $n$ of bits to describe $N$.) It is an important and long open question whether there is such an algorithm that runs in time polynomial in the input length (i.e., polynomial in $\log N$).
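Here is the natural factoring algorithm from the remark as a Python sketch; the loop runs about $\sqrt{N} \approx 2^{n/2}$ times on an $n$-bit input, i.e., exponentially many times in the input length.

def smallest_factor(N):
    """Trial division up to sqrt(N): returns the smallest prime factor
    of N, or N itself if N is prime."""
    d = 2
    while d * d <= N:
        if N % d == 0:
            return d
        d += 1
    return N

print(smallest_factor(2**31 - 1))  # 2147483647 is prime; ~46341 iterations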

12.2.3 Solving quadratic equations
Suppose that we want to solve not just linear but also equations involving quadratic terms of the form $a_{i,j,k} x_j x_k$. That is, suppose that we are given a set of quadratic polynomials $p_1, \ldots, p_m$ and consider the equations $\{p_i(x) = 0\}$. To avoid issues with bit representations, we will always assume that the equations contain the constraints $\{x_i^2 - x_i = 0\}_{i \in [n]}$. Since only $0$ and $1$ satisfy the equation $a^2 - a = 0$, this assumption means that we can restrict attention to solutions in $\{0,1\}^n$. Solving quadratic equations in several variables is a classical and extremely well motivated problem. This is the generalization of the classical case of single-variable quadratic equations that generations of high school students grapple with. It also generalizes the quadratic assignment problem, introduced in the 1950's as a way to optimize assignment of economic activities. Once again, we do not know a much better algorithm for this problem than the one that enumerates over all the $2^n$ possibilities.
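Since we may restrict attention to zero/one solutions, the brute-force algorithm is easy to write down; the following sketch (with polynomials represented as Python functions, an illustrative choice) takes the $2^n$ time mentioned above.

from itertools import product

def quad_int_eq(polys, n):
    """Brute force for quadratic equations with the constraints
    x_i^2 - x_i = 0 built in: just try all 2^n zero/one assignments."""
    return any(all(p(x) == 0 for p in polys)
               for x in product((0, 1), repeat=n))

# x0*x1 - x2 = 0 and x0 + x1 + x2 - 3 = 0 has the solution (1, 1, 1):
polys = [lambda x: x[0] * x[1] - x[2], lambda x: x[0] + x[1] + x[2] - 3]
print(quad_int_eq(polys, 3))  # True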

12.3 MORE ADVANCED EXAMPLES

We now list a few more examples of interesting problems that are a little more advanced but are of significant interest in areas such as physics, economics, number theory, and cryptography.

12.3.1 Determinant of a matrix
The determinant of an $n \times n$ matrix $A$, denoted by $\det(A)$, is an extremely important quantity in linear algebra. For example, it is known that $\det(A) \neq 0$ if and only if $A$ is nonsingular, which means that it has an inverse $A^{-1}$, and hence we can always uniquely solve equations of the form $Ax = b$ where $x$ and $b$ are $n$-dimensional vectors. More generally, the determinant can be thought of as a quantitative measure as to what extent $A$ is far from being singular. If the rows of $A$ are "almost" linearly dependent (for example, if the third row is very close to being a linear combination of the first two rows) then the determinant will be small, while if they are far from it (for example, if they are orthogonal to one another) then the determinant will be large. In particular, for every matrix $A$, the absolute value of the determinant of $A$ is at most the product of the norms (i.e., square root of sum of squares of entries) of the rows, with equality if and only if the rows are orthogonal to one another.
The determinant can be defined in several ways. One way to define the determinant of an $n \times n$ matrix $A$ is:

$$\det(A) = \sum_{\pi\in S_n} \mathrm{sign}(\pi)\prod_{i\in[n]} A_{i,\pi(i)} \qquad (12.5)$$

where $S_n$ is the set of all permutations from $[n]$ to $[n]$ and the sign of a permutation $\pi$ is equal to $-1$ raised to the power of the number of inversions in $\pi$ (pairs $i,j$ such that $i > j$ but $\pi(i) < \pi(j)$). This definition suggests that computing $\det(A)$ might require summing over $|S_n|$ terms, which would take exponential time since $|S_n| = n! > 2^n$. However, there are other ways to compute the determinant. For example, it is known that $\det$ is the only function that satisfies the following conditions:

1. $\det(AB) = \det(A)\det(B)$ for every pair of $n \times n$ square matrices $A, B$.

2. For every $n \times n$ triangular matrix $T$ with diagonal entries $d_0,\ldots,d_{n-1}$, $\det(T) = \prod_{i=0}^{n-1} d_i$. In particular $\det(I) = 1$ where $I$ is the identity matrix. (A triangular matrix is one in which either all entries below the diagonal, or all entries above the diagonal, are zero.)

3. $\det(S) = -1$ where $S$ is a "swap matrix" that corresponds to swapping two rows or two columns of $I$. That is, there are two coordinates $a, b$ such that for every $i, j$,

$$S_{i,j} = \begin{cases} 1 & i = j,\; i \notin \{a,b\} \\ 1 & \{i,j\} = \{a,b\} \\ 0 & \text{otherwise} \end{cases}$$

Using these rules and the Gaussian elimination algorithm, it is possible to tell whether $A$ is singular or not, and in the latter case, decompose $A$ as a product of a polynomial number of swap matrices and triangular matrices. (Indeed one can verify that the row operations in Gaussian elimination correspond to either multiplying by a swap matrix or by a triangular matrix.) Hence we can compute the determinant of an $n \times n$ matrix using a polynomial number of arithmetic operations.

12.3.2 Permanent of a matrix
Given an $n \times n$ matrix $A$, the permanent of $A$ is defined as

$$\mathrm{perm}(A) = \sum_{\pi\in S_n}\prod_{i\in[n]} A_{i,\pi(i)}. \qquad (12.6)$$

That is, $\mathrm{perm}(A)$ is defined analogously to the determinant in (12.5) except that we drop the term $\mathrm{sign}(\pi)$. The permanent of a matrix is a natural quantity, and has been studied in several contexts including combinatorics and graph theory. It also arises in physics where it can be used to describe the quantum state of multiple Boson particles (see here and here).

Permanent modulo 2. If the entries of $A$ are integers, then we can define the Boolean function $perm_2$ which outputs on input a matrix $A$ the result of the permanent of $A$ modulo $2$. It turns out that we can compute $perm_2(A)$ in polynomial time. The key is that modulo $2$, $-x$ and $+x$ are the same quantity and hence, since the only difference between (12.5) and (12.6) is that some terms are multiplied by $-1$, $\det(A) \bmod 2 = \mathrm{perm}(A) \bmod 2$ for every $A$.
Permanent modulo 3. Emboldened by our good fortune above, we might hope to be able to compute the permanent modulo any prime $p$ and perhaps in full generality. Alas, we have no such luck. In a similar "two to three" type of a phenomenon, we do not know of a much better than brute force algorithm to even compute the permanent modulo $3$.
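The "modulo 2" phenomenon can be checked directly in Python. The sketch below (function names ours) computes the permanent mod 2 both from the $n!$-term definition and via Gaussian elimination over GF(2); the two agree because $\det(A) \bmod 2 = \mathrm{perm}(A) \bmod 2$.

from itertools import permutations
import random

def perm_mod2_bruteforce(A):
    """Permanent mod 2 via the definition: a sum over all n! permutations."""
    n = len(A)
    total = 0
    for pi in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][pi[i]]
        total += prod
    return total % 2

def perm_mod2_fast(A):
    """Permanent mod 2 equals determinant mod 2 (signs vanish since
    -1 = 1 mod 2), computable in O(n^3) by elimination over GF(2)."""
    n = len(A)
    M = [[A[i][j] % 2 for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col]), None)
        if pivot is None:
            return 0  # singular mod 2, so determinant (= permanent) is 0
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            if M[r][col]:
                M[r] = [(M[r][j] + M[col][j]) % 2 for j in range(n)]
    return 1  # full rank: all diagonal entries are 1 mod 2

random.seed(0)
A = [[random.randint(0, 1) for _ in range(5)] for _ in range(5)]
print(perm_mod2_bruteforce(A), perm_mod2_fast(A))  # the two values agree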

12.3.3 Finding a zero-sum equilibrium
A zero sum game is a game between two players where the payoff for one is the same as the penalty for the other. That is, whatever the first player gains, the second player loses. As much as we want to avoid them, zero sum games do arise in life, and the one good thing about them is that at least we can compute the optimal strategy.
A zero sum game can be specified by an $n \times n$ matrix $A$, where if player 1 chooses action $i$ and player 2 chooses action $j$ then player one gets $A_{i,j}$ and player 2 loses the same amount. The famous Min Max Theorem by John von Neumann states that if we allow probabilistic or "mixed" strategies (where a player does not choose a single action but rather a distribution over actions) then it does not matter who plays first and the end result will be the same. Mathematically the minmax theorem is that if we let $\Delta_n$ be the set of probability distributions over $[n]$ (i.e., non-negative column vectors in $\mathbb{R}^n$ whose entries sum to $1$) then

$$\max_{p\in\Delta_n}\min_{q\in\Delta_n} p^\top A q = \min_{q\in\Delta_n}\max_{p\in\Delta_n} p^\top A q \qquad (12.7)$$

The min-max theorem turns out to be a corollary of linear programming duality, and indeed the value of (12.7) can be computed efficiently by a linear program.
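As an illustration of this connection, here is a hedged Python sketch that computes the value of a zero-sum game with the off-the-shelf linear-programming routine scipy.optimize.linprog; the particular LP formulation (maximize $v$ subject to $(p^\top A)_j \geq v$ for every column $j$) and the helper name zero_sum_value are our own choices.

import numpy as np
from scipy.optimize import linprog

def zero_sum_value(A):
    """Compute max_p min_q p^T A q by linear programming.
    Variables are (p_0, ..., p_{n-1}, v); we maximize v subject to
    v <= (p^T A)_j for every column j, p >= 0, and sum(p) = 1."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0  # linprog minimizes, so minimize -v
    # Row j of A_ub encodes: v - (p^T A)_j <= 0.
    A_ub = np.hstack([-A.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Probability constraint: sum(p) = 1 (v unconstrained).
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

# Rock-paper-scissors: payoff 1 for a win, -1 for a loss, 0 for a tie.
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
p, v = zero_sum_value(rps)
print(p, v)  # roughly [1/3, 1/3, 1/3] with game value 0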

12.3.4 Finding a Nash equilibrium
Fortunately, not all real-world games are zero sum, and we do have more general games, where the payoff of one player does not necessarily equal the loss of the other. John Nash won the Nobel prize for showing that there is a notion of equilibrium for such games as well. In many economic texts it is taken as an article of faith that when actual agents are involved in such a game then they reach a Nash equilibrium. However, unlike zero sum games, we do not know of an efficient algorithm for finding a Nash equilibrium given the description of a general (non zero sum) game. In particular this means that, despite economists' intuitions, there are games for which natural strategies will take an exponential number of steps to converge to an equilibrium.

12.3.5 Primality testing
Another classical computational problem, that has been of interest since the ancient Greeks, is to determine whether a given number $N$ is prime or composite. Clearly we can do so by trying to divide it with all the numbers in $2,\ldots,N-1$, but this would take at least $N$ steps, which is exponential in its bit complexity $n = \log N$. We can reduce these steps to $\sqrt{N}$ by observing that if $N$ is a composite of the form $N = PQ$ then either $P$ or $Q$ is smaller than $\sqrt{N}$. But this is still quite terrible. If $N$ is a $1024$-bit integer, $\sqrt{N}$ is about $2^{512}$, and so running this algorithm on such an input would take much more than the lifetime of the universe.
Luckily, it turns out we can do radically better. In the 1970's, Rabin and Miller gave probabilistic algorithms to determine whether a given number $N$ is prime or composite in time $poly(n)$ for $n = \log N$. We will discuss the probabilistic model of computation later in this course. In 2002, Agrawal, Kayal, and Saxena found a deterministic $poly(n)$ time algorithm for this problem. This is surely a development that mathematicians from Archimedes till Gauss would have found exciting.
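For concreteness, the following is a minimal Python sketch in the spirit of the Rabin-Miller test (not their exact formulation; the function name and the choice of 40 trials are our own).

import random

def is_probably_prime(N, trials=40):
    """Rabin-Miller style probabilistic primality test. Runs in poly(log N)
    time; a composite N passes each trial with probability at most 1/4."""
    if N < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if N % p == 0:
            return N == p
    # Write N - 1 = 2^r * d with d odd.
    d, r = N - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(trials):
        a = random.randrange(2, N - 1)
        x = pow(a, d, N)  # fast modular exponentiation: poly(log N) steps
        if x in (1, N - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, N)
            if x == N - 1:
                break
        else:
            return False  # a is a witness that N is composite
    return True

print(is_probably_prime(2**127 - 1))  # True: a Mersenne prime
print(is_probably_prime(2**127 + 1))  # False: divisible by 3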

12.3.6 Integer factoring
Given that we can efficiently determine whether a number $N$ is prime or composite, we could expect that in the latter case we could also efficiently find the factorization of $N$. Alas, no such algorithm is known. In a surprising and exciting turn of events, the non existence of such an algorithm has been used as a basis for encryptions, and indeed it underlies much of the security of the Internet. We will return to the factoring problem later in this course. We remark that we do know much better than brute force algorithms for this problem. While the brute force algorithms would require $2^{\Omega(n)}$ time to factor an $n$-bit integer, there are known algorithms running in time roughly $2^{O(\sqrt{n \log n})}$ and also algorithms that are widely believed (though not fully rigorously analyzed) to run in time roughly $2^{O(n^{1/3})}$. (By "roughly" we mean that we neglect factors that are polylogarithmic in $n$.)

12.4 OUR CURRENT KNOWLEDGE

The difference between an exponential and polynomial time algorithm might seem merely "quantitative" but it is in fact extremely significant. As we've already seen, the brute force exponential time algorithm runs out of steam very very fast, and as Edmonds says, in practice there might not be much difference between a problem where the best algorithm is exponential and a problem that is not solvable at all. Thus the efficient algorithms we mentioned above are widely used and power many computer science applications. Moreover, a polynomial-time algorithm often arises out of significant insight into the problem at hand, whether it is the "max-flow min-cut" result, the solvability of the determinant, or the group theoretic structure that enables primality testing. Such insight can be useful regardless of its computational implications.

Figure 12.6: The current computational status of several interesting problems. For all of them we either know a polynomial-time algorithm or the known algorithms require at least $2^{n^c}$ for some $c > 0$. In fact for all except the factoring problem, we either know an $O(n^3)$ time algorithm or the best known algorithms require at least $2^{\Omega(n)}$ time where $n$ is a natural parameter such that there is a brute force algorithm taking roughly $2^n$ or $n!$ time. Whether this "cliff" between the easy and hard problems is a real phenomenon or a reflection of our ignorance is still an open question.

At the moment we do not know whether the "hard" problems are truly hard, or whether it is merely because we haven't yet found the right algorithms for them. However, we will now see that there are problems that do inherently require exponential time. We just don't know if any of the examples above fall into that category.

Chapter Recap

• There are many natural problems that have polynomial-time algorithms, and other natural problems that we'd love to solve, but for which the best known algorithms are exponential.

• Often a polynomial time algorithm relies on discovering some hidden structure in the problem, or finding a surprising equivalent formulation for it.

• There are many interesting problems where there is an exponential gap between the best known algorithm and the best algorithm that we can rule out. Closing this gap is one of the main open questions of theoretical computer science.

12.5 EXERCISES

Exercise 12.1 — exponential time algorithm for longest path. The naive algorithm for computing the longest path in a given graph could take more than $n!$ steps. Give a $poly(n)2^n$ time algorithm for the longest path problem in $n$ vertex graphs. (Hint: Use dynamic programming to compute for every $s,t \in [n]$ and $S \subseteq [n]$ the value $P(s,t,S)$ which equals $1$ if there is a simple path from $s$ to $t$ that uses exactly the vertices in $S$. Do this iteratively for $S$'s of growing sizes.) ■

Exercise 12.2 — 2SAT algorithm. For every 2CNF $\varphi$, define the graph $G_\varphi$ on $2n$ vertices corresponding to the literals $x_1,\ldots,x_n,\overline{x}_1,\ldots,\overline{x}_n$, such that there is a directed edge $\overrightarrow{\ell_i\,\ell_j}$ iff the constraint $\overline{\ell}_i \vee \ell_j$ is in $\varphi$. Prove that $\varphi$ is unsatisfiable if and only if there is some $i$ such that there is a path from $x_i$ to $\overline{x}_i$ and from $\overline{x}_i$ to $x_i$ in $G_\varphi$. Show how to use this to solve 2SAT in polynomial time. ■

Exercise 12.3 — Reductions for showing algorithms. The following fact is true: there is a polynomial-time algorithm BIP that on input a graph $G = (V,E)$ outputs $1$ if and only if the graph is bipartite: there is a partition of $V$ to disjoint parts $S$ and $T$ such that every edge $(u,v) \in E$ satisfies either $u \in S$ and $v \in T$ or $u \in T$ and $v \in S$. Use this fact to prove that there is a polynomial-time algorithm to compute the following function CLIQUEPARTITION that on input a graph $G = (V,E)$ outputs $1$ if and only if there is a partition of the graph into two parts $S$ and $T$ such that both $S$ and $T$ are cliques: for every pair of distinct vertices $u,v \in S$, the edge $(u,v)$ is in $E$ and similarly for every pair of distinct vertices $u,v \in T$, the edge $(u,v)$ is in $E$. ■

12.6 BIBLIOGRAPHICAL NOTES

The classic undergraduate introduction to algorithms text is [Cor+09]. Two texts that are less "encyclopedic" are Kleinberg and Tardos [KT06], and Dasgupta, Papadimitriou and Vazirani [DPV08]. Jeff Erickson's book is an excellent algorithms text that is freely available online. The origins of the minimum cut problem date to the Cold War. Specifically, Ford and Fulkerson discovered their max-flow/min-cut algorithm in 1955 as a way to find out the minimum amount of train

tracks that would need to be blown up to disconnect the Soviet Union from the rest of Europe. See the survey [Sch05] for more. Some algorithms for the longest path problem are given in [Wil09; Bjo14].

12.7 FURTHER EXPLORATIONS

Some topics related to this chapter that might be accessible to advanced students include: (to be completed)

Learning Objectives:
• Formally modeling running time, and in particular notions such as $O(n)$ or $O(n^3)$ time algorithms.
• The classes P and EXP modelling polynomial and exponential time respectively.
• The time hierarchy theorem, that in particular says that for every $k \geq 1$ there are functions we can compute in $O(n^{k+1})$ time but can not compute in $O(n^k)$ time.
• The class P$_{/poly}$ of non uniform computation and the result that P $\subseteq$ P$_{/poly}$.

13
Modeling running time

"When the measure of the problem-size is reasonable and when the sizes assume values arbitrarily large, an asymptotic estimate of … the order of difficulty of [an] algorithm … is theoretically important. It cannot be rigged by making the algorithm artificially difficult for smaller sizes", Jack Edmonds, "Paths, Trees, and Flowers", 1963

Max Newman: It is all very well to say that a machine could … do this or that, but … what about the time it would take to do it? Alan Turing: To my mind this time factor is the one question which will involve all the real technical difficulty. BBC radio panel on “Can automatic Calculating Machines Be Said to Think?”, 1952

In Chapter 12 we saw examples of efficient algorithms, and made some claims about their running time, but did not give a mathematically precise definition for this concept. We do so in this chapter, using the models of Turing machines and RAM machines (or equivalently NAND-TM and NAND-RAM) we have seen before. The running time of an algorithm is not a fixed number since any non-trivial algorithm will take longer to run on longer inputs. Thus, what we want to measure is the dependence between the number of steps the algorithm takes and the length of the input. In particular we care about the distinction between algorithms that take at most polynomial time (i.e., $O(n^c)$ time for some constant $c$) and problems for which every algorithm requires at least exponential time (i.e., $\Omega(2^{n^c})$ for some $c$). As mentioned in Edmonds' quote in Chapter 12, the difference between these two can sometimes be as important as the difference between being computable and uncomputable.

This chapter: A non-mathy overview
In this chapter we formally define what it means for a function to be computable in a certain number of steps. As discussed in Chapter 12, running time is not a number, rather


Figure 13.1: Overview of the results of this chapter.

what we care about is the scaling behavior of the number of steps as the input size grows. We can use either Turing machines or RAM machines to give such a formal definition - it turns out that this doesn't make a difference at the resolution we care about. We make several important definitions and prove some important theorems in this chapter. We will define the main time complexity classes we use in this book, and also show the Time Hierarchy Theorem which states that given more resources (more time steps per input size) we can compute more functions.

To put this in more "mathy" language, in this chapter we define what it means for a function $F: \{0,1\}^* \to \{0,1\}^*$ to be computable in $T(n)$ steps, where $T$ is some function mapping the length $n$ of the input to the number of computation steps allowed. Using this definition we will do the following (see also Fig. 13.1):

• We define the class P of Boolean functions that can be computed in polynomial time and the class EXP of functions that can be computed in exponential time. Note that P $\subseteq$ EXP: if we can compute a function in polynomial time, we can certainly compute it in exponential time.

• We show that the times to compute a function using a Turing Machine and using a RAM machine (or NAND-RAM program) are polynomially related. In particular this means that the classes P and EXP are identical regardless of whether they are defined using Turing Machines or RAM machines / NAND-RAM programs.

• We give an efficient universal NAND-RAM program and use this to establish the time hierarchy theorem that in particular implies that P is a strict subset of EXP.

• We relate the notions defined here to the non uniform models of Boolean circuits and NAND-CIRC programs defined in Chapter 3. We define P$_{/poly}$ to be the class of functions that can be computed by a sequence of polynomial-sized circuits. We prove that P $\subseteq$ P$_{/poly}$ and that P$_{/poly}$ contains uncomputable functions.

13.1 FORMALLY DEFINING RUNNING TIME

Our models of computation such as Turing Machines, NAND-TM and NAND-RAM programs and others all operate by executing a sequence of instructions on an input one step at a time. We can define the running time of an algorithm $M$ in one of these models by measuring the number of steps $M$ takes on input $x$ as a function of the length $|x|$ of the input. We start by defining running time with respect to Turing Machines:

Definition 13.1 — Running time (Turing Machines). Let $T: \mathbb{N} \to \mathbb{N}$ be some function mapping natural numbers to natural numbers. We say that a function $F: \{0,1\}^* \to \{0,1\}$ is computable in $T(n)$ Turing-Machine time (TM-time for short) if there exists a Turing Machine $M$ such that for every sufficiently large $n$ and every $x \in \{0,1\}^n$, when given input $x$, the machine $M$ halts after executing at most $T(n)$ steps and outputs $F(x)$.

We define TIME$_{TM}(T(n))$ to be the set of Boolean functions (functions mapping $\{0,1\}^*$ to $\{0,1\}$) that are computable in $T(n)$ TM time.

 Big Idea 17 For a function $F: \{0,1\}^* \to \{0,1\}$ and $T: \mathbb{N} \to \mathbb{N}$, we can formally define what it means for $F$ to be computable in time at most $T(n)$ where $n$ is the size of the input.

P Definition 13.1 is not very complicated but is one of the most important definitions of this book. As usual,

TIME$_{TM}(T(n))$ is a class of functions, not of machines. If $M$ is a Turing Machine then a statement such as "$M$ is a member of TIME$_{TM}(n^2)$" does not make sense. The concept of TM-time as defined here is sometimes known as "single-tape Turing machine time" in the literature, since some texts consider Turing machines with more than one working tape.

The relaxation of considering only "sufficiently large" $n$'s is not very important but it is convenient since it allows us to avoid dealing explicitly with un-interesting "edge cases". We will mostly anyway be interested in determining running time only up to constant and even polynomial factors. While the notion of being computable within a certain running time can be defined for every function, the class TIME$_{TM}(T(n))$ is a class of Boolean functions that have a single bit of output. This choice is not very important, but is made for simplicity and convenience later on. In fact, every non-Boolean function has a computationally equivalent Boolean variant, see Exercise 13.3.

Solved Exercise 13.1 — Example of time bounds. Prove that TIME$_{TM}(10 \cdot n^3) \subseteq$ TIME$_{TM}(2^n)$. ■

Solution:
The proof is illustrated in Fig. 13.2. Suppose that $F \in$ TIME$_{TM}(10 \cdot n^3)$ and hence there is some number $N_0$ and a machine $M$ such that for every $n > N_0$ and $x \in \{0,1\}^n$, $M(x)$ outputs $F(x)$ within at most $10 \cdot n^3$ steps. Since $10 \cdot n^3 = o(2^n)$, there is some number $N_1$ such that for every $n > N_1$, $10 \cdot n^3 < 2^n$. Hence for every $n > \max\{N_0, N_1\}$, $M(x)$ will output $F(x)$ within at most $2^n$ steps, demonstrating that $F \in$ TIME$_{TM}(2^n)$. More generally, since for every large enough $n$, $T'(n) \geq T(n)$ where $T(n) = 10n^3$ and $T'(n) = 2^n$, we get TIME$_{TM}(T(n)) \subseteq$ TIME$_{TM}(T'(n))$. ■

Figure 13.2: Comparing $T(n) = 10n^3$ with $T'(n) = 2^n$ (on the right figure the Y axis is in log scale).

13.1.1 Polynomial and Exponential Time
Unlike the notion of computability, the exact running time can be a function of the model we use. However, it turns out that if we only care about "coarse enough" resolution (as will most often be the case) then the choice of the model, whether Turing Machines, RAM machines, NAND-TM/NAND-RAM programs, or C/Python programs, does not matter. This is known as the extended Church-Turing Thesis. Specifically we will mostly care about the difference between polynomial and exponential time. The two main time complexity classes we will be interested in are the following:

• Polynomial time: A function $F: \{0,1\}^* \to \{0,1\}$ is computable in polynomial time if it is in the class P $= \cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c)$. That is, $F \in$ P if there is an algorithm to compute $F$ that runs in time at most polynomial (i.e., at most $n^c$ for some constant $c$) in the length of the input.

• Exponential time: A function $F: \{0,1\}^* \to \{0,1\}$ is computable in exponential time if it is in the class EXP $= \cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(2^{n^c})$. That is, $F \in$ EXP if there is an algorithm to compute $F$ that runs in time at most exponential (i.e., at most $2^{n^c}$ for some constant $c$) in the length of the input.

In other words, these are defined as follows:

Definition 13.2 — P and EXP. Let $F: \{0,1\}^* \to \{0,1\}$. We say that $F \in$ P if there is a polynomial $p: \mathbb{N} \to \mathbb{R}$ and a Turing Machine $M$ such that for every $x \in \{0,1\}^*$, when given input $x$, the Turing machine halts within at most $p(|x|)$ steps and outputs $F(x)$.
We say that $F \in$ EXP if there is a polynomial $p: \mathbb{N} \to \mathbb{R}$ and a Turing Machine $M$ such that for every $x \in \{0,1\}^*$, when given input $x$, $M$ halts within at most $2^{p(|x|)}$ steps and outputs $F(x)$.

P Please take the time to make sure you understand these definitions. In particular, sometimes students think of the class EXP as corresponding to functions that are not in P. However, this is not the case. If $F$ is in EXP then it can be computed in exponential time. This does not mean that it cannot be computed in polynomial time as well.

Solved Exercise 13.2 — Different definitions of P. Prove that P as defined in Definition 13.2 is equal to $\cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c)$. ■

Solution:
To show these two sets are equal we need to show that P $\subseteq \cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c)$ and $\cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c) \subseteq$ P. We start with the former inclusion. Suppose that $F \in$ P. Then there is some polynomial $p: \mathbb{N} \to \mathbb{R}$ and a Turing machine $M$ such that $M$ computes $F$ and $M$ halts on every input $x$ within at most $p(|x|)$ steps. We can write the polynomial $p$ in the form $p(n) = \sum_{i=0}^{d} a_i n^i$ where $a_0,\ldots,a_d \in \mathbb{R}$, and we assume that $a_d$ is nonzero (or otherwise we just let $d$ correspond to the largest number such that $a_d$ is nonzero). The degree of $p$ is the number $d$. Since $n^d = o(n^{d+1})$, no matter what is the coefficient $a_d$, for large enough $n$, $p(n) < n^{d+1}$, which means that the Turing machine $M$ will halt on inputs of length $n$ within fewer than $n^{d+1}$ steps, and hence $F \in$ TIME$_{TM}(n^{d+1}) \subseteq \cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c)$.

For the second inclusion, suppose that $F \in \cup_{c\in\{1,2,3,\ldots\}}$ TIME$_{TM}(n^c)$. Then there is some positive $c \in \mathbb{N}$ such that $F \in$ TIME$_{TM}(n^c)$, which means that there is a Turing Machine $M$ and some number $N_0$ such that $M$ computes $F$ and for every $n > N_0$, $M$ halts on length $n$ inputs within at most $n^c$ steps. Let $T_0$ be the maximum number of steps that $M$ takes on inputs of length at most $N_0$. Then if we define the polynomial $p(n) = n^c + T_0$ then we see that $M$ halts on every input $x$ within at most $p(|x|)$ steps and hence the existence of $M$ demonstrates that $F \in$ P. ■

Since exponential time is much larger than polynomial time, P $\subseteq$ EXP. All of the problems we listed in Chapter 12 are in EXP, but as we've seen, for some of them there are much better algorithms that demonstrate that they are in fact in the smaller class P.

P              EXP (but not known to be in P)
Shortest path  Longest Path
Min cut        Max cut
2SAT           3SAT
Linear eqs     Quad. eqs
Zerosum        Nash
Determinant    Permanent
Primality      Factoring

Table: A table of the examples from Chapter 12. All these problems are in EXP but only the ones on the left column are currently known to be in P as well (i.e., they have a polynomial-time algorithm). See also Fig. 13.3.

R Remark 13.3 — Boolean versions of problems. Many of the problems defined in Chapter 12 correspond to non Boolean functions (functions with more than one bit of output) while P and EXP are sets of Boolean functions. However, for every non-Boolean function $F$ we can always define a computationally-equivalent Boolean function $G$ by letting $G(x,i)$ be the $i$-th bit of $F(x)$ (see Exercise 13.3). Hence the table above, as well as Fig. 13.3, refer to the computationally-equivalent Boolean variants of these problems.

Figure 13.3: Some examples of problems that are known to be in P and problems that are known to be in EXP but not known whether or not they are in P. Since both P and EXP are classes of Boolean functions, in this figure we always refer to the Boolean (i.e., Yes/No) variant of the problems.

13.2 MODELING RUNNING TIME USING RAM MACHINES / NAND-RAM

Turing Machines are a clean theoretical model of computation, but do not closely correspond to real-world computing architectures. The discrepancy between Turing Machines and actual computers does not matter much when we consider the question of which functions modeling running time 423

are computable, but can make a difference in the context of efficiency. Even a basic staple of undergraduate algorithms such as "merge sort" cannot be implemented on a Turing Machine in $O(n \log n)$ time (see Section 13.8). RAM machines (or equivalently, NAND-RAM programs) match more closely actual computing architecture and what we mean when we say $O(n)$ or $O(n \log n)$ algorithms in algorithms courses or whiteboard coding interviews. We can define running time with respect to NAND-RAM programs just as we did for Turing Machines.

Definition 13.4 — Running time (RAM). Let $T: \mathbb{N} \to \mathbb{N}$ be some function mapping natural numbers to natural numbers. We say that a function $F: \{0,1\}^* \to \{0,1\}$ is computable in $T(n)$ RAM time (RAM-time for short) if there exists a NAND-RAM program $P$ such that for every sufficiently large $n$ and every $x \in \{0,1\}^n$, when given input $x$, the program $P$ halts after executing at most $T(n)$ lines and outputs $F(x)$.

We define TIME$_{RAM}(T(n))$ to be the set of Boolean functions (functions mapping $\{0,1\}^*$ to $\{0,1\}$) that are computable in $T(n)$ RAM time.

Because NAND-RAM programs correspond more closely to our natural notions of running time, we will use NAND-RAM as our "default" model of running time, and hence use TIME$(T(n))$ (without any subscript) to denote TIME$_{RAM}(T(n))$. However, it turns out that as long as we only care about the difference between exponential and polynomial time, this does not make much difference. The reason is that Turing Machines can simulate NAND-RAM programs with at most a polynomial overhead (see also Fig. 13.4):

Theorem 13.5 — Relating RAM and Turing machines. Let $T: \mathbb{N} \to \mathbb{N}$ be a function such that $T(n) \geq n$ for every $n$ and the map $n \mapsto T(n)$ can be computed by a Turing machine in time $O(T(n))$. Then

TIME$_{TM}(T(n)) \subseteq$ TIME$_{RAM}(10 \cdot T(n)) \subseteq$ TIME$_{TM}(T(n)^4)$.   (13.1)

P The technical details of Theorem 13.5, such as the condition that $n \mapsto T(n)$ is computable in $O(T(n))$ time or the constants $10$ and $4$ in (13.1) (which are not tight and can be improved), are not very important. In particular, all non pathological time bound functions we encounter in practice such as $T(n) = n$, $T(n) = n \log n$, $T(n) = 2^n$, etc. will satisfy the conditions of Theorem 13.5, see also Remark 13.6.
The main message of the theorem is that Turing Machines and RAM machines are "roughly equivalent" in the sense that one can simulate the other with polynomial overhead. Similarly, while the proof involves some technical details, it's not very deep or hard, and merely follows the simulation of RAM machines with Turing Machines we saw in Theorem 8.1 with more careful "book keeping".

For example, by instantiating Theorem 13.5 with $T(n) = n^a$ and using the fact that $10n^a = o(n^{a+1})$, we see that TIME$_{TM}(n^a) \subseteq$ TIME$_{RAM}(n^{a+1}) \subseteq$ TIME$_{TM}(n^{4a+4})$, which means that (by Solved Exercise 13.2)

P $= \cup_{a=1,2,\ldots}$ TIME$_{TM}(n^a) = \cup_{a=1,2,\ldots}$ TIME$_{RAM}(n^a)$.   (13.2)

That is, we could have equally well defined P as the class of functions computable by NAND-RAM programs (instead of Turing Machines) that run in time polynomial in the length of the input. Similarly, by instantiating Theorem 13.5 with $T(n) = 2^{n^a}$ we see that the class EXP can also be defined as the set of functions computable by NAND-RAM programs in time at most $2^{p(n)}$ where $p$ is some polynomial. Similar equivalence results are known for many models including cellular automata, C/Python/Javascript programs, parallel computers, and a great many other models, which justifies the choice of P as capturing a technology-independent notion of tractability. (See Section 13.3 for more discussion of this issue.)

Figure 13.4: The proof of Theorem 13.5 shows that we can simulate $T$ steps of a Turing Machine with $10T$ steps of a NAND-RAM program, and can simulate $T$ steps of a NAND-RAM program with $o(T^4)$ steps of a Turing Machine. Hence TIME$_{TM}(T(n)) \subseteq$ TIME$_{RAM}(10 \cdot T(n)) \subseteq$ TIME$_{TM}(T(n)^4)$.

This equivalence between Turing machines and NAND-RAM (as well as other models) allows us to pick our favorite model depending on the task at hand (i.e., "have our cake and eat it too") even when we study questions of efficiency, as long as we only care about the gap between polynomial and exponential time. When we want to design an algorithm, we can use the extra power and convenience afforded by NAND-RAM. When we want to analyze a program or prove a negative result, we can restrict our attention to Turing machines.

 Big Idea 18 All "reasonable" computational models are equivalent if we only care about the distinction between polynomial and exponential.

The adjective "reasonable" above refers to all scalable computational models that have been implemented, with the possible exception of quantum computers, see Section 13.3 and Chapter 23.

Proof Idea:

The direction TIME$_{TM}(T(n)) \subseteq$ TIME$_{RAM}(10 \cdot T(n))$ is not hard to show, since a NAND-RAM program can simulate a Turing Machine $M$ with constant overhead by storing the transition table of $M$ in an array (as is done in the proof of Theorem 9.1). Simulating every step of the Turing machine can be done in a constant number $c$ of steps of RAM, and it can be shown this constant $c$ is smaller than $10$. Thus the heart of the theorem is to prove that TIME$_{RAM}(T(n)) \subseteq$ TIME$_{TM}(T(n)^4)$. This proof closely follows the proof of Theorem 8.1, where we have shown that every function $F$ that is computable by a NAND-RAM program $P$ is computable by a Turing Machine (or equivalently a NAND-TM program) $M$. To prove Theorem 13.5, we follow the exact same proof but just check that the overhead of the simulation of $P$ by $M$ is polynomial. The proof has many details, but is not deep. It is therefore much more important that you understand the statement of this theorem than its proof.

⋆ Proof of Theorem 13.5. We only focus on the non-trivial direction

TIME$_{RAM}(T(n)) \subseteq$ TIME$_{TM}(T(n)^4)$. Let $F \in$ TIME$_{RAM}(T(n))$. $F$ can be computed in time $T(n)$ by some NAND-RAM program $P$ and we need to show that it can also be computed in time $T(n)^4$ by a Turing Machine $M$. This will follow from showing that $F$ can be computed in time $T(n)^4$ by a NAND-TM program, since for every NAND-TM program $Q$ there is a Turing Machine $M$ simulating it such that each iteration of $Q$ corresponds to a single step of $M$.
As mentioned above, we follow the proof of Theorem 8.1 (simulation of NAND-RAM programs using NAND-TM programs) and use the exact same simulation, but with a more careful accounting of the number of steps that the simulation costs. Recall, that the simulation of NAND-RAM works by "peeling off" features of NAND-RAM one by one, until we are left with NAND-TM. We will not provide the full details but will present the main ideas used in showing that every feature of NAND-RAM can be simulated by NAND-TM with at most a polynomial overhead:

1. Recall that every NAND-RAM variable or array element can contain an integer between $0$ and $T$ where $T$ is the number of lines that have been executed so far. Therefore if $P$ is a NAND-RAM program that computes $F$ in $T(n)$ time, then on inputs of length $n$, all integers used by $P$ are of magnitude at most $T(n)$. This means that the largest value i can ever reach is at most $T(n)$ and so each one of $P$'s variables can be thought of as an array of at most $T(n)$ indices, each of which holds a natural number of magnitude at most $T(n)$. We let $\ell = \lceil \log T(n) \rceil$ be the number of bits needed to encode such numbers. (We can start off the simulation by computing $T(n)$ and $\ell$.)

2. We can encode a NAND-RAM array of length $\leq T(n)$ containing numbers in $\{0,\ldots,T(n)-1\}$ as a Boolean (i.e., NAND-TM) array of $T(n)\ell = O(T(n) \log T(n))$ bits, which we can also think of as a two dimensional array as we did in the proof of Theorem 8.1. We encode a NAND-RAM scalar containing a number in $\{0,\ldots,T(n)-1\}$ simply by a shorter NAND-TM array of $\ell$ bits.

3. We can simulate the two dimensional arrays using one-dimensional arrays of length $T(n)\ell = O(T(n) \log T(n))$. All the arithmetic operations on integers use the grade-school algorithms, that take time that is polynomial in the number $\ell$ of bits of the integers, which is $poly(\log T(n))$ in our case. Hence we can simulate $T(n)$ steps of NAND-RAM with $O(T(n)poly(\log T(n)))$ steps of a model that uses random access memory but only Boolean-valued one-dimensional arrays.

4. The most expensive step is to translate from random access memory to the sequential memory model of NAND-TM/Turing Machines. As we did in the proof of Theorem 8.1 (see Section 8.2), we can simulate accessing an array Foo at some location encoded in an array Bar by:

a. Copying Bar to some temporary array Temp

b. Having an array Index which is initially all zeros except $1$ at the first location.

c. Repeating the following until Temp encodes the number $0$: (Number of repetitions is at most $T(n)$.)

• Decrease the number encoded in Temp by $1$. (Takes a number of steps polynomial in $\ell = \lceil \log T(n) \rceil$.)

• Decrease i until it is equal to $0$. (Takes $O(T(n))$ steps.)

• Scan Index until we reach the point in which it equals $1$ and then change this to $0$ and go one step further and write $1$ in this location. (Takes $O(T(n))$ steps.)

d. When we are done we know that if we scan Index until we reach the point in which Index[i]$=1$ then i contains the value that was encoded by Bar. (Takes $O(T(n))$ steps.)

The total cost for each such operation is $O(T(n)^2 + T(n)poly(\log T(n))) = O(T(n)^2)$ steps.

In sum, we simulate a single step of NAND-RAM using $O(T(n)^2 poly(\log T(n)))$ steps of NAND-TM, and hence the total simulation time is $O(T(n)^3 poly(\log T(n)))$ which is smaller than $T(n)^4$ for sufficiently large $n$. ■
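The expensive translation in step 4 can be illustrated with a toy Python sketch (entirely our own, not part of the formal proof): when a machine can only move sequentially one cell at a time, reading a single memory cell costs on the order of the memory size rather than $O(1)$, which is where the quadratic factor comes from.

def sequential_read(memory, address):
    """Simulate the random-access read memory[address] using only
    sequential one-step-at-a-time movement, as a tape-based machine must.
    Each simulated access costs O(len(memory)) steps instead of O(1)."""
    steps = 0
    countdown = address      # plays the role of the Temp counter
    position = 0
    while True:
        steps += 1           # one sequential step of the scan
        if countdown == 0:
            return memory[position], steps
        countdown -= 1       # decrementing the counter (cheap: polylog bits)
        position += 1        # move the head one cell to the right

memory = list(range(100, 200))
print(sequential_read(memory, 42))  # (142, 43): ~address-many steps per read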

R Remark 13.6 — Nice time bounds. When considering general time bounds such as we do here, we need to make sure to rule out some "pathological" cases such as functions $T$ that don't give enough time for the algorithm to read the input, or functions where the time bound itself is uncomputable. We say that a function $T: \mathbb{N} \to \mathbb{N}$ is a nice time bound function (or nice function for short) if for every $n \in \mathbb{N}$, $T(n) \geq n$ (i.e., $T$ allows enough time to read the input), for every $n' \geq n$, $T(n') \geq T(n)$ (i.e., $T$ allows more time on longer inputs), and the map $F(x) = 1^{T(|x|)}$ (i.e., mapping a string of length $n$ to a sequence of $T(n)$ ones) can be computed by a NAND-RAM program in $O(T(n))$ time.
All the "normal" time complexity bounds we encounter in applications such as $T(n) = 100n$, $T(n) = n^2 \log n$, $T(n) = 2^{\sqrt{n}}$, etc. are "nice". Hence from now on we will only care about the class TIME$(T(n))$ when $T$ is a "nice" function. The computability condition is in particular typically easily satisfied. For example, for arithmetic functions such as $T(n) = n^3$, we can typically compute the binary representation of $T(n)$ in time polynomial in the number of bits of $T(n)$ and hence poly-logarithmic in $T(n)$. Hence the time to write the string $1^{T(n)}$ in such cases will be $T(n) + poly(\log T(n)) = O(T(n))$.

13.3 EXTENDED CHURCH-TURING THESIS (DISCUSSION)

Theorem 13.5 shows that the computational models of Turing Machines and RAM Machines / NAND-RAM programs are equivalent up to polynomial factors in the running time. Other examples of polynomially equivalent models include:

• All standard programming languages, including C/Python/JavaScript/Lisp/etc.

• The λ calculus (see also Section 13.8).

• Cellular automata

• Parallel computers

• Biological computing devices such as DNA-based computers.

The Extended Church Turing Thesis is the statement that this is true for all physically realizable computing models. In other words, the extended Church Turing thesis says that for every scalable computing device $C$ (which has a finite description but can be in principle used to run computation on arbitrarily large inputs), there is some constant $a$ such that for every function $F: \{0,1\}^* \to \{0,1\}$ that $C$ can compute on length $n$ inputs using an amount $S(n)$ of physical resources, $F$ is in TIME$(S(n)^a)$. This is a strengthening of the ("plain") Church-Turing Thesis, discussed in Section 8.8, which states that the set of computable functions is the same for all physically realizable models, but without requiring the overhead in the simulation between different models to be at most polynomial.
All the current constructions of scalable computational models and programming languages conform to the Extended Church-Turing Thesis, in the sense that they can be simulated with polynomial overhead by Turing Machines (and hence also by NAND-TM or NAND-RAM programs). Consequently, the classes P and EXP are robust to the choice of model, and we can use the programming language of our choice, or high level descriptions of an algorithm, to determine whether or not a problem is in P.
Like the Church-Turing thesis itself, the extended Church-Turing thesis is in the asymptotic setting and does not directly yield an experimentally testable prediction. However, it can be instantiated with more concrete bounds on the overhead, yielding experimentally-testable predictions such as the Physical Extended Church-Turing Thesis we mentioned in Section 5.6.
In the last hundred+ years of studying and mechanizing computation, no one has yet constructed a scalable computing device that violates the extended Church Turing Thesis. However, quantum computing, if realized, will pose a serious challenge to the extended Church-Turing Thesis (see Chapter 23). However, even if the promises of quantum computing are fully realized, the extended Church-Turing thesis is "morally" correct, in the sense that, while we do need to adapt the thesis to account for the possibility of quantum computing, its broad outline remains unchanged. We are still able to model computation mathematically, we can still treat programs as strings and have a universal program, we still have time hierarchy and uncomputability results, and there is still no reason to doubt the ("plain") Church-Turing thesis. Moreover, the prospect of quantum computing does not seem to make a difference for the time complexity of many (though not all!) of the concrete problems that we care about. In particular, as far as we know, out of all the example problems mentioned in Chapter 12 the complexity of only one (integer factoring) is affected by modifying our model to include quantum computers as well.

13.4 EFFICIENT UNIVERSAL MACHINE: A NAND-RAM INTERPRETER IN NAND-RAM

We have seen in Theorem 9.1 the "universal Turing Machine". Examining that proof, and combining it with Theorem 13.5, we can see that the program $U$ has a polynomial overhead, in the sense that it can simulate $T$ steps of a given NAND-TM (or NAND-RAM) program $P$ on an input $x$ in $O(T^4)$ steps. But in fact, by directly simulating NAND-RAM programs we can do better with only a constant multiplicative overhead. That is, there is a universal NAND-RAM program $U$ such that for every NAND-RAM program $P$, $U$ simulates $T$ steps of $P$ using only $O(T)$ steps. (The implicit constant in the $O$ notation can depend on the program $P$ but does not depend on the length of the input.)

Theorem 13.7 — Efficient universality of NAND-RAM. There exists a NAND-RAM program $U$ satisfying the following:

1. ($U$ is a universal NAND-RAM program.) For every NAND-RAM program $P$ and input $x$, $U(P,x) = P(x)$, where by $U(P,x)$ we denote the output of $U$ on a string encoding the pair $(P,x)$.

2. ($U$ is efficient.) There are some constants $a, b$ such that for every NAND-RAM program $P$, if $P$ halts on input $x$ after at most $T$ steps, then $U(P,x)$ halts after at most $C \cdot T$ steps where $C \leq a|P|^b$.

P As in the case of Theorem 13.5, the proof of Theorem 13.7 is not very deep and so it is more important to understand its statement. Specifically, if you understand how you would go about writing an interpreter for NAND-RAM using a modern programming language such as Python, then you know everything you need to know about the proof of this theorem.

Proof of Theorem 13.7. To present a universal NAND-RAM program in full we would need to describe a precise representation scheme, as well as the full NAND-RAM instructions for the program. While this can be done, it is more important to focus on the main ideas, and so we just sketch the proof here. A specification of NAND-RAM is given in the appendix, and for the purposes of this simulation, we can simply use the representation of the code of a NAND-RAM program as an ASCII string.

Figure 13.5: The universal NAND-RAM program $U$ simulates an input NAND-RAM program $P$ by storing all of $P$'s variables inside a single array Vars of $U$. If $P$ has $t$ variables, then the array Vars is divided into blocks of length $t$, where the $j$-th coordinate of the $i$-th block contains the $i$-th element of the $j$-th array of $P$. If the $j$-th variable of $P$ is scalar, then we just store its value in the zeroth block of Vars.

The program $U$ gets as input a NAND-RAM program $P$ and an input $x$ and simulates $P$ one step at a time. To do so, $U$ does the following:

1. $U$ maintains variables program_counter, and number_steps for the current line to be executed and the number of steps executed so far.

2. $U$ initially scans the code of $P$ to find the number $t$ of unique variable names that $P$ uses. It will translate each variable name into a number between $0$ and $t-1$ and use an array Program to store $P$'s code, where for every line $\ell$, Program[$\ell$] will store the $\ell$-th line of $P$ where the variable names have been translated to numbers. (More concretely, we will use a constant number of arrays to separately encode the operation used in this line, and the variable names and indices of the operands.)

3. $U$ maintains a single array Vars that contains all the values of $P$'s variables. We divide Vars into blocks of length $t$. If $s$ is a number corresponding to an array variable Foo of $P$, then we store Foo[0] in Vars[$s$], we store Foo[1] in Vars[$t+s$], Foo[2] in Vars[$2t+s$] and so on and so forth (see Fig. 13.5 and the toy sketch after this list). Generally, if the $s$-th variable of $P$ is a scalar variable, then its value will be stored in location Vars[$s$]. If it is an array variable then the value of its $i$-th element will be stored in location Vars[$t \cdot i + s$].

4. To simulate a single step of $P$, the program $U$ recovers from Program the line corresponding to program_counter and executes it. Since NAND-RAM has a constant number of arithmetic operations, we can implement the logic of which operation to execute using a sequence of a constant number of if-then-else's. Retrieving from Vars the values of the operands of each instruction can be done using a constant number of arithmetic operations.

The setup stages take only a constant (depending on $|P|$ but not on the input $x$) number of steps. Once we are done with the setup, to simulate a single step of $P$, we just need to retrieve the corresponding line and do a constant number of "if elses" and accesses to Vars to simulate it. Hence the total running time to simulate $T$ steps of the program $P$ is at most $O(T)$ when suppressing constants that depend on the program $P$. ■

13.4.1 Timed Universal Turing Machine
One corollary of the efficient universal machine is the following. Given any Turing Machine $M$, input $x$, and "step budget" $T$, we can simulate the execution of $M$ for $T$ steps in time that is polynomial in

$T$. Formally, we define a function TIMEDEVAL that takes the three parameters $M$, $x$, and the time budget $T$, and outputs $M(x)$ if $M$ halts within at most $T$ steps, and outputs $0$ otherwise. The timed universal Turing Machine computes TIMEDEVAL in polynomial time (see Fig. 13.6). (Since we measure time as a function of the input length, we define TIMEDEVAL as taking the input $T$ represented in unary: a string of $T$ ones.)

Theorem 13.8 — Timed Universal Turing Machine. Let TIMEDEVAL$: \{0,1\}^* \to \{0,1\}^*$ be the function defined as

TIMEDEVAL$(M,x,1^T) = \begin{cases} M(x) & M \text{ halts within} \leq T \text{ steps on } x \\ 0 & \text{otherwise} \end{cases}$   (13.3)

Then TIMEDEVAL $\in$ P.

Figure 13.6: The timed universal Turing Machine takes as input a Turing machine $M$, an input $x$, and a time bound $T$, and outputs $M(x)$ if $M$ halts within at most $T$ steps. Theorem 13.8 states that there is such a machine that runs in time polynomial in $T$.

Proof. We only sketch the proof since the result follows fairly directly from Theorem 13.5 and Theorem 13.7. By Theorem 13.5 to show that TIMEDEVAL $\in$ P, it suffices to give a polynomial-time NAND-RAM program to compute TIMEDEVAL.
Such a program can be obtained as follows. Given a Turing Machine $M$, by Theorem 13.5 we can transform it in time polynomial in its description into a functionally-equivalent NAND-RAM program $P$ such that the execution of $M$ on $T$ steps can be simulated by the execution of $P$ on $c \cdot T$ steps. We can then run the universal NAND-RAM machine of Theorem 13.7 to simulate $P$ for $c \cdot T$ steps, using $O(T)$ time, and output $0$ if the execution did not halt within this budget. This shows that TIMEDEVAL can be computed by a NAND-RAM program in time polynomial in $|M|$ and linear in $T$, which means TIMEDEVAL $\in$ P. ■
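To illustrate the statement (not the proof), here is a hedged Python sketch of a TIMEDEVAL-style simulator: it runs a toy Turing-machine transition table for at most $T$ steps and outputs $0$ if the budget runs out. The encoding of machines as Python dictionaries is our own simplification.

def timed_eval(delta, x, T):
    """Toy version of TIMEDEVAL: run a Turing machine for at most T steps.
    The machine is a dict delta mapping (state, symbol) to
    (new_state, new_symbol, head_move); missing entries mean "halt".
    Returns the machine's output if it halts within T steps, else 0."""
    tape = dict(enumerate(x))
    state, head = "start", 0
    for _ in range(T):
        symbol = tape.get(head, "_")       # "_" marks a blank cell
        if (state, symbol) not in delta:
            return tape.get(0, "_")        # halted: read output at cell 0
        state, new_symbol, move = delta[(state, symbol)]
        tape[head] = new_symbol
        head += move
    return 0  # budget exhausted: reported as non-halting

# A machine that flips the first bit and halts.
delta = {("start", "0"): ("done", "1", 0), ("start", "1"): ("done", "0", 0)}
print(timed_eval(delta, "011", 10))  # "1"
print(timed_eval(delta, "011", 0))   # 0: no budget left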

13.5 THE TIME HIERARCHY THEOREM

Some functions are uncomputable, but are there functions that can be computed, but only at an exorbitant cost? For example, is there a function that can be computed in time $2^n$, but can not be computed in time $2^{0.9n}$? It turns out that the answer is Yes:

Theorem 13.9 — Time Hierarchy Theorem. For every nice function $T: \mathbb{N} \to \mathbb{N}$, there is a function $F: \{0,1\}^* \to \{0,1\}$ in TIME$(T(n) \log n) \setminus$ TIME$(T(n))$.

There is nothing special about $\log n$, and we could have used any other efficiently computable function that tends to infinity with $n$.

 Big Idea 19 If we have more time, we can compute more functions.

R Remark 13.10 — Simpler corollary of the time hierarchy theorem. The generality of the time hierarchy theorem can make its proof a little hard to read. It might be easier to follow the proof if you first try to prove by yourself the easier statement P $\subsetneq$ EXP. You can do so by showing that the following function $F: \{0,1\}^* \to \{0,1\}$ is in EXP $\setminus$ P: for every Turing Machine $M$ and input $x$, $F(M,x) = 1$ if and only if $M$ halts on $x$ within at most $|x|^{\log |x|}$ steps. One can show that $F \in$ TIME$(n^{O(\log n)}) \subseteq$ EXP using the universal Turing machine (or the efficient universal NAND-RAM program of Theorem 13.7). On the other hand, we can use similar ideas to those used to show the uncomputability of HALT in Section 9.3.2 to prove that $F \notin$ P.

Figure 13.7: The Time Hierarchy Theorem (Theorem 13.9) states that all of these classes are distinct.

Proof Idea:
In the proof of Theorem 9.6 (the uncomputability of the Halting problem), we have shown that the function HALT cannot be computed in any finite time. An examination of the proof shows that it gives something stronger. Namely, the proof shows that if we fix our computational budget to be $T$ steps, then not only we can't distinguish between programs that halt and those that do not, but cannot even distinguish between programs that halt within at most $T'$ steps and those that take more than that (where $T'$ is some number depending on $T$). Therefore, the proof of Theorem 13.9 follows the ideas of the uncomputability of the halting problem, but again with a more careful accounting of the running time.

Proof of Theorem 13.9. Our proof is inspired by the proof of the uncomputability of the halting problem. Specifically, for every function $T$ as in the theorem's statement, we define the Bounded Halting function HALT$_T$ as follows. The input to HALT$_T$ is a pair $(P,x)$ such that $|P| \leq \log\log|x|$ and $P$ encodes some NAND-RAM program. We define

HALT$_T(P,x) = \begin{cases} 1, & P \text{ halts on } x \text{ within} \leq 100 \cdot T(|P|+|x|) \text{ steps} \\ 0, & \text{otherwise} \end{cases}$   (13.4)

(The constant $100$ and the function $\log\log n$ are rather arbitrary, and are chosen for convenience in this proof.)
Theorem 13.9 is an immediate consequence of the following two claims:

Claim 1: HALT$_T \in$ TIME$(T(n) \cdot \log n)$, and

Claim 2: HALT$_T \notin$ TIME$(T(n))$.

Please make sure you understand why indeed the theorem follows directly from the combination of these two claims. We now turn to proving them.
Proof of claim 1: We can easily check in linear time whether an input has the form $P,x$ where $|P| \leq \log\log|x|$. Since $T(\cdot)$ is a nice function, we can evaluate it in $O(T(n))$ time. Thus, we can compute HALT$_T(P,x)$ as follows:

1. Compute $T_0 = T(|P|+|x|)$ in $O(T_0)$ steps.

2. Use the universal NAND-RAM program of Theorem 13.7 to simulate $100 \cdot T_0$ steps of $P$ on the input $x$ using at most $poly(|P|)T_0$ steps. (Recall that we use $poly(\ell)$ to denote a quantity that is bounded by $a\ell^b$ for some constants $a,b$.)

3. If $P$ halts within these $100 \cdot T_0$ steps then output $1$, else output $0$.

The length of the input is $n = |P| + |x|$. Since $|x| \leq n$ and $(\log\log|x|)^b = o(\log|x|)$ for every $b$, the running time will be $o(T(|P|+|x|) \cdot \log n)$ and hence the above algorithm demonstrates that HALT$_T \in$ TIME$(T(n) \cdot \log n)$, completing the proof of Claim 1.
Proof of claim 2: This proof is the heart of Theorem 13.9, and is very reminiscent of the proof that HALT is not computable. Assume, for the sake of contradiction, that there is some NAND-RAM program $P^*$ that computes HALT$_T(P,x)$ within $T(|P|+|x|)$ steps. We are going to show a contradiction by creating a program $Q$ and showing that under our assumptions, if $Q$ runs for less than $T(n)$ steps when given (a padded version of) its own code as input then it actually runs for more than $T(n)$ steps and vice versa. (It is worth re-reading the last sentence twice or thrice to make sure you understand this logic. It is very similar to the direct proof of the uncomputability of the halting problem where we obtained a contradiction by using an assumed "halting solver" to construct a program that, given its own code as input, halts if and only if it does not halt.)
We will define $Q$ to be the program that on input a string $z$ does the following:

1. If $z$ does not have the form $z = P1^m$ where $P$ represents a NAND-RAM program and $|P| < 0.1 \log\log m$ then return $0$. (Recall that $1^m$ denotes the string of $m$ ones.)

2. Compute $b = P^*(P, z)$ (at a cost of at most $T(|P| + |z|)$ steps, under our assumptions).

3. If $b = 1$ then $Q$ goes into an infinite loop, otherwise it halts.

Let $\ell$ be the length of the description of $Q$ as a string, and let $m$ be larger than $2^{2^{1000\ell}}$. We will reach a contradiction by splitting into cases according to whether or not HALT$_T(Q, Q1^m)$ equals $0$ or $1$.
On the one hand, if HALT$_T(Q, Q1^m) = 1$, then under our assumption that $P^*$ computes HALT$_T$, $Q$ will go into an infinite loop on input $z = Q1^m$, and hence in particular $Q$ does not halt within $100 \cdot T(|Q| + m)$ steps on the input $z$. But this contradicts our assumption that HALT$_T(Q, Q1^m) = 1$.
This means that it must hold that HALT$_T(Q, Q1^m) = 0$. But in this case, since we assume $P^*$ computes HALT$_T$, $Q$ does not do anything in phase 3 of its computation, and so the only computation costs come in phases 1 and 2 of the computation. It is not hard to verify that Phase 1 can be done in linear and in fact less than $5|z|$ steps. Phase 2 involves executing $P^*$, which under our assumption requires at most $T(|Q| + |z|)$ steps. In total we can perform both phases in less than $10 \cdot T(|Q| + |z|)$ steps, which by definition means that HALT$_T(Q, Q1^m) = 1$, but this is of course a contradiction. This completes the proof of Claim 2 and hence of Theorem 13.9. ■
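The logic of the diagonalizing program $Q$ can be mirrored in a hedged Python sketch; here P_star stands for the assumed (and, as the proof shows, impossible) HALT$_T$-solver, and the padding format is simplified relative to the proof.

import math

def make_Q(P_star):
    """Sketch of the diagonalizing program Q from the proof of Claim 2.
    P_star is a hypothetical stand-in for the assumed T(n)-time solver
    for HALT_T; the parsing of z = P + "1"*m is simplified (real
    encodings would be unambiguous, e.g. prefix-free)."""
    def Q(z):
        # Phase 1: check that z = P + "1"*m with |P| < 0.1*log(log(m)).
        stripped = z.rstrip("1")      # simplified: assumes P does not end in "1"
        P, m = stripped, len(z) - len(stripped)
        if not P or m < 16 or len(P) >= 0.1 * math.log2(math.log2(m)):
            return 0
        # Phase 2: ask the assumed bounded-halting solver about (P, z).
        b = P_star(P, z)              # costs at most T(|P| + |z|) by assumption
        # Phase 3: do the opposite of what the solver predicts.
        if b == 1:
            while True:               # predicted to halt quickly: loop forever
                pass
        return 0                      # predicted not to halt quickly: halt now
    return Q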

Solved Exercise 13.3 — P vs EXP. Prove that P $\subsetneq$ EXP. ■

Solution:

We show why this statement follows from the time hierarchy theorem, but it can be an instructive exercise to prove it directly, see Remark 13.10. We need to show that there exists $F \in$ EXP $\setminus$ P. Let $T(n) = 2^{n/2}$; this is a nice function. By Theorem 13.9 there exists some $F$ in TIME$(T(n) \log n) \setminus$ TIME$(T(n))$. Since for sufficiently large $n$, $2^{n/2} \log n < 2^n$, $F \in$ TIME$(2^n) \subseteq$ EXP. On the other hand, $F \notin$ P. Indeed, suppose otherwise that there was a constant $c > 0$ and a Turing Machine computing $F$ on $n$-length inputs in at most $n^c$ steps for all sufficiently large $n$. Then since for large enough $n$, $n^c < 2^{n/2}$, it would have followed that $F \in$ TIME$(T(n))$, contradicting our choice of $F$. ■

The time hierarchy theorem tells us that there are functions we can compute in $O(n^2)$ time but not $O(n)$, in $2^n$ time, but not $2^{\sqrt{n}}$, etc. In particular there are most definitely functions that we can compute in time $2^n$ but not $O(n)$. We have seen that we have no shortage of natural functions for which the best known algorithm requires roughly $2^n$ time, and that many people have invested significant effort in trying to improve that. However, unlike in the finite vs. infinite case, for all of the examples above at the moment we do not know how to rule out even an $O(n)$ time algorithm. We will however see that there is a single unproven conjecture that would imply such a result for most of these problems.
The time hierarchy theorem relies on the existence of an efficient universal NAND-RAM program, as proven in Theorem 13.7. For other models such as Turing Machines we have similar time hierarchy results showing that there are functions computable in time $T(n)$ and not in time $T(n)/f(n)$ where $f(n)$ corresponds to the overhead in the corresponding universal machine.

13.6 NON UNIFORM COMPUTATION

Figure 13.8: Some complexity classes and some of the functions we know (or conjecture) to be contained in them.

We have now seen two measures of "computation cost" for functions. In Section 4.6 we defined the complexity of computing finite functions using circuits / straightline programs. Specifically, for a finite function $g: \{0,1\}^n \to \{0,1\}$ and number $T \in \mathbb{N}$, $g \in$ SIZE$(T)$ if there is a circuit of at most $T$ NAND gates (or equivalently a $T$-line NAND-CIRC program) that computes $g$. To relate this to the classes TIME$(T(n))$ defined in this chapter we first need to extend the class SIZE$(T(n))$ from finite functions to functions with unbounded input length.

Definition 13.11 — Non uniform computation. Let $F: \{0,1\}^* \to \{0,1\}$ and $T: \mathbb{N} \to \mathbb{N}$ be a nice time bound. For every $n \in \mathbb{N}$, define

to be the restriction of to inputs of size . That is, is푛 the function mapping to such that for ↾푛 퐹every∶ {0, 1} →, {0, 1} . 푛 퐹 푛 ↾푛 We say퐹 that 푛 is non-uniformly computable{0, 1} in{0, at most 1} size, de- ↾푛 noted푥 by ∈ {0, 1}SIZE퐹 (푥) = 퐹if (푥) there exists a sequence of NAND circuits퐹 such that: 푇 (푛) 0 1 2 퐹 ∈ (푇 (푛)) (퐶 , 퐶 , 퐶 , …) • For every , computes the function

• For every sufficiently large $n$, $C_n$ has at most $T(n)$ gates.

The non uniform analog to the class $P$ is the class $P_{/poly}$ defined as

$$P_{/poly} = \cup_{c \in \mathbb{N}} SIZE(n^c). \qquad (13.5)$$

There is a big difference between non uniform computation and uniform complexity classes such as $TIME(T(n))$ or $P$. The condition $F \in P$ means that there is a single Turing machine $M$ that computes $F$

on all inputs in polynomial time. The condition $F \in P_{/poly}$ only means that for every input length $n$ there can be a different circuit $C_n$ that computes $F$ using polynomially many gates on inputs of these lengths. As we will see, $F \in P_{/poly}$ does not necessarily imply that $F \in P$. However, the other direction is true:

Theorem 13.12 — Nonuniform computation contains uniform computation. There is some $a \in \mathbb{N}$ s.t. for every nice $T : \mathbb{N} \rightarrow \mathbb{N}$ and $F : \{0,1\}^* \rightarrow \{0,1\}$,

$$TIME(T(n)) \subseteq SIZE(T(n)^a). \qquad (13.6)$$

In particular, Theorem 13.12 shows that for every $c$, $TIME(n^c) \subseteq SIZE(n^{ca})$ and hence $P \subseteq P_{/poly}$.

Figure 13.9: We can think of an infinite function $F : \{0,1\}^* \rightarrow \{0,1\}$ as a collection of finite functions $F_0, F_1, F_2, \ldots$ where $F_{\upharpoonright n} : \{0,1\}^n \rightarrow \{0,1\}$ is the restriction of $F$ to inputs of length $n$. We say $F$ is in $P_{/poly}$ if for every $n$, the function $F_{\upharpoonright n}$ is computable by a polynomial size NAND-CIRC program, or equivalently, a polynomial sized Boolean circuit.

Proof Idea:
The idea behind the proof is to "unroll the loop". Specifically, we will use the programming language variants of non-uniform and uniform computation: namely NAND-CIRC and NAND-TM. The main difference between the two is that NAND-TM has loops. However, for every fixed $n$, if we know that a NAND-TM program runs in at most $T(n)$ steps, then we can replace its loop by simply "copying and pasting" its code $T(n)$ times, similar to how in Python we can replace code such as

for i in range(4): print(i)

with the "loop free" code

print(0)
print(1)
print(2)
print(3)
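This "copying and pasting" is purely mechanical, and can itself be done by a program. The following Python snippet is a minimal sketch of that idea (the unroll helper and its template convention are ours, purely for illustration):

def unroll(body_template, T):
    # Replace "for i in range(T): <body>" by T copies of the body,
    # substituting the loop variable i with the constants 0, 1, ..., T-1.
    return "\n".join(body_template.format(i=j) for j in range(T))

print(unroll("print({i})", 4))   # prints the four-line loop-free program above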

To make this idea into an actual proof we need to tackle one technical difficulty, and this is to ensure that the NAND-TM program is oblivious in the sense that the value of the index variable i in the $j$-th iteration of the loop will depend only on $j$ and not on the contents of the input. We make a digression to do just that in Section 13.6.1 and then complete the proof of Theorem 13.12.

13.6.1 Oblivious NAND-TM programs

Our approach for proving Theorem 13.12 involves "unrolling the loop". For example, consider the following NAND-TM program to compute the XOR function on inputs of arbitrary length:

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

Setting (as an example) $n = 3$, we can attempt to translate this NAND-TM program into a NAND-CIRC program for computing $XOR_3 : \{0,1\}^3 \rightarrow \{0,1\}$ by simply "copying and pasting" the loop three times (dropping the MODANDJMP line):

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)

temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)

However, the above is still not a valid NAND-CIRC program since it contains references to the special variable i. To make it into a valid NAND-CIRC program, we replace references to i in the first iteration with $0$, references in the second iteration with $1$, and references in the third iteration with $2$. (We also create a variable zero and use it for the first time any variable is instantiated, as well as remove assignments to non-output variables that are never used later on.) The resulting program is a standard "loop free and index free" NAND-CIRC program that computes $XOR_3$ (see also Fig. 13.10):

temp_0 = NAND(X[0],X[0])
one = NAND(X[0],temp_0)
zero = NAND(one,one)
temp_2 = NAND(X[0],zero)
temp_3 = NAND(X[0],temp_2)
temp_4 = NAND(zero,temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[1],Y[0])
temp_3 = NAND(X[1],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[2],Y[0])
temp_3 = NAND(X[2],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
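One can verify directly that the unrolled program indeed computes $XOR_3$. The following Python sketch (our own transcription, using 1 - a*b for the NAND operation) evaluates the straight-line program above on all eight inputs:

from itertools import product

def NAND(a, b):
    return 1 - a * b

def xor3_unrolled(X):
    # A direct transcription of the "loop free and index free" program above.
    temp_0 = NAND(X[0], X[0])
    one = NAND(X[0], temp_0)        # NAND(a, NOT a) = 1, so this is always 1
    zero = NAND(one, one)           # always 0
    temp_2 = NAND(X[0], zero)
    temp_3 = NAND(X[0], temp_2)
    temp_4 = NAND(zero, temp_2)
    Y0 = NAND(temp_3, temp_4)       # Y[0] = XOR(X[0], 0) = X[0]
    temp_2 = NAND(X[1], Y0)
    temp_3 = NAND(X[1], temp_2)
    temp_4 = NAND(Y0, temp_2)
    Y0 = NAND(temp_3, temp_4)       # Y[0] = XOR(X[0], X[1])
    temp_2 = NAND(X[2], Y0)
    temp_3 = NAND(X[2], temp_2)
    temp_4 = NAND(Y0, temp_2)
    return NAND(temp_3, temp_4)     # Y[0] = XOR(X[0], X[1], X[2])

assert all(xor3_unrolled(x) == sum(x) % 2 for x in product([0, 1], repeat=3))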

Key to this transformation was the fact that in our original NAND-TM program for XOR, regardless of whether the input is $011$, $100$, or any other string, the index variable i is guaranteed to equal $0$ in the first iteration, $1$ in the second iteration, $2$ in the third iteration, and so on and so forth. The particular sequence $0, 1, 2, \ldots$ is immaterial: the crucial property is that the NAND-TM program for XOR is oblivious in the sense that the value of the index i in the $j$-th iteration depends only on $j$ and does not depend on the particular choice of the input.

Figure 13.10: A NAND circuit for $XOR_3$ obtained by "unrolling the loop" of the NAND-TM program for computing XOR three times.

Luckily, it is possible to transform every NAND-TM program into a functionally equivalent oblivious program with at most quadratic overhead. (Similarly we can transform any Turing machine into a functionally equivalent oblivious Turing machine, see Exercise 13.6.)

Theorem 13.13 — Making NAND-TM oblivious. Let $T : \mathbb{N} \rightarrow \mathbb{N}$ be a nice

function and let $F \in TIME_{TM}(T(n))$. Then there is a NAND-TM program $P$ that computes $F$ in $O(T(n)^2)$ steps and satisfies the following. For every $n \in \mathbb{N}$ there is a sequence $i_0, i_1, \ldots, i_{m-1}$ such that for every $x \in \{0,1\}^n$, if $P$ is executed on input $x$ then in the $j$-th iteration the variable i is equal to $i_j$.

In other words, Theorem 13.13 implies that if we can compute $F$ in $T(n)$ steps, then we can compute it in $O(T(n)^2)$ steps with a program $P$ in which the position of i in the $j$-th iteration depends only on $j$ and the length of the input, and not on the contents of the input. Such a program can be easily translated into a NAND-CIRC program of $O(T(n)^2)$ lines by "unrolling the loop".

Proof Idea:
We can translate any NAND-TM program $P'$ into an oblivious program $P$ by making $P$ "sweep" its arrays. That is, the index i in $P$ will always move all the way from position $0$ to position $T(n) - 1$ and back again. We can then simulate the program $P'$ with at most $T(n)$ overhead: if $P'$ wants to move i left when we are in a rightward sweep then we simply wait the at most $2T(n)$ steps until the next time we are back in the same position while sweeping to the left.

Proof of Theorem 13.13. Let $P'$ be a NAND-TM program computing $F$ in $T(n)$ steps. We construct an oblivious NAND-TM program $P$ for computing $F$ as follows (see also Fig. 13.11).

Figure 13.11: We simulate a $T(n)$-time NAND-TM program $P'$ with an oblivious NAND-TM program $P$ by adding special arrays Atstart and Atend to mark positions $0$ and $T - 1$ respectively. The program $P$ will simply "sweep" its arrays from right to left and back again. If the original program $P'$ would have moved i in a different direction then we wait $O(T)$ steps until we reach the same point back again, and so $P$ runs in $O(T(n)^2)$ time.

1. On input $x$, $P$ will compute $T = T(|x|)$ and set up arrays Atstart and Atend satisfying Atstart[$0$] $= 1$ and Atstart[$i$] $= 0$ for $i > 0$, and Atend[$T - 1$] $= 1$ and Atend[$i$] $= 0$ for all $i \neq T - 1$. We can do this because $T$ is a nice function. Note that since this computation does not depend on $x$ but only on its length, it is oblivious.

2. $P$ will also have a special array Marker initialized to all zeroes.

3. The index variable of $P$ will change direction of movement to the right whenever Atstart[i] $= 1$ and to the left whenever Atend[i] $= 1$.

4. The program $P$ simulates the execution of $P'$. However, if the MODANDJMP instruction in $P'$ attempts to move to the right when $P$ is moving left (or vice versa) then $P$ will set Marker[i] to $1$ and enter into a special "waiting mode". In this mode $P$ will wait until the next time in which Marker[i] $= 1$ (at the next sweep), at which point $P$ zeroes Marker[i] and continues with the simulation. In

the worst case this will take $2T(n)$ steps (if $P$ has to go all the way from one end to the other and back again).

5. We also modify $P$ to ensure it ends the computation after simulating exactly $T(n)$ steps of $P'$, adding "dummy steps" if $P'$ ends early.

We see that $P$ simulates the execution of $P'$ with an overhead of $O(T(n))$ steps of $P$ per one step of $P'$, hence completing the proof. ■

Theorem 13.13 implies Theorem 13.12. Indeed, if $P$ is a $k$-line oblivious NAND-TM program computing $F$ in time $T(n)$ then for every $n$ we can obtain a NAND-CIRC program of $(k - 1) \cdot T(n)$ lines by simply making $T(n)$ copies of $P$ (dropping the final MODANDJMP line). In the $j$-th copy we replace all references of the form Foo[i] with foo_$i_j$ where $i_j$ is the value of i in the $j$-th iteration.

13.6.2 "Unrolling the loop": algorithmic transformation of Turing Machines to circuits

The proof of Theorem 13.12 is algorithmic, in the sense that the proof yields a polynomial-time algorithm that given a Turing Machine $M$ and parameters $T$ and $n$, produces a circuit of $O(T^2)$ gates that agrees with $M$ on all inputs $x \in \{0,1\}^n$ (as long as $M$ runs for less than $T$ steps on these inputs). We record this fact in the following theorem, since it will be useful for us later on:

Theorem 13.14 — Turing-machine to circuit compiler. There is an algorithm UNROLL such that for every Turing Machine $M$ and numbers $n, T$, UNROLL$(M, 1^T, 1^n)$ runs for $poly(|M|, T, n)$ steps and outputs a NAND circuit $C$ with $n$ inputs, $O(T^2)$ gates, and one output, such that

$$C(x) = \begin{cases} y & M \text{ halts in } \leq T \text{ steps and outputs } y \\ 0 & \text{otherwise} \end{cases} \qquad (13.7)$$

Figure 13.12: The function UNROLL takes as input a Turing Machine $M$, an input length parameter $n$, a step budget parameter $T$, and outputs a circuit $C$ of size $poly(T)$ that takes $n$ bits of input and outputs $M(x)$ if $M$ halts on $x$ within at most $T$ steps.

Proof. We only sketch the proof since it follows by directly translating the proof of Theorem 13.12 into an algorithm together with the simulation of Turing machines by NAND-TM programs (see also Fig. 13.13). Specifically, UNROLL does the following:

1. Transform the Turing Machine $M$ into an equivalent NAND-TM program $P$.

2. Transform the NAND-TM program $P$ into an equivalent oblivious program $P'$ following the proof of Theorem 13.13. The program $P'$ takes $T' = O(T^2)$ steps to simulate $T$ steps of $P$.

3. "Unroll the loop" of $P'$ by obtaining a NAND-CIRC program of $O(T')$ lines (or equivalently a NAND circuit with $O(T^2)$ gates) corresponding to the execution of $T'$ iterations of $P'$. ■

Figure 13.13: We can transform a Turing Machine $M$, input length parameter $n$, and time bound $T$ into an $O(T^2)$-sized NAND circuit that agrees with $M$ on all inputs $x \in \{0,1\}^n$ on which $M$ halts in at most $T$ steps. The transformation is obtained by first using the equivalence of Turing Machines and NAND-TM programs $P$, then turning $P$ into an equivalent oblivious NAND-TM program $P'$ via Theorem 13.13,

then "unrolling" $O(T^2)$ iterations of the loop of $P'$ to obtain an $O(T^2)$-line NAND-CIRC program that agrees with $P'$ on length-$n$ inputs, and finally translating this program into an equivalent circuit.

Big Idea 20: By "unrolling the loop" we can transform an algorithm that takes $T(n)$ steps to compute $F$ into a circuit that uses $poly(T(n))$ gates to compute the restriction of $F$ to $\{0,1\}^n$.

Reviewing the transformations described in Fig. 13.13, as well as solving the following two exercises, is a great way to get more comfort with non-uniform complexity and in particular with $P_{/poly}$ and its relation to $P$.


Solved Exercise 13.4 — Alternative characterization of P. Prove that for every $F : \{0,1\}^* \rightarrow \{0,1\}$, $F \in P$ if and only if there is a polynomial-time Turing Machine $M$ such that for every $n \in \mathbb{N}$, $M(1^n)$ outputs a description of an $n$-input circuit $C_n$ that computes the restriction $F_{\upharpoonright n}$ of $F$ to inputs in $\{0,1\}^n$. ■

Solution:
We start with the "if" direction. Suppose that there is a polynomial-time Turing Machine $M$ that on input $1^n$ outputs a circuit $C_n$ that

computes $F_{\upharpoonright n}$. Then the following is a polynomial-time Turing Machine $M'$ to compute $F$. On input $x \in \{0,1\}^*$, $M'$ will:

1. Let $n = |x|$ and compute $C_n = M(1^n)$.

2. Return the evaluation of $C_n$ on $x$.

Since we can evaluate a Boolean circuit on an input in polynomial time, $M'$ runs in polynomial time and computes $F(x)$ on every input $x$.

For the "only if" direction, if $M$ is a Turing Machine that computes $F$ in polynomial-time, then (applying the equivalence of Turing Machines and NAND-TM as well as Theorem 13.13) there is also an oblivious NAND-TM program $P$ that computes $F$ in time $p(n)$ for some polynomial $p$. We can now define $M$ to be the Turing Machine that on input $1^n$ outputs the NAND circuit obtained by "unrolling the loop" of $P$ for $p(n)$ iterations. The resulting NAND circuit computes $F_{\upharpoonright n}$ and has $O(p(n))$ gates. It can also be transformed to a Boolean circuit with $O(p(n))$ AND/OR/NOT gates. ■
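To make the "if" direction of this solution concrete, here is a minimal Python sketch of the machine $M'$. The generator gen_and_circuit is a hypothetical stand-in for $M(1^n)$ (here producing a circuit for the $n$-bit AND function), and circuits are represented in an ad hoc way as lists of (target, left, right) NAND gates; none of these names come from the book's codebase:

def eval_nand_circ(prog, x):
    # prog: list of (target, left, right) triples; inputs are named
    # X[0], ..., X[n-1] and the output variable is Y[0].
    vals = {"X[%d]" % i: b for i, b in enumerate(x)}
    for target, left, right in prog:
        vals[target] = 1 - vals[left] * vals[right]   # a NAND gate
    return vals["Y[0]"]

def gen_and_circuit(n):
    # Hypothetical stand-in for M(1^n): a NAND circuit for AND of n bits.
    prog, cur = [], "X[0]"
    for i in range(1, n):
        prog.append(("t%d" % i, cur, "X[%d]" % i))      # t_i = NAND(cur, X[i])
        prog.append(("a%d" % i, "t%d" % i, "t%d" % i))  # a_i = AND(cur, X[i])
        cur = "a%d" % i
    prog.append(("ny", cur, cur))        # ny = NOT cur
    prog.append(("Y[0]", "ny", "ny"))    # Y[0] = NOT ny = cur
    return prog

def M_prime(x):
    # The machine M' of the solution: build C_n = M(1^n), then evaluate on x.
    return eval_nand_circ(gen_and_circuit(len(x)), x)

assert M_prime([1, 1, 1]) == 1 and M_prime([1, 0, 1]) == 0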

Solved Exercise 13.5 — $P_{/poly}$ characterization by advice. Let

$F : \{0,1\}^* \rightarrow \{0,1\}$. Then $F \in P_{/poly}$ if and only if there exists a polynomial $p : \mathbb{N} \rightarrow \mathbb{N}$, a polynomial-time Turing Machine $M$ and a sequence $\{a_n\}_{n \in \mathbb{N}}$ of strings, such that for every $n \in \mathbb{N}$:

• $|a_n| \leq p(n)$

• For every $x \in \{0,1\}^n$, $M(a_n, x) = F(x)$. ■

Solution: We only sketch the proof. For the "only if" direction, if

$F \in P_{/poly}$ then we can use for $a_n$ simply the description of the corresponding circuit $C_n$ and for $M$ the program that computes in polynomial time the evaluation of a circuit on its input.

For the "if" direction, we can use the same "unrolling the loop" technique of Theorem 13.12 to show that if $P$ is a polynomial-time NAND-TM program, then for every $n \in \mathbb{N}$, the map $x \mapsto P(a_n, x)$ can be computed by a polynomial size NAND-CIRC program $Q_n$. ■

13.6.3 Can uniform algorithms simulate non uniform ones?

Theorem 13.12 shows that every function in $TIME(T(n))$ is in $SIZE(poly(T(n)))$. One can ask if there is an inverse relation. Suppose that $F$ is such that $F_{\upharpoonright n}$ has a "short" NAND-CIRC program for every $n$. Can we say that it must be in $TIME(T(n))$ for some "small" $T$? The

answer is an emphatic no. Not only is $P_{/poly}$ not contained in $P$, in fact

$P_{/poly}$ contains functions that are uncomputable!

Theorem 13.15 — $P_{/poly}$ contains uncomputable functions. There exists an

uncomputable function $F : \{0,1\}^* \rightarrow \{0,1\}$ such that $F \in P_{/poly}$.

Proof Idea:

Since $P_{/poly}$ corresponds to non uniform computation, a function

$F$ is in $P_{/poly}$ if for every $n \in \mathbb{N}$, the restriction $F_{\upharpoonright n}$ to inputs of length $n$ has a small circuit/program, even if the circuits for different values of $n$ are completely different from one another. In particular, if $F$ has the property that for every pair of equal-length inputs $x$ and $x'$, $F(x) = F(x')$, then this means that $F_{\upharpoonright n}$ is either the constant function zero or the constant function one for every $n \in \mathbb{N}$. Since the constant function has a (very!) small circuit, such a function $F$ will always be in $P_{/poly}$ (indeed even in smaller classes). Yet by a reduction from the Halting problem, we can obtain a function with this property that is uncomputable.

Proof of Theorem 13.15. Consider the following "unary halting function" $UH : \{0,1\}^* \rightarrow \{0,1\}$ defined as follows. We let $S : \mathbb{N} \rightarrow \{0,1\}^*$ be the function that on input $n \in \mathbb{N}$, outputs the string that corresponds to the binary representation of the number $n$ without the most significant digit. Note that $S$ is onto. For every $x \in \{0,1\}^*$, we define $UH(x) = HALTONZERO(S(|x|))$. That is, if $n$ is the length of $x$, then $UH(x) = 1$ if and only if the string $S(n)$ encodes a NAND-TM program that halts on the input $0$.

$UH$ is uncomputable, since otherwise we could compute HALTONZERO by transforming the input program $P$ into the integer $n$ such that $P = S(n)$ and then running $UH(1^n)$ (i.e., $UH$ on the string of $n$ ones). On the other hand, for every $n$, $UH$ is either equal to $0$ on all inputs $x$ of length $n$ or equal to $1$ on all such inputs, and hence can be computed by a NAND-CIRC program of a constant number of lines. ■
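A quick way to convince yourself that $S$ is onto (and hence that every program appears as $S(n)$ for some $n$) is to implement it. A minimal Python sketch, under the convention that $S(1)$ is the empty string:

def S(n):
    # Binary representation of n >= 1 with the most significant digit dropped:
    # bin(n) == '0b1...', so strip the '0b' prefix and the leading 1.
    return bin(n)[3:]

def S_inverse(s):
    # The unique n >= 1 with S(n) == s: put the leading 1 back.
    return int("1" + s, 2)

assert S(6) == "10"                    # 6 is 110 in binary
assert S(S_inverse("0110")) == "0110"  # every string is in the image of S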

The issue here is of course uniformity. For a function $F : \{0,1\}^* \rightarrow \{0,1\}$, if $F$ is in $TIME(T(n))$ then we have a single algorithm that can compute $F_{\upharpoonright n}$ for every $n$. On the other hand, $F_{\upharpoonright n}$ might be in $SIZE(T(n))$ for every $n$ using a completely different algorithm for every input length. For this reason we typically use $P_{/poly}$ not as a model of efficient computation but rather as a way to model inefficient computation. For example, in cryptography people often define an encryption scheme to be secure if breaking it for a key of length $n$ requires


more than a polynomial number of NAND lines. Since $P \subseteq P_{/poly}$, this in particular precludes a polynomial time algorithm for doing so, but there are technical reasons why working in a non uniform model makes more sense in cryptography. It also allows to talk about security in non asymptotic terms such as a scheme having "$128$ bits of security".

While it can sometimes be a real issue, in many natural settings the difference between uniform and non-uniform computation does not seem so important. In particular, in all the examples of problems not known to be in $P$ we discussed before: longest path, 3SAT, factoring, etc., these problems are also not known to be in $P_{/poly}$ either. Thus, for "natural" functions, if you pretend that $TIME(T(n))$ is roughly the same as $SIZE(T(n))$, you will be right more often than wrong.

Figure 13.14: Relations between $P$, $EXP$, and $P_{/poly}$. It is known that $P \subseteq EXP$, $P \subseteq P_{/poly}$ and that $P_{/poly}$ contains uncomputable functions (which in particular are outside of $EXP$). It is not known whether or not $EXP \subseteq P_{/poly}$ though it is believed that $EXP \nsubseteq P_{/poly}$.


13.6.4 Uniform vs. Nonuniform computation: A recap

To summarize, the two models of computation we have described so far are:

• Uniform models: Turing machines, NAND-TM programs, RAM machines, NAND-RAM programs, C/JavaScript/Python, etc. These models include loops and unbounded memory, hence a single program can compute a function with unbounded input length.

• Non-uniform models: Boolean Circuits or straightline programs have no loops and can only compute finite functions. The time to execute them is exactly the number of lines or gates they contain.

For a function $F : \{0,1\}^* \rightarrow \{0,1\}$ and some nice time bound $T : \mathbb{N} \rightarrow \mathbb{N}$, we know that:

• If $F$ is uniformly computable in time $T(n)$ then there is a sequence of circuits $C_1, C_2, \ldots$ where $C_n$ has $poly(T(n))$ gates and computes $F_{\upharpoonright n}$ (i.e., the restriction of $F$ to $\{0,1\}^n$) for every $n$.

• The reverse direction is not necessarily true - there are examples of functions $F : \{0,1\}^* \rightarrow \{0,1\}$ such that $F_{\upharpoonright n}$ can be computed by even a constant size circuit but $F$ is uncomputable.

This means that non uniform complexity is more useful to establish hardness of a function than its easiness.

Chapter Recap

• We can define the time complexity of a function using NAND-TM programs, and similarly to the notion of computability, this appears to capture the inherent complexity of the function.

• There are many natural problems that have polynomial-time algorithms, and other natural problems that we'd love to solve, but for which the best known algorithms are exponential.

• The definition of polynomial time, and hence the class $P$, is robust to the choice of model, whether it is Turing machines, NAND-TM, NAND-RAM, modern programming languages, and many other models.

• The time hierarchy theorem shows that there are some problems that can be solved in exponential, but not in polynomial time. However, we do not know if that is the case for the natural examples that we described in this lecture.

• By "unrolling the loop" we can show that every function computable in time $T(n)$ can be computed by a sequence of NAND-CIRC programs (one for every input length) each of size at most $poly(T(n))$.


13.7 EXERCISES

Exercise 13.1 — Equivalence of different definitions of P and EXP. Prove that the classes $P$ and $EXP$ defined in Definition 13.2 are equal to $\cup_{c \in \{1,2,3,\ldots\}} TIME(n^c)$ and $\cup_{c \in \{1,2,3,\ldots\}} TIME(2^{n^c})$ respectively. (If $S_1, S_2, S_3, \ldots$ is a collection of sets, then the set $S = \cup_{c \in \{1,2,3,\ldots\}} S_c$ is the set of all elements $e$ such that there exists some $c \in \{1,2,3,\ldots\}$ where $e \in S_c$.) ■

Exercise 13.2 — Robustness to representation. Theorem 13.5 shows that the classes $P$ and $EXP$ are robust with respect to variations in the choice of the computational model. This exercise shows that these classes are also robust with respect to our choice of the representation of the input. Specifically, let $F$ be a function mapping graphs to $\{0,1\}$, and let $F', F'' : \{0,1\}^* \rightarrow \{0,1\}$ be the functions defined as follows. For every $x \in \{0,1\}^*$:

• $F'(x) = 1$ iff $x$ represents a graph $G$ via the adjacency matrix representation such that $F(G) = 1$.

• $F''(x) = 1$ iff $x$ represents a graph $G$ via the adjacency list representation such that $F(G) = 1$.

Prove that $F' \in P$ iff $F'' \in P$. More generally, for every function $F : \{0,1\}^* \rightarrow \{0,1\}$, the answer to the question of whether $F \in P$ (or whether $F \in EXP$) is unchanged by switching representations, as long as transforming one representation to the other can be done in polynomial time (which essentially holds for all reasonable representations). ■

Exercise 13.3 — Boolean functions. For every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, define $Bool(F)$ to be the function mapping $\{0,1\}^*$ to $\{0,1\}$ such that on input a (string representation of a) triple $(x, i, \sigma)$ with $x \in \{0,1\}^*$, $i \in \mathbb{N}$ and $\sigma \in \{0,1\}$,

$$Bool(F)(x, i, \sigma) = \begin{cases} F(x)_i & \sigma = 0, i < |F(x)| \\ 1 & \sigma = 1, i < |F(x)| \\ 0 & \text{otherwise} \end{cases} \qquad (13.8)$$

where $F(x)_i$ is the $i$-th bit of the string $F(x)$. Prove that $F \in P$ if and only if $Bool(F) \in P$. ■

Exercise 13.4 — Composition of polynomial time. Prove that if $F, G : \{0,1\}^* \rightarrow \{0,1\}^*$ are in $P$ then their composition $F \circ G$, which is the function $H$ s.t. $H(x) = F(G(x))$, is also in $P$. ■

Exercise 13.5 — Non composition of exponential time. Prove that there are some $F, G : \{0,1\}^* \rightarrow \{0,1\}^*$ s.t. $F, G \in EXP$ but $F \circ G$ is not in $EXP$. ■

Exercise 13.6 — Oblivious Turing Machines. We say that a Turing machine $M$ is oblivious if there is some function $T : \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{Z}$ such that for every input $x$ of length $n$, and $t \in \mathbb{N}$ it holds that:

• If $M$ takes more than $t$ steps to halt on the input $x$, then in the $t$-th step $M$'s head will be in the position $T(n, t)$. (Note that this position depends only on the length of $x$ and not its contents.)

• If $M$ halts before the $t$-th step then $T(n, t) = -1$.

Prove that if $F \in P$ then there exists an oblivious Turing machine $M$ that computes $F$ in polynomial time. See footnote for hint.¹ ■

¹ Hint: This is the Turing Machine analog of Theorem 13.13. We replace one step of the original TM $M$ with a "sweep" of the oblivious TM $M'$ in which it goes $T$ steps to the right and then $T$ steps to the left.

Exercise 13.7. Let $EDGE : \{0,1\}^* \rightarrow \{0,1\}$ be the function such that on input a string representing a triple $(L, i, j)$, where $L$ is the adjacency list representation of an $n$ vertex graph $G$, and $i$ and $j$ are numbers in $[n]$, $EDGE(L, i, j) = 1$ if the edge $\{i, j\}$ is present in the graph. $EDGE$ outputs $0$ on all other inputs.

1. Prove that $EDGE \in P$.

2. Let $PLANARMATRIX : \{0,1\}^* \rightarrow \{0,1\}$ be the function that on input an adjacency matrix $A$ outputs $1$ if and only if the graph represented by $A$ is planar (that is, can be drawn on the plane without edges crossing one another). For this question, you can use without proof the fact that $PLANARMATRIX \in P$. Prove that $PLANARLIST \in P$ where $PLANARLIST : \{0,1\}^* \rightarrow \{0,1\}$ is the function that on input an adjacency list $L$ outputs $1$ if and only if $L$ represents a planar graph. ■

Exercise 13.8 — Evaluate NAND circuits. Let $NANDEVAL : \{0,1\}^* \rightarrow \{0,1\}$ be the function such that for every string representing a pair $(Q, x)$ where $Q$ is an $n$-input $1$-output NAND-CIRC (not NAND-TM!) program and $x \in \{0,1\}^n$, $NANDEVAL(Q, x) = Q(x)$. On all other inputs $NANDEVAL$ outputs $0$.

Prove that $NANDEVAL \in P$. ■

Exercise 13.9 — Find hard function. Let $NANDHARD : \{0,1\}^* \rightarrow \{0,1\}$ be the function such that on input a string representing a pair $(f, s)$ where

• $f \in \{0,1\}^{2^n}$ for some $n \in \mathbb{N}$

• $s \in \mathbb{N}$

$NANDHARD(f, s) = 1$ if there is no NAND-CIRC program $Q$ of at most $s$ lines that computes the function $F : \{0,1\}^n \rightarrow \{0,1\}$ whose truth table is the string $f$. That is, $NANDHARD(f, s) = 1$ if for every NAND-CIRC program $Q$ of at most $s$ lines, there exists some $x \in \{0,1\}^n$ such that $Q(x) \neq f_x$ where $f_x$ denotes the $x$-th coordinate of $f$, using the binary representation to identify $\{0,1\}^n$ with the numbers $\{0, \ldots, 2^n - 1\}$.

1. Prove that $NANDHARD \in EXP$.

2. (Challenge) Prove that there is an algorithm FINDHARD such that if $n$ is sufficiently large, then FINDHARD$(1^n)$ runs in time $2^{O(n)}$ and outputs a string $f \in \{0,1\}^{2^n}$ that is the truth table of a function that is not contained in $SIZE(2^n/(1000n))$. (In other

words, if $f$ is the string output by FINDHARD$(1^n)$ then if we let

$F : \{0,1\}^n \rightarrow \{0,1\}$ be the function such that $F(x)$ outputs the $x$-th coordinate of $f$, then $F \notin SIZE(2^n/(1000n))$.)² ■

² Hint: Use Item 1, the existence of functions requiring exponentially hard NAND programs, and the fact that there are only finitely many functions mapping $\{0,1\}^n$ to $\{0,1\}$.

Exercise 13.10. Suppose that you are in charge of scheduling courses in computer science in University X. In University X, computer science students wake up late, and have to work on their startups in the afternoon, and take long weekends with their investors. So you only have two possible slots: you can schedule a course either Monday-Wednesday 11am-1pm or Tuesday-Thursday 11am-1pm.

Let $SCHEDULE : \{0,1\}^* \rightarrow \{0,1\}$ be the function that takes as input a list of courses $L$ and a list of conflicts $C$ (i.e., a list of pairs of courses that cannot share the same time slot) and outputs $1$ if and only if there is a "conflict free" scheduling of the courses in $L$, where no pair in $C$ is scheduled in the same time slot.

More precisely, the list $L$ is a list of strings $(c_0, \ldots, c_{n-1})$ and the list $C$ is a list of pairs of the form $(c_i, c_j)$. $SCHEDULE(L, C) = 1$ if and only if there exists a partition of $c_0, \ldots, c_{n-1}$ into two parts so that there is no pair $(c_i, c_j) \in C$ such that both $c_i$ and $c_j$ are in the same part.

Prove that $SCHEDULE \in P$. As usual, you do not have to provide the full code to show that this is the case, and can describe operations at a high level, as well as appeal to any data structures or other results mentioned in the book or in lecture. Note that to show that a function $F$ is in $P$ you need to both (1) present an algorithm $A$ that computes $F$ in polynomial time, and (2) prove that $A$ does indeed run in polynomial time, and does indeed compute the correct answer.

Try to think whether or not your algorithm extends to the case where there are three possible time slots. ■

13.8 BIBLIOGRAPHICAL NOTES

Because we are interested in the maximum number of steps for inputs of a given length, running-time as we defined it is often known as worst case complexity. The minimum number of steps (or "best case" complexity) to compute a function on length $n$ inputs is typically not a meaningful quantity since essentially every natural problem will have some trivially easy instances. However, the average case complexity (i.e., complexity on a "typical" or "random" input) is an interesting concept which we'll return to when we discuss cryptography. That said, worst-case complexity is the most standard and basic of the complexity measures, and will be our focus in most of this book. Some lower bounds for single-tape Turing machines are given in [Maa85].

For defining efficiency in the $\lambda$ calculus, one needs to be careful about the order of application of the reduction steps, which can matter for computational efficiency, see for example this paper.

The notation $P_{/poly}$ is used for historical reasons. It was introduced by Karp and Lipton, who considered this class as corresponding to functions that can be computed by polynomial-time Turing Machines that are given for any input length $n$ an advice string of length polynomial in $n$.

Learning Objectives:
• Introduce the notion of polynomial-time reductions as a way to relate the complexity of problems to one another.
• See several examples of such reductions.
• 3SAT as a basic starting point for reductions.

14 Polynomial-time reductions

Consider some of the problems we have encountered in Chapter 12:

1. The 3SAT problem: deciding whether a given 3CNF formula has a satisfying assignment.

2. Finding the longest path in a graph.

3. Finding the maximum cut in a graph.

4. Solving quadratic equations over $n$ variables $x_0, \ldots, x_{n-1} \in \mathbb{R}$.

All of these problems have the following properties:

• These are important problems, and people have spent significant effort on trying to find better algorithms for them.

• Each one of these is a search problem, whereby we search for a solution that is “good” in some easy to define sense (e.g., a long path, a satisfying assignment, etc.).

• Each of these problems has a trivial exponential time algorithm that involves enumerating all possible solutions.

• At the moment, for all these problems the best known algorithm is not much faster than the trivial one in the worst case.

In this chapter and in Chapter 15 we will see that, despite their apparent differences, we can relate the computational complexity of these and many other problems. In fact, it turns out that the problems above are computationally equivalent, in the sense that solving one of them immediately implies solving the others. This phenomenon, known as NP completeness, is one of the surprising discoveries of theoretical computer science, and we will see that it has far-reaching ramifications.


This chapter: A non-mathy overview

This chapter introduces the concept of a polynomial time reduction, which is a central object in computational complexity and this book in particular. A polynomial-time reduction is a way to reduce the task of solving one problem to another. The way we use reductions in complexity is to argue that if the first problem is hard to solve efficiently, then the second must also be hard. We see several examples for reductions in this chapter, and reductions will be the basis for the theory of NP completeness that we will develop in Chapter 15.

Figure 14.1: In this chapter we show that if the 3SAT problem cannot be solved in polynomial time, then neither can the QUADEQ, LONGESTPATH, ISET and MAXCUT problems. We do this by using the reduction paradigm showing for example "if pigs could whistle" (i.e., if we had an efficient algorithm for QUADEQ) then "horses could fly" (i.e., we would have an efficient algorithm for 3SAT).


In this chapter we will see that for each one of the problems of finding a longest path in a graph, solving quadratic equations, and finding the maximum cut, if there exists a polynomial-time algorithm for this problem then there exists a polynomial-time algorithm for the 3SAT problem as well. In other words, we will reduce the task of solving 3SAT to each one of the above tasks. Another way to interpret these results is that if there does not exist a polynomial-time algorithm for 3SAT then there does not exist a polynomial-time algorithm for these other problems as well. In Chapter 15 we will see evidence (though no proof!) that all of the above problems do not have polynomial-time algorithms and hence are inherently intractable.

14.1 FORMAL DEFINITIONS OF PROBLEMS

For reasons of technical convenience rather than anything substantial, we concern ourselves with decision problems (i.e., Yes/No questions) or in other words Boolean (i.e., one-bit output) functions. We model the

problems above as functions mapping $\{0,1\}^*$ to $\{0,1\}$ in the following way:

3SAT. The 3SAT problem can be phrased as the function $3SAT : \{0,1\}^* \rightarrow \{0,1\}$ that takes as input a 3CNF formula $\varphi$ (i.e., a formula of the form $C_0 \wedge \cdots \wedge C_{m-1}$ where each $C_i$ is the OR of three variables or their negation) and maps $\varphi$ to $1$ if there exists some assignment to the variables of $\varphi$ that causes it to evaluate to true, and to $0$ otherwise. For example

$$3SAT\big(\text{"}(x_0 \vee x_1 \vee x_2) \wedge (x_1 \vee x_2 \vee x_3) \wedge (x_0 \vee x_2 \vee x_3)\text{"}\big) = 1 \qquad (14.1)$$

since the assignment $x = 1101$ satisfies the input formula. In the above we assume some representation of formulas as strings, and define the function to output $0$ if its input is not a valid representation; we use the same convention for all the other functions below.

Quadratic equations. The quadratic equations problem corresponds to the function $QUADEQ : \{0,1\}^* \rightarrow \{0,1\}$ that maps a set of quadratic equations $E$ to $1$ if there is an assignment $x$ that satisfies all equations, and to $0$ otherwise.

Longest path. The longest path problem corresponds to the function $LONGPATH : \{0,1\}^* \rightarrow \{0,1\}$ that maps a graph $G$ and a number $k$ to $1$ if there is a simple path in $G$ of length at least $k$, and maps $(G, k)$ to $0$ otherwise. The longest path problem is a generalization of the well-known Hamiltonian Path Problem of determining whether a path of length $n$ exists in a given $n$ vertex graph.

Maximum cut. The maximum cut problem corresponds to the function $MAXCUT : \{0,1\}^* \rightarrow \{0,1\}$ that maps a graph $G$ and a number $k$ to $1$ if there is a cut in $G$ that cuts at least $k$ edges, and maps $(G, k)$ to $0$ otherwise.

All of the problems above are in $EXP$ but it is not known whether or not they are in $P$. However, we will see in this chapter that if either QUADEQ, LONGPATH or MAXCUT are in $P$, then so is 3SAT.

14.2 POLYNOMIAL-TIME REDUCTIONS

Suppose that $F, G : \{0,1\}^* \rightarrow \{0,1\}$ are two Boolean functions. A polynomial-time reduction (or sometimes just "reduction" for short) from $F$ to $G$ is a way to show that $F$ is "no harder" than $G$, in the sense that a polynomial-time algorithm for $G$ implies a polynomial-time algorithm for $F$.

Definition 14.1 — Polynomial-time reductions. Let $F, G : \{0,1\}^* \rightarrow \{0,1\}$. We say that $F$ reduces to $G$, denoted by $F \leq_p G$, if there is a polynomial-time computable $R : \{0,1\}^* \rightarrow \{0,1\}^*$ such that for every $x \in \{0,1\}^*$,

$$F(x) = G(R(x)). \qquad (14.2)$$

We say that $F$ and $G$ have equivalent complexity if $F \leq_p G$ and $G \leq_p F$.

The following exercise justifies our intuition that $F \leq_p G$ signifies that "$F$ is no harder than $G$".

Solved Exercise 14.1 — Reductions and P. Prove that if $F \leq_p G$ and $G \in P$ then $F \in P$. ■

Figure 14.2: If $F \leq_p G$ then we can transform a polynomial-time algorithm $B$ that computes $G$ into a polynomial-time algorithm $A$ that computes $F$. To compute $F(x)$ we can run the reduction $R$ guaranteed by the fact that $F \leq_p G$ to obtain $y = R(x)$ and then run our algorithm $B$ for $G$ to compute $G(y)$.

As usual, solving this exercise on your own is an excellent way to make sure you understand Definition 14.1.

Solution:
Suppose there was an algorithm $B$ that computes $G$ in time $p(n)$ where $n$ is its input size. Then, (14.2) directly gives an algorithm $A$ to compute $F$ (see Fig. 14.2). Indeed, on input $x \in \{0,1\}^*$, Algorithm $A$ will run the polynomial-time reduction $R$ to obtain $y = R(x)$ and then return $B(y)$. By (14.2), $G(R(x)) = F(x)$ and hence Algorithm $A$ will indeed compute $F$.

We now show that $A$ runs in polynomial time. By assumption, $R$ can be computed in time $q(n)$ for some polynomial $q$. In particular, this means that $|y| \leq q(|x|)$ (as just writing down $y$ takes $|y|$ steps). Computing $B(y)$ will take at most $p(|y|) \leq p(q(|x|))$ steps. Thus the total running time of $A$ on inputs of length $n$ is at most the time to compute $y$, which is bounded by $q(n)$, and the time to compute $B(y)$, which is bounded by $p(q(n))$, and since the composition of two polynomials is a polynomial, $A$ runs in polynomial time. ■
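In code, the algorithm $A$ constructed in this solution is literally a one-liner; R and B below are hypothetical placeholders for the reduction and the polynomial-time algorithm for $G$:

def A(x, R, B):
    # Computes F(x) given F <=_p G: apply the reduction, then solve G.
    y = R(x)       # takes q(|x|) steps, so |y| <= q(|x|)
    return B(y)    # takes p(|y|) <= p(q(|x|)) steps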

Big Idea 21: A reduction $F \leq_p G$ shows that $F$ is "no harder than $G$" or equivalently that $G$ is "no easier than $F$".

14.2.1 Whistling pigs and flying horses

A reduction from $F$ to $G$ can be used for two purposes:


• If we already know an algorithm for $G$ and $F \leq_p G$ then we can use the reduction to obtain an algorithm for $F$. This is a widely used tool in algorithm design. For example, in Section 12.1.4 we saw how the Min-Cut Max-Flow theorem allows to reduce the task of computing a minimum cut in a graph to the task of computing a maximum flow in it.

• If we have proven (or have evidence) that there exists no polynomial-time algorithm for $F$ and $F \leq_p G$ then the existence of this reduction allows us to conclude that there exists no polynomial-time algorithm for $G$. This is the "if pigs could whistle then horses could fly" interpretation we've seen in Section 9.4. We show that if there was a hypothetical efficient algorithm for $G$ (a "whistling pig") then since $F \leq_p G$ there would be an efficient algorithm for $F$ (a "flying horse"). In this book we often use reductions for this second purpose, although the line between the two is sometimes blurry (see the bibliographical notes in Section 14.9).

The most crucial difference between the notion in Definition 14.1 and the reductions we saw in the context of uncomputability (e.g., in Section 9.4) is that for relating time complexity of problems, we need the reduction to be computable in polynomial time, as opposed to merely computable. Definition 14.1 also restricts reductions to have a very specific format. That is, to show that $F \leq_p G$, rather than allowing a general algorithm for $F$ that uses a "magic box" that computes $G$, we only allow an algorithm that computes $F(x)$ by outputting $G(R(x))$. This restricted form is convenient for us, but people have defined and used more general reductions as well (see Section 14.9).

In this chapter we use reductions to relate the computational complexity of the problems mentioned above: 3SAT, Quadratic Equations, Maximum Cut, and Longest Path, as well as a few others. We will reduce 3SAT to the latter problems, demonstrating that solving any one of them efficiently will result in an efficient algorithm for 3SAT. In Chapter 15 we show the other direction: reducing each one of these problems to 3SAT in one fell swoop.

Transitivity of reductions. Since we think of $F \leq_p G$ as saying that (as far as polynomial-time computation is concerned) $F$ is "easier or equal in difficulty to" $G$, we would expect that if $F \leq_p G$ and $G \leq_p H$, then it would hold that $F \leq_p H$. Indeed this is the case:

Solved Exercise 14.2 — Transitivity of polynomial-time reductions. For every $F, G, H : \{0,1\}^* \rightarrow \{0,1\}$, if $F \leq_p G$ and $G \leq_p H$ then $F \leq_p H$. ■

Solution: If $F \leq_p G$ and $G \leq_p H$ then there exist polynomial-time computable functions $R_1$ and $R_2$ mapping $\{0,1\}^*$ to $\{0,1\}^*$ such that for every $x \in \{0,1\}^*$, $F(x) = G(R_1(x))$ and for every $y \in \{0,1\}^*$, $G(y) = H(R_2(y))$. Combining these two equalities, we see that for every $x \in \{0,1\}^*$, $F(x) = H(R_2(R_1(x)))$ and so to show that $F \leq_p H$, it is sufficient to show that the map $x \mapsto R_2(R_1(x))$ is computable in polynomial time. But if there are some constants $c, d$ such that $R_1(x)$ is computable in time $|x|^c$ and $R_2(y)$ is computable in time $|y|^d$ then $R_2(R_1(x))$ is computable in time $(|x|^c)^d = |x|^{cd}$ which is polynomial. ■
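The composed reduction is equally short in code; R1 and R2 are hypothetical placeholders for the reductions witnessing $F \leq_p G$ and $G \leq_p H$:

def R(x, R1, R2):
    # Witnesses F <=_p H, since F(x) = G(R1(x)) = H(R2(R1(x))).
    return R2(R1(x))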

14.3 REDUCING 3SAT TO ZERO ONE AND QUADRATIC EQUATIONS

We now show our first example of a reduction. The Zero-One Linear Equations problem corresponds to the function $01EQ : \{0,1\}^* \rightarrow \{0,1\}$ whose input is a collection $E$ of linear equations in variables $x_0, \ldots, x_{n-1}$, and the output is $1$ iff there is an assignment $x \in \{0,1\}^n$ of $0/1$ values to the variables that satisfies all the equations. For example, if the input $E$ is a string encoding the set of equations

$$x_0 + x_1 + x_2 = 2$$
$$x_0 + x_2 = 1$$
$$x_1 + x_2 = 2 \qquad (14.3)$$

then $01EQ(E) = 1$ since the assignment $x = 011$ satisfies all three equations. We specifically restrict attention to linear equations in variables $x_0, \ldots, x_{n-1}$ in which every equation has the form $\sum_{i \in S} x_i = b$ where $S \subseteq [n]$ and $b \in \mathbb{N}$.¹

¹ If you are familiar with matrix notation you may note that such equations can be written as $Ax = \mathbf{b}$ where $A$ is an $m \times n$ matrix with entries in $0/1$ and $\mathbf{b} \in \mathbb{N}^m$.

If we asked the question of whether there is a solution $x \in \mathbb{R}^n$ of real numbers to $E$, then this can be solved using the famous Gaussian elimination algorithm in polynomial time. However, there is no known efficient algorithm to solve $01EQ$. Indeed, such an algorithm would imply an algorithm for 3SAT, as shown by the following theorem:

Theorem 14.2 — Hardness of $01EQ$. $3SAT \leq_p 01EQ$

Proof Idea:
A constraint $x_2 \vee \overline{x_5} \vee x_7$ can be written as $x_2 + (1 - x_5) + x_7 \geq 1$. This is a linear inequality, but since the sum on the left-hand side is at most three, we can also turn it into an equality by adding two new variables $y, z$ and writing it as $x_2 + (1 - x_5) + x_7 + y + z = 3$. (We will use fresh such variables $y, z$ for every constraint.) Finally, for every variable $x_i$, we can add a variable $x'_i$ corresponding to its negation by

adding the equation $x_i + x'_i = 1$, hence mapping the original constraint $x_2 \vee \overline{x_5} \vee x_7$ to $x_2 + x'_5 + x_7 + y + z = 3$. The main takeaway technique from this reduction is the idea of adding auxiliary variables to replace an equation such as $x_1 + x_2 + x_3 \geq 1$ that is not quite in the form we want with the equivalent (for $0/1$ valued variables) equation $x_1 + x_2 + x_3 + u + v = 3$ which is in the form we want.

Figure 14.3: Left: Python code implementing the reduction of 3SAT to $01EQ$. Right: Example output of the reduction. Code is in our repository.

Proof of Theorem 14.2. To prove the theorem we need to:

1. Describe an algorithm $R$ for mapping an input $\varphi$ for 3SAT into an input $E$ for $01EQ$.

2. Prove that the algorithm runs in polynomial time.

3. Prove that $01EQ(R(\varphi)) = 3SAT(\varphi)$ for every 3CNF formula $\varphi$.


We now proceed to do just that. Since this is our first reduction, we will spell out this proof in detail. However, it straightforwardly follows the proof idea.

Algorithm 14.3 — $3SAT$ to $01EQ$ reduction.

Input: 3CNF formula $\varphi$ with $n$ variables $x_0, \ldots, x_{n-1}$ and $m$ clauses.
Output: Set $E$ of linear equations over $0/1$ such that $3SAT(\varphi) = 1$ iff $01EQ(E) = 1$.

1: Let $E$'s variables be $x_0, \ldots, x_{n-1}$, $x'_0, \ldots, x'_{n-1}$, $y_0, \ldots, y_{m-1}$, $z_0, \ldots, z_{m-1}$.
2: for $i \in [n]$ do
3: add to $E$ the equation $x_i + x'_i = 1$
4: end for
5: for $j \in [m]$ do
6: Let the $j$-th clause be $w_0 \vee w_1 \vee w_2$ where $w_0, w_1, w_2$ are literals.
7: for $a \in [3]$ do
8: if $w_a$ is a variable $x_i$ then
9: set $t_a \leftarrow x_i$
10: end if
11: if $w_a$ is a negation $\neg x_i$ then
12: set $t_a \leftarrow x'_i$
13: end if
14: end for
15: Add to $E$ the equation $t_0 + t_1 + t_2 + y_j + z_j = 3$.
16: end for
17: return $E$
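Here is a Python sketch in the spirit of the code shown in Fig. 14.3 (though not identical to the repository's code). We use our own hypothetical conventions: a clause is a list of three literals, each a pair (i, positive), and an equation is a pair (variables, b) meaning that the sum of the listed variables equals b:

def sat3_to_01eq(n, clauses):
    # Output variables: x_i, xp_i (the negation x'_i), and slack
    # variables y_j, z_j for every clause j, as in Algorithm 14.3.
    eqs = [(["x%d" % i, "xp%d" % i], 1) for i in range(n)]   # x_i + x'_i = 1
    for j, clause in enumerate(clauses):
        ts = [("x%d" if positive else "xp%d") % i for (i, positive) in clause]
        eqs.append((ts + ["y%d" % j, "z%d" % j], 3))  # t_0+t_1+t_2+y_j+z_j = 3
    return eqs

# The clause (x_0 OR NOT x_1 OR x_2) becomes x_0 + x'_1 + x_2 + y_0 + z_0 = 3,
# together with x_i + x'_i = 1 for i = 0, 1, 2:
print(sat3_to_01eq(3, [[(0, True), (1, False), (2, True)]]))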

The reduction is described in Algorithm 14.3, see also Fig. 14.3. If the input formula $\varphi$ has $n$ variables and $m$ clauses, Algorithm 14.3 creates a set $E$ of $n + m$ equations over $2n + 2m$ variables. Algorithm 14.3 makes an initial loop of $n$ steps (each taking constant time) and then another loop of $m$ steps (each taking constant time) to create the equations, and hence it runs in polynomial time.

Let $R$ be the function computed by Algorithm 14.3. The heart of the proof is to show that for every 3CNF $\varphi$, $01EQ(R(\varphi)) = 3SAT(\varphi)$. We split the proof into two parts. The first part, traditionally known as the completeness property, is to show that if $3SAT(\varphi) = 1$ then $01EQ(R(\varphi)) = 1$. The second part, traditionally known as the soundness property, is to show that if $01EQ(R(\varphi)) = 1$ then $3SAT(\varphi) = 1$. (The names "completeness" and "soundness" derive from viewing a solution to $R(\varphi)$ as a "proof" that $\varphi$ is satisfiable, in which case these conditions correspond to completeness and soundness as defined in Section 11.1.1. However, if you find the names confusing you can simply think of completeness as the "$1$-instance maps to $1$-instance"


property and soundness as the "$0$-instance maps to $0$-instance" property.) We complete the proof by showing both parts:

• Completeness: Suppose that $3SAT(\varphi) = 1$, which means that there is an assignment $x \in \{0,1\}^n$ that satisfies $\varphi$. If we use the assignment $x_0, \ldots, x_{n-1}$ and $1 - x_0, \ldots, 1 - x_{n-1}$ for the first $2n$ variables of $E = R(\varphi)$ then we will satisfy all equations of the form $x_i + x'_i = 1$. Moreover, for every $j \in [m]$, if $t_0 + t_1 + t_2 + y_j + z_j = 3$ $(*)$ is the equation arising from the $j$-th clause of $\varphi$ (with $t_0, t_1, t_2$ being variables of the form $x_i$ or $x'_i$ depending on the literals of the clause) then our assignment to the first $2n$ variables ensures that $t_0 + t_1 + t_2 \geq 1$ (since $x$ satisfied $\varphi$) and hence we can assign values to $y_j$ and $z_j$ that will ensure that the equation $(*)$ is satisfied. Hence in this case $E = R(\varphi)$ is satisfied, meaning that $01EQ(R(\varphi)) = 1$.

• Soundness: Suppose that $01EQ(R(\varphi)) = 1$, which means that the set of equations $E = R(\varphi)$ has a satisfying assignment $x_0, \ldots, x_{n-1}$, $x'_0, \ldots, x'_{n-1}$, $y_0, \ldots, y_{m-1}$, $z_0, \ldots, z_{m-1}$. Then, since the equations contain the condition $x_i + x'_i = 1$, for every $i \in [n]$, $x'_i$ is the negation of $x_i$, and moreover, for every $j \in [m]$, if $C_j$ is the $j$-th clause of $\varphi$ and has the form $w_0 \vee w_1 \vee w_2$, then the corresponding assignment $x$ will ensure that $w_0 + w_1 + w_2 \geq 1$, implying that $C_j$ is satisfied. Hence in this case $3SAT(\varphi) = 1$. ■

14.3.1 Quadratic equations

Now that we reduced 3SAT to $01EQ$, we can use this to reduce 3SAT to the quadratic equations problem. This is the function QUADEQ in which the input is a list of $n$-variate polynomials $p_0, \ldots, p_{m-1} : \mathbb{R}^n \rightarrow \mathbb{R}$ that are all of degree at most two (i.e., they are quadratic) and with integer coefficients. (The latter condition is for convenience and can be achieved by scaling.) We define $QUADEQ(p_0, \ldots, p_{m-1})$ to equal $1$ if and only if there is a solution $x \in \mathbb{R}^n$ to the equations $p_0(x) = 0$, $p_1(x) = 0$, $\ldots$, $p_{m-1}(x) = 0$.

For example, the following is a set of quadratic equations over the variables $x_0, x_1, x_2$:

$$x_0^2 - x_0 = 0$$
$$x_1^2 - x_1 = 0$$
$$x_2^2 - x_2 = 0$$
$$1 - x_0 - x_1 + x_0 x_1 = 0 \qquad (14.4)$$

You can verify that $x \in \mathbb{R}^3$ satisfies this set of equations if and only if $x \in \{0,1\}^3$ and $x_0 \vee x_1 = 1$.

Theorem 14.4 — Hardness of quadratic equations.

$$3SAT \leq_p QUADEQ \qquad (14.5)$$

Proof Idea:
Using the transitivity of reductions (Solved Exercise 14.2), it is enough to show that $01EQ \leq_p QUADEQ$, but this follows since we can phrase the condition $x_i \in \{0,1\}$ as the quadratic constraint $x_i^2 - x_i = 0$. The takeaway technique of this reduction is that we can use nonlinearity to force continuous variables (e.g., variables taking values in $\mathbb{R}$) to be discrete (e.g., take values in $\{0,1\}$).


Proof of Theorem 14.4. By Theorem 14.2 and Solved Exercise 14.2, it is sufficient to prove that $01EQ \leq_p QUADEQ$. Let $E$ be an instance of $01EQ$ with variables $x_0, \ldots, x_{n-1}$. We map $E$ to the set of quadratic equations $E'$ that is obtained by taking the linear equations in $E$ and adding to them the quadratic equations $x_i^2 - x_i = 0$ for all $i \in [n]$. (See Algorithm 14.5.) The map $E \mapsto E'$ can be computed in polynomial time. We claim that $01EQ(E) = 1$ if and only if $QUADEQ(E') = 1$. Indeed, the only difference between the two instances is that:

• In the $01EQ$ instance $E$, the equations are over variables $x_0, \ldots, x_{n-1}$ in $\{0,1\}$.

• In the $QUADEQ$ instance $E'$, the equations are over variables $x_0, \ldots, x_{n-1} \in \mathbb{R}$ but we have the extra constraints $x_i^2 - x_i = 0$ for all $i \in [n]$.

Since for every $a \in \mathbb{R}$, $a^2 - a = 0$ if and only if $a \in \{0,1\}$, the two sets of equations are equivalent and $01EQ(E) = QUADEQ(E')$, which is what we wanted to prove. ■
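In code, the map $E \mapsto E'$ is a few lines of Python. This sketch reuses the hypothetical (variables, b) equation representation from the sketch after Algorithm 14.3, and represents a quadratic polynomial sparsely as a dictionary from monomials (tuples of variable names) to coefficients:

def eq01_to_quadeq(eqs):
    # eqs: list of (variables, b), each meaning sum(variables) == b.
    # Returns a list of polynomials (dicts monomial -> coefficient),
    # each with the meaning "polynomial == 0".
    polys, all_vars = [], set()
    for variables, b in eqs:
        all_vars.update(variables)
        p = {(v,): 1 for v in variables}   # the sum of the variables ...
        p[()] = -b                         # ... minus b
        polys.append(p)
    for v in sorted(all_vars):
        # v^2 - v = 0 holds over the reals iff v is 0 or 1.
        polys.append({(v, v): 1, (v,): -1})
    return polys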

Algorithm 14.5 — $01EQ$ to $QUADEQ$ reduction.

Input: Set $E$ of linear equations over $n$ variables $x_0, \ldots, x_{n-1}$.
Output: Set $E'$ of quadratic equations over $m$ variables $w_0, \ldots, w_{m-1}$ such that there is a $0/1$ assignment $x \in \{0,1\}^n$ satisfying the equations of $E$ iff there is an assignment $w \in \mathbb{R}^m$ satisfying the equations of $E'$.

1: That is, $01EQ(E) = QUADEQ(E')$.
2: Let $m \leftarrow n$.
3: Variables of $E'$ are set to be the same variables $x_0, \ldots, x_{n-1}$ as $E$.
4: for every equation $e \in E$ do
5: Add $e$ to $E'$
6: end for
7: for $i \in [n]$ do
8: Add to $E'$ the equation $x_i^2 - x_i = 0$.
9: end for
10: return $E'$

14.4 THE INDEPENDENT SET PROBLEM

For a graph $G = (V, E)$, an independent set (also known as a stable set) is a subset $S \subseteq V$ such that there are no edges with both endpoints in $S$ (in other words, $E(S, S) = \emptyset$). Every "singleton" (set consisting of a single vertex) is trivially an independent set, but finding larger independent sets can be challenging. The maximum independent set problem (henceforth simply "independent set") is the task of finding the largest independent set in the graph. The independent set problem is naturally related to scheduling problems: if we put an edge between two conflicting tasks, then an independent set corresponds to a set of tasks that can all be scheduled together without conflicts. The independent set problem has been studied in a variety of settings, including for example in the case of algorithms for finding structure in protein-protein interaction graphs.

As mentioned in Section 14.1, we think of the independent set problem as the function $ISET : \{0,1\}^* \rightarrow \{0,1\}$ that on input a graph $G$ and a number $k$ outputs $1$ if and only if the graph $G$ contains an independent set of size at least $k$. We now reduce 3SAT to Independent set.

Theorem 14.6 — Hardness of Independent Set. $3SAT \leq_p ISET$.

Proof Idea:

The idea is that finding a satisfying assignment to a 3SAT formula corresponds to satisfying many local constraints without creating any conflicts. One can think of "$x_{17} = 0$" and "$x_{17} = 1$" as two conflicting events, and of the constraint $x_{17} \vee \overline{x_5} \vee x_9$ as creating a conflict between the events "$x_{17} = 0$", "$x_5 = 1$" and "$x_9 = 0$", saying that these three cannot simultaneously co-occur. Using these ideas, we can think of solving a 3SAT problem as trying to schedule non conflicting events, though the devil is, as usual, in the details. The takeaway technique here is to map each clause of the original formula into a gadget, which is a small subgraph (or more generally "subinstance") satisfying some convenient properties. We will see these "gadgets" used time and again in the construction of polynomial-time reductions.

Algorithm 14.7 — $3SAT$ to $ISET$ reduction.

Input: $3SAT$ formula $\varphi$ with $n$ variables and $m$ clauses.
Output: Graph $G = (V, E)$ and number $k$, such that $G$ has an independent set of size $k$ iff $\varphi$ has a satisfying assignment.

1: That is, $3SAT(\varphi) = ISET(G, k)$.
2: Initialize $V \leftarrow \emptyset$, $E \leftarrow \emptyset$
3: for every clause $C = y \vee y' \vee y''$ of $\varphi$ do
4: Add three vertices $(C, y), (C, y'), (C, y'')$ to $V$
5: Add edges $\{(C, y), (C, y')\}$, $\{(C, y'), (C, y'')\}$, $\{(C, y''), (C, y)\}$ to $E$.
6: end for
7: for every pair of distinct clauses $C, C'$ in $\varphi$ do
8: for every $i \in [n]$ do
9: if $C$ contains literal $x_i$ and $C'$ contains literal $\overline{x_i}$ then
10: Add edge $\{(C, x_i), (C', \overline{x_i})\}$ to $E$
11: end if
12: end for
13: end for
14: return $G = (V, E)$

Figure 14.4: An example of the reduction of 3SAT to $ISET$ for the case that the original input formula is $\varphi = (x_0 \vee x_1 \vee x_2) \wedge (\overline{x_0} \vee x_1 \vee \overline{x_2}) \wedge (x_1 \vee \overline{x_2} \vee x_3)$. We map each clause of $\varphi$ to a triangle of three vertices, each tagged above with "$x_i = 0$" or "$x_i = 1$" depending on the value of $x_i$ that would satisfy the particular literal. We put an edge between every two literals that are conflicting (i.e., tagged with "$x_i = 0$" and "$x_i = 1$" respectively).
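The Python code referenced in Fig. 14.5 implements this transformation; the following is a compatible sketch in our own hypothetical conventions (a clause is a list of three distinct literals, each a pair (i, positive); a vertex is a pair of a clause index and a literal):

def sat3_to_iset(clauses):
    vertices = [(j, lit) for j, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for j, clause in enumerate(clauses):
        for a in range(3):                  # the triangle edges of clause j
            for b in range(a + 1, 3):
                edges.add(frozenset([(j, clause[a]), (j, clause[b])]))
    for j1, (i1, pos1) in vertices:         # conflict edges across clauses
        for j2, (i2, pos2) in vertices:
            if j1 != j2 and i1 == i2 and pos1 != pos2:
                edges.add(frozenset([(j1, (i1, pos1)), (j2, (i2, pos2))]))
    return vertices, edges, len(clauses)    # k equals the number of clauses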

Proof of Theorem 14.6. Given a 3SAT formula $\varphi$ on $n$ variables and with $m$ clauses, we will create a graph $G$ with $3m$ vertices as follows. (See Algorithm 14.7, see also Fig. 14.4 for an example and Fig. 14.5 for Python code.)

• A clause $C$ in $\varphi$ has the form $C = y \vee y' \vee y''$ where $y, y', y''$ are literals (variables or their negation). For each such clause $C$, we will

add three vertices to $G$, and label them $(C, y)$, $(C, y')$, and $(C, y'')$ respectively. We will also add the three edges between all pairs of these vertices, so they form a triangle. Since there are $m$ clauses in $\varphi$, the graph $G$ will have $3m$ vertices.

• In addition to the above edges, we also add an edge between every pair of vertices of the form $(C, y)$ and $(C', y')$ where $y$ and $y'$ are conflicting literals. That is, we add an edge between $(C, y)$ and $(C', y')$ if there is an $i$ such that $y = x_i$ and $y' = \overline{x_i}$ or vice versa.

The algorithm constructing $G$ based on $\varphi$ takes polynomial time since it involves two loops, the first taking $O(m)$ steps and the second taking $O(m^2 n)$ steps (see Algorithm 14.7). Hence to prove the theorem we need to show that $\varphi$ is satisfiable if and only if $G$ contains an independent set of $m$ vertices. We now show both directions of this equivalence:

Part 1: Completeness. The "completeness" direction is to show that if $\varphi$ has a satisfying assignment $x^*$, then $G$ has an independent set $S^*$ of $m$ vertices. Let us now show this.

Indeed, suppose that $\varphi$ has a satisfying assignment $x^* \in \{0,1\}^n$. Then for every clause $C = y \vee y' \vee y''$ of $\varphi$, one of the literals $y, y', y''$ must evaluate to true under the assignment $x^*$ (as otherwise it would not satisfy $\varphi$). We let $S^*$ be a set of $m$ vertices that is obtained by choosing for every clause $C$ one vertex of the form $(C, y)$ such that $y$ evaluates to true under $x^*$. (If there is more than one such vertex for the same $C$, we arbitrarily choose one of them.)

We claim that $S^*$ is an independent set. Indeed, suppose otherwise that there was a pair of vertices $(C, y)$ and $(C', y')$ in $S^*$ that have an edge between them. Since we picked one vertex out of each triangle corresponding to a clause, it must be that $C \neq C'$. Hence the only way that there is an edge between $(C, y)$ and $(C', y')$ is if $y$ and $y'$ are conflicting literals (i.e. $y = x_i$ and $y' = \overline{x_i}$ for some $i$). But then they can't both evaluate to true under the assignment $x^*$, which contradicts the way we constructed the set $S^*$. This completes the proof of the completeness condition.

Part 2: Soundness. The "soundness" direction is to show that if $G$ has an independent set $S^*$ of $m$ vertices, then $\varphi$ has a satisfying assignment $x^* \in \{0,1\}^n$. Let us now show this.

Indeed, suppose that $G$ has an independent set $S^*$ with $m$ vertices. We will define an assignment $x^* \in \{0,1\}^n$ for the variables of $\varphi$ as follows. For every $i \in [n]$, we set $x_i^*$ according to the following rules:

• If $S^*$ contains a vertex of the form $(C, x_i)$ then we set $x_i^* = 1$.

• If $S^*$ contains a vertex of the form $(C, \overline{x_i})$ then we set $x_i^* = 0$.

• If $S^*$ does not contain a vertex of either of these forms, then it does not matter which value we give to $x_i^*$, but for concreteness we'll set $x_i^* = 0$.

The first observation is that $x^*$ is indeed well defined, in the sense that the rules above do not conflict with one another, and ask to set $x_i^*$ to be both $0$ and $1$. This follows from the fact that $S^*$ is an independent set and hence if it contains a vertex of the form $(C, x_i)$ then it cannot contain a vertex of the form $(C', \overline{x_i})$.

We now claim that $x^*$ is a satisfying assignment for $\varphi$. Indeed, since $S^*$ is an independent set, it cannot have more than one vertex inside each one of the $m$ triangles $(C, y), (C, y'), (C, y'')$ corresponding to a clause of $\varphi$. Hence since $|S^*| = m$, it must have exactly one vertex in each such triangle. For every clause $C$ of $\varphi$, if $(C, y)$ is the vertex in $S^*$ in the triangle corresponding to $C$, then by the way we defined $x^*$, the literal $y$ must evaluate to true, which means that $x^*$ satisfies this clause. Therefore $x^*$ satisfies all clauses of $\varphi$, which is the definition of a satisfying assignment. This completes the proof of Theorem 14.6. ■

Figure 14.5: The reduction of 3SAT to Independent Set. On the righthand side is Python code that implements this reduction. On the lefthand side is a sample output of the reduction. We use black for the "triangle edges" and red for the "conflict edges". Note that the satisfying assignment $x^* = 0110$ corresponds to the

independent set $(0, \neg x_3)$, $(1, \neg x_0)$, $(2, x_2)$.

14.5 SOME EXERCISES AND ANATOMY OF A REDUCTION.

Reductions can be confusing, and working out exercises is a great way to gain more comfort with them. Here is one such example. As usual, I recommend you try it out yourself before looking at the solution.

Solved Exercise 14.3 — Vertex cover. A vertex cover in a graph $G = (V, E)$ is a subset $S \subseteq V$ of vertices such that every edge touches at least one vertex of $S$ (see Fig. 14.6). The vertex cover problem is the task to determine, given a graph $G$ and a number $k$, whether there exists a vertex cover in the graph with at most $k$ vertices. Formally, this is the function $VC : \{0,1\}^* \rightarrow \{0,1\}$ such that for every $G = (V, E)$ and $k \in \mathbb{N}$,

VC if and only if there exists a vertex cover such that . Prove(퐺, 푘) that = 1 SAT VC. 푆 ⊆ 푉 |푆| ≤ 푘 ■ 푝 3 ≤ Solution: The key observation is that if is a vertex cover that touches all vertices, then there is no edge such that both ’s end- points are in the set 푆, and ⊆ vice 푉 versa. In other words, is a vertex cover if and only if⧵ is an independent푒 set. Since푠 the size of is 푆 = 푉, we see푆 that the polynomial-time map 푆 (where푆 is the number of vertices of ) satisfies that푆 VC|푉 | − |푆| ISET which means that it is a Figure 14.6:A vertex cover in a graph is a subset of 푅(퐺, 푘) = (퐺, 푛 − 푘) 푛 퐺 vertices that touches all edges. In this -vertex graph, reduction from independent set to vertex cover. the filled vertices are a vertex cover. (푅(퐺, 푘)) = (퐺, 푘) ■ 7 3 Solved Exercise 14.4 — Clique is equivalent to independent set. The maximum clique problem corresponds to the function CLIQUE such that for a graph and a number , CLIQUE ∗iff there is a subset of vertices such that for every distinct ∶ {0, 1}, the→ {0, edge 1} is in . Such a set퐺 is known as a clique푘 . (퐺, 푘) = 1 Prove푆 that CLIQUE푘 ISET and ISET CLIQUE푢,. 푣 ∈ 푆 푢, 푣 퐺 ■ 푝 푝 ≤ ≤ Solution: If is a graph, we denote by its complement which is the graph on the same vertices and such that for every distinct 퐺 =, the(푉 , edge퐸) is present in if퐺 and only if this edge is not present in . 푉 푢, 푣This ∈ means 푉 that for{푢, every 푣} set , is an퐺 independent set in if and only if is퐺 a clique in . Therefore for every , ISET CLIQUE . Since the map 푆 푆 can be computed efficiently,퐺 this yields a푆 reduction ISET퐺 CLIQUE. Moreover,푘 since(퐺, 푘) = this yields(퐺, a 푘) reduction in the other퐺 ↦ direction퐺 as well. 푝 ≤ 퐺 = 퐺■

14.5.1 Dominating set In the two examples above, the reduction was almost “trivial”: the reduction from independent set to vertex cover merely changes the number to , and the reduction from independent set to clique flips edges to non-edges and vice versa. The following exercise re- quires a푘 somewhat푛 − 푘 more interesting reduction. Solved Exercise 14.5 — Dominating set. A dominating set in a graph is a subset of vertices such that for every is a neighbor in of some (see Fig. 14.7). The dominating set⧵퐺 problem = (푉 , 퐸) 푆 ⊆ 푉 푢 ∈ 푉 푆 퐺 푠 ∈ 푆 466 introduction to theoretical computer science

is the task, given a graph and number , of determining whether there exists a dominating set with . Formally, this is the function DS 퐺 = (푉 , 퐸) such that푘DS iff there is a dominating set in of∗ at most푆 ⊆vertices. 푉 |푆| ≤ 푘 Prove that ISET DS∶ {0,. 1} → {0, 1} (퐺, 푘) = 1 퐺 푘 ■ 푝 ≤ Solution: Since we know that ISET VC, using transitivity, it is enough to show that VC DS. As Fig. 14.7 shows, a dominating set is 푝 not the same thing as a vertex≤ cover. However, we can still relate 푝 the two problems. The≤ idea is to map a graph into a graph such that a vertex cover in would translate into a dominating set in and vice versa. We do so by including in 퐺 all the vertices퐻 퐺 and edges of , but for every edge of we also add to a Figure 14.7: A dominating set is a subset of vertices new퐻 vertex and connect it to both and .퐻 Let be the number such that every vertex in the graph is either in or a of isolated vertices퐺 in . The idea behind{푢, 푣} the퐺 proof is that we퐻 can neighbor of . The figure above are two푆 copies of the 푢,푣 same graph. The red vertices on the left are a vertex 푤 푢 푣 ℓ 푆 transform a vertex cover of vertices in into a dominating set cover that is푆 not a dominating set. The blue vertices of vertices in 퐺 by adding to all the isolated vertices, and on the right are a dominating set that is not a vertex cover. moreover we can transform푆 every푘 sized퐺 dominating set in into푘 a + vertex ℓ cover in 퐻. We now give푆 the details. Description of the algorithm. Given푘 + ℓ an instance for the 퐻 vertex cover problem,퐺 we will map into an instance for the dominating set problem as follows (see Fig. 14.8(퐺,for 푘)Python′ implementation): 퐺 (퐻, 푘 )

Algorithm 14.8 — to reduction. Input: Graph and number . 푉 퐶 퐷푆 Output: Graph and number , such that has a vertex퐺 = cover (푉 , 퐸) of′ size′ -iff 푘has a dominating′ set of size 퐻 = (푉 , 퐸 ) 푘 1: 퐺That is, ′ 푘 퐻, 2: Initialize 푘 ′ 3: for every퐷푆(퐻, edge′ 푘 ) =′ 퐼푆퐸푇do (퐺, 푘) 4: Add vertex푉 ← 푉 , 퐸to← 퐸 5: Add edges {푢, 푣} ∈ 퐸,′ to . 푢,푣 6: end for 푤 푉 ′ 푢,푣 푢,푣 7: Let number{푢, of 푤 isolated} {푣, vertices 푤 } in퐸 8: return ℓ ← ′ ′ 퐺 Algorithm 14.8(퐻 =runs (푉 in, 퐸 polynomial) , 푘 + ℓ) time, since the loop takes steps where is the number of edges, with each step can be implemented in constant or at most linear time (depending on the 푂(푚) 푚 polynomial-time reductions 467

representation of the graph ). Counting the number of isolated vertices in an vertex graph can be done in time if is represented in the adjacency퐻 matrix representation and2 time if it is represented푛 in the adjacency퐺 list representation.푂(푛 Regardless) 퐺 the algorithm runs in polynomial time. 푂(푛) To complete the proof we need to prove that for every , if is the output of Algorithm 14.8 on input , then DS′ VC . We split the proof into two퐺, parts. 푘 The completeness퐻, 푘 ′ part is that if VC then (퐺,DS 푘) . The soundness(퐻, 푘part) is = that if(퐺,DS 푘) then VC ′ . Completeness. Suppose that(퐺,′VC 푘) = 1 . Then(퐻, 푘 there) = is 1 a - tex cover of at most(퐻, 푘 )vertices. = 1 Let (퐺,be 푘)the = set 1 of isolated vertices in and be their number.(퐺, Then 푘) = 1 . We claim that 푆is a ⊆ dominating 푉 set in푘 . Indeed for퐼 every vertex of there are three퐺 cases:ℓ ′ |푆 ∪ 퐼| ≤ 푘 + ℓ ′ 푆 ∪ 퐼 퐻 푣 퐻 • Case 1: is an isolated vertex of . In this case is in .

• Case 2: 푣 is a non-isolated vertex퐺 of and hence푣 there푆 is ∪ an 퐼 edge of for some . In this case since is a vertex cover, one of has푣 to be in , and hence either퐺 or a neighbor of has to {푢,be in 푣} 퐺 . 푢 푆 푢, 푣 푆 푣 푣 • Case푆 3: ⊆is 푆 of∪ 퐼 the form for some two neighbors in . But then since is a vertex cover,′ one of has to be in′ and 푢,푢 hence 푣contains a neighbor푤 of . ′ 푢, 푢 퐺 푆 푢, 푢 푆 We conclude푆 that is푣 a dominating set of size at most in and hence under the assumption that VC , DS′ .′ 푆 ∪ 퐼 푘 =Soundness. 푘′ + ℓ′ 퐻 Suppose that DS . Then there(퐺.푘) is a domi- = 1 nating(퐻 , set 푘 ) =of 1 size at most ′in . For every edge in the graph , if contains the′ vertex(퐻, 푘 ) =then 1 we remove this ver- tex and add퐷 in its place. The푘 = only 푘 + two ℓ neighbors퐻 of are{푢, 푣}and 푢,푣 , and since퐺 is퐷 a neighbor of both 푤 and of , replacing 푢,푣 with maintains푢 the property that it is a dominating푤 set. Moreover,푢 푢,푣 푢,푣 푣this change cannot푢 increase the size푤 of . Thus푣 following this푤 mod- ification,푣 we can assume that is a dominating set of at most vertices that does not contain any vertices퐷 of the form . Let be the set of isolated vertices퐷 in . These vertices are also푘 + ℓ 푢,푣 isolated in and hence must be included in (an isolated푤 ver- tex must퐼 be in any dominating set, since퐺 it has no neighbors). We let 퐻 . Then . We claim퐷 that is a vertex cover in . Indeed,⧵ for every edge of , either the vertex or 푆 = 퐷 퐼 |푆| ≤ 퐼 푆 푢,푣 퐺 {푢, 푣} 퐺 푤 468 introduction to theoretical computer science

one of its neighbors must be in by the dominating set property. But since we ensured doesn’t contain any of the vertices of the form , it must be the case that푆 either or is in . This shows that is a vertex cover푆 of of size at most , hence proving that 푢,푣 VC 푤 . 푢 푣 푆 푆 퐺 푘 ■ A corollary(퐺, 푘) = 1 of Algorithm 14.8 and the other reduction we have seen so far is that if DS P (i.e., dominating set has a polynomial-time al- gorithm) then SAT P (i.e., SAT has a polynomial-time algorithm). By the contra-positive,∈ if SAT does not have a polynomial-time algo- rithm then neither3 does∈ dominating3 set. 3 Figure 14.8: Python implementation of the reduction from vertex cover to dominating set, together with an example of an input graph and the resulting output graph. This reduction allows to transform a hypothet- ical polynomial-time algorithm for dominating set (a “whistling pig”) into a hypothetical polynomial-time algorithm for vertex-cover (a “flying horse”).

14.5.2 Anatomy of a reduction

Figure 14.9: The four components of a reduction, illustrated for the particular reduction of vertex cover to dominating set. A reduction from problem to problem is an algorithm that maps an input for into an input for . To show that the reduction퐹 is correct퐺 we need to show the properties of efficiency푥 퐹: algorithm runs푅(푥) in polynomial퐺 time, completeness: if then , and soundness: if 푅 then . 퐹(푥) = 1 퐺(푅(푥)) = 1 퐹(푅(푥)) = 1 퐺(푥) = 1

The reduction of Solved Exercise 14.5 gives a good illustration of the anatomy of a reduction. A reduction consists of four parts: • Algorithm description: This is the description of how the algorithm maps an input into the output. For example, in Solved Exercise 14.5 this is the description of how we map an instance of the vertex cover problem into an instance of the dominating set problem. ′ (퐺, 푘) (퐻, 푘 ) polynomial-time reductions 469

• Algorithm analysis: It is not enough to describe how the algorithm works but we need to also explain why it works. In particular we need to provide an analysis explaining why the reduction is both efficient (i.e., runs in polynomial time) and correct (satisfies that for every ). Specifically, the components of analysis of a reduction include: 퐺(푅(푥)) = 퐹 (푥) 푥 – Efficiency: We need to show that runs in polynomial time. In 푅 most reductions we encounter this part is straightforward, as the reductions we typically use involve푅 a constant number of nested loops, each involving a constant number of operations. For ex- ample, the reduction of Solved Exercise 14.5 just enumerates over the edges and vertices of the input graph. – Completeness: In a reduction demonstrating , the completeness condition is the condition that for every , 푝 if then 푅. Typically we construct퐹 ≤ 퐺 the ∗ reduction to ensure that this holds, by giving a way to푥 map∈ {0, a 1} “certificate/solution”퐹 (푥) = 1 퐺(푅(푥)) certifying = 1 that into a solution certifying that . For example, in Solved Exercise 14.5 we constructed the graph such that퐹 (푥)for every = 1 vertex cover in , the set 퐺(푅(푥))(where = 1 is the isolated vertices) would be a dominating set in . 퐻 푆 – Soundness:퐺 푆This ∪ 퐼 is the condition퐼 that if then or (taking퐻 the contrapositive) if then . This is sometimes straightforward퐹 (푥) but = can0 often be 퐺(푅(푥))harder to = show 0 than the completeness condition,퐺(푅(푥)) and in = more 1 advanced퐹 (푥) = 1 reductions (such as the reduction SAT ISET of Theorem 14.6) demonstrating soundness is the main part 푝 of the analysis. For example, in Solved Exercise 14.5≤ to show soundness we needed to show that for every dominating set in the graph , there exists a vertex cover of size at most in the graph (where is the number of isolated vertices). 퐷 This was challenging퐻 since the dominating푆 set might not|퐷| be − ℓ necessarily the퐺 one we “hadℓ in mind”. In particular, in the proof above we needed to modify to ensure that it퐷 does not contain vertices of the form , and it was important to show that this modification still maintains퐷 the property that is a dominating 푢,푣 set, and also does not푤 make it bigger. 퐷 Whenever you need to provide a reduction, you should make sure that your description has all these components. While it is sometimes tempting to weave together the description of the reduction and its analysis, it is usually clearer if you separate the two, and also break down the analysis to its three components of efficiency, completeness, and soundness. 470 introduction to theoretical computer science

14.6 REDUCING INDEPENDENT SET TO MAXIMUM CUT

We now show that the independent set problem reduces to the maxi- mum cut (or “max cut”) problem, modeled as the function MAXCUT that on input a pair outputs iff contains a cut of at least edges. Since both are graph problems, a reduction from independent set to max cut maps(퐺, one 푘) graph into1 the퐺 other, but as we will see the푘 output graph does not have to have the same vertices or edges as the input graph.

Theorem 14.9 — Hardness of Max Cut. ISET MAXCUT

푝 Proof Idea: ≤ We will map a graph into a graph such that a large indepen- dent set in becomes a partition cutting many edges in . We can think of a cut in as coloring퐺 each vertex퐻 either “blue” or “red”. We will add a special퐺 “source” vertex , connect it to all other퐻 vertices, and assume without퐻 loss of generality∗ that it is colored blue. Hence the more vertices we color red, the푠 more edges from we cut. Now, for every edge in the original graph we will add∗ a special “gad- get” which will be a small subgraph that involves , 푠, the source , and two other additional푢, 푣 vertices. We design퐺 the gadget in a way so∗ that if the red vertices are not an independent set in푢 푣 then the cor-푠 responding cut in will be “penalized” in the sense that it would not cut as many edges. Once we set for ourselves this퐺 objective, it is not hard to find a퐻 gadget that achieves it see the proof below. Once again the takeaway technique is to use (this time a slightly more clever) gadget. −

⋆ Figure 14.10: In the reduction of ISET to MAXCUT we map an -vertex -edge graph into the vertex and edge graph as follows. The푛 graph 푚contains a special퐺 “source” vertex푛 + 2푚 +, 1vertices 푛 + 5푚, and ver-퐻 tices ∗ 퐻 with each pair cor- 0 푛−1 responding0푠 푛1 to an0 edge푣 , of1 … , 푣. We put an2푚 edge be- tween푒0, 푒and0, … , 푒for푚−1 every, 푒푚−1 , and if the -th edge of ∗ was then퐺 we add the five edges 푠 푣푖 푖 ∈ [푛] . The푡 intent 푖 푗 is∗ that0 if퐺 cut∗ at1 most(푣 , 푣 one) 0 of 1 from0 1 then we’ll 푡 푡 푖 푡 푗 푡 푡 푡 (푠be able, 푒 ), to (푠 cut, 푒 ),out (푣 of, 푒 these), (푣 five, 푒 ), (푒edges,, 푒∗ ) while if we cut both and from then푣푖, 푣푗 we’ll be푠 able to cut at most three of them.4 ∗ 푣푖 푣푗 푠 Proof of Theorem 14.9. We will transform a graph of vertices and edges into a graph of vertices and edges in the following way (see also Fig. 14.10). The graph 퐺contains푛 all vertices푚 퐻 푛 + 1 + 2푚 푛 + 5푚 퐻 polynomial-time reductions 471

of (though not the edges between them!) and in addition also has: 퐺* A special vertex that is connected to all the vertices of퐻 * For every edge ∗ , two vertices such that is connected to and푠 is connected to , and moreover we add퐺 the 0 1 0 edges 푒 = {푢, 푣} ∈to 퐸(퐺). 푒 , 푒 푒 1 Theorem 14.9푢will∗ follow푒 by∗ showing푣 that contains an indepen- 0 1 0 1 dent set{푒 of, 푒 size}, {푒 at least, 푠 }, {푒if, and 푠 } only퐻 if has a cut cutting at least edges. We now prove both directions퐺 of this equivalence: Part 1: Completeness.푘 If is an independent퐻 -sized set in , then we푘 + can 4푚 define to be a cut in of the following form: we let con- tain all the vertices of and퐼 for every edge 푘 ,퐺 if and then푆 we add to 퐻; if and then we add푆 to ; and if and 퐼 then we add both푒 =and {푢, 푣}to ∈ 퐸(퐺). (We don’t푢 ∈ 퐼 1 0 need푣 to ∉ worry 퐼 about the푒 case푆 that푢 both ∉ 퐼 and푣 ∈are 퐼 in since it is푒 an 0 1 푆independent푢 ∉ set.)퐼 We푣 ∉ can 퐼 verify that in all cases푒 the푒 number푆 of edges from to its complement in the gadget푢 corresponding푣 퐼 to will be four (see Fig. 14.11). Since is not in , we also have edges from to , for a total푆 of edges.∗ 푒 ∗ Part 2: Soundness. 푠Suppose that푆 is a cut in 푘 that cuts at least푠 퐼 푘edges. + 4푚 We can assume that is not in (otherwise we can “flip” to its complement , since푆 this∗ does퐻 not change the size of퐶 the= 푘 cut). + 4푚 Now let be the set of vertices푠 in that푆 correspond to the original vertices푆 of . If was an푆 independent set of size then we would be done. This퐼 might not always be the푆 case but we will see that if is not an independent퐺 퐼 set then it’s also larger than . Specifically,푘 we define be the set of edges in that are contained in퐼 and let (i.e., if is an independent푘 set then 푖푛 and푚 = |퐸(퐼, 퐼)|). By the properties of our퐺 gadget we know 표푢푡 푖푛 that퐼 for every푚 edge= 푚 −of 푚 , we can퐼 cut at most three edges when 푖푛 표푢푡 푚both=and 0 are푚 in =, and푚 at most four edges otherwise. Hence the number of edges{푢, cut 푣} by 퐺satisfies 푢 푣 푆 . Since we 푖푛 표푢푡 get that 퐶 . Now푆 we can transform퐶 ≤ |퐼| + 3푚into an+ independent 4푚 = 푖푛 푖푛 푖푛 |퐼|set + 3푚by going+ 4(푚 over − every 푚 ) =one |퐼| of + the 4푚 − 푚edges that퐶 are = inside 푘 + 4푚and Figure 14.11: In the reduction of independent set 푖푛 to max cut, for every , we have a “gadget” removing′ |퐼| one − 푚 of the≥ endpoints푘 of the edge from퐼 it. The resulting set 푖푛 corresponding to the -th edge in the is an퐼 independent set in the graph 푚of size and퐼 so this ′ original graph. If we푡 think ∈ [푚] of the side of the cut 푖 푗 concludes the proof of the soundness condition. 퐼 containing the special푡 source vertex푒 = {푣 as, 푣 “white”} and the other side as “blue”, then the leftmost and center 푖푛 ■ ∗ 퐺 |퐼| − 푚 ≥ 푘 figures show that if and are not푠 both blue then we can cut four edges from the gadget. In contrast, by enumerating all possibilities푣푖 푣푗 one can verify that if both and are blue, then no matter how we color the intermediate vertices , we will cut at

most three푢 edges푣 from the gadget.0 The1 figure above contains only the gadget edges푒 and푡 , 푒푡 ignores the edges connecting to the vertices .

∗ 푠 푣0, … , 푣푛−1 472 introduction to theoretical computer science

Figure 14.12: The reduction of independent set to max cut. On the righthand side is Python code implementing the reduction. On the lefthand side is an example output of the reduction where we apply it to the independent set instance that is obtained by running the reduction of Theorem 14.6 on the 3CNF formula .

(푥0 ∨푥3 ∨푥2)∧(푥0 ∨푥1 ∨푥2)∧(푥1 ∨푥2 ∨푥3)

14.7 REDUCING 3SAT TO LONGEST PATH

Note: This section is still a little messy; feel free to skip it or just read it without going into the proof details. The proof appears in Section 7.5 in Sipser’s book. One of the most basic algorithms in Computer Science is Dijkstra’s algorithm to find the shortest path between two vertices. We now show that in contrast, an efficient algorithm for the longest path problem would imply a polynomial-time algorithm for 3SAT.

Theorem 14.10 — Hardness of longest path.

SAT LONGPATH (14.6) Figure 14.13: We can transform a 3SAT formula into 푝 a graph such that the longest path in the graph Proof Idea: 3 ≤ would correspond to a satisfying assignment in휑 . In To prove Theorem 14.10 need to show how to transform a 3CNF this graph,퐺 the black colored part corresponds to the퐺 variables of and the blue colored part corresponds휑 formula into a graph and two vertices such that has a path to the vertices. A sufficiently long path would have to of length at least if and only if is satisfiable. The idea of the reduc- first “snake”휑 through the black part, for each variable 휑 퐺 푠, 푡 퐺 choosing either the “upper path” (corresponding tion is sketched in Fig. 14.13 and Fig. 14.14. We will construct a graph to assigning it the value True) or the “lower path” that contains a potentially푘 long “snaking휑 path” that corresponds to (corresponding to assigning it the value False). Then all variables in the formula. We will add a “gadget” corresponding to achieve maximum length the path would traverse through the blue part, where to go between two to each clause of in a way that we would only be able to use the vertices corresponding to a clause such as gadgets if we have a satisfying assignment. , the corresponding vertices would have to have 17 32 휑 been not traversed before. 푥 ∨ 푥 ∨ 푥57 def⋆ TSAT2LONGPATH(φ): """Reduce 3SAT to LONGPATH""" def var(v): # return variable and True/False depending if positive or negated

↪ return int(v[2:]),False if v[0]=="¬" else int(v[1:]),True

n = numvars(φ)↪ Figure 14.14 clauses = getclauses(φ) : The graph above with the longest path marked on it, the part of the path corresponding to m = len(clauses) variables is in green and part corresponding to the G =Graph() clauses is in pink. polynomial-time reductions 473

G.edge("start","start_0") for i in range(n): # add 2 length-m paths per variable G.edge(f"start_{i}",f"v_{i}_{0}_T") G.edge(f"start_{i}",f"v_{i}_{0}_F") for j in range(m-1): G.edge(f"v_{i}_{j}_T",f"v_{i}_{j+1}_T") G.edge(f"v_{i}_{j}_F",f"v_{i}_{j+1}_F") G.edge(f"v_{i}_{m-1}_T",f"end_{i}") G.edge(f"v_{i}_{m-1}_F",f"end_{i}") if i

↪ for v in enumerate(C): i,sign = var(v[1]) s = "F" if sign else "T" G.edge(f"C_{j}_in",f"v_{i}_{j}_{s}") G.edge(f"v_{i}_{j}_{s}",f"C_{j}_out") if j

Proof of Theorem 14.10. We build a graph that “snakes” from to as follows. After we add a sequence of long loops. Each loop has an “upper path” and a “lower path”. A simple퐺 path cannot take both푠 the푡 upper path and푠 the lower path, and so푛 it will need to take exactly one of them to reach from . Our intention is that a path in the graph will correspond to an as- signment 푠 in푡 the sense that taking the upper path in the loop corresponds to푛 assigning and taking the lower path cor-푡ℎ responds to푥 ∈ assigning {0, 1} . When we are done snaking through all푖 푖 the loops corresponding to the푥 variables= 1 to reach we need to pass 푖 through “obstacles”:푥 for= each 0 clause we will have a small gad- get consisting푛 of a pair of vertices that have three푡 paths between them. For푚 example, if the clause had푗 the form then 푗 푗 one path would go through푡ℎ a vertex푠 ,in 푡 the lower loop corresponding 17 55 72 to , one path would go푗 through a vertex in the푥 upper∨ 푥 loop∨ 푥 corre- sponding to and the third would go through the lower loop cor- 17 responding푥 to . We see that if we went in the first stage according 55 to a satisfying푥 assignment then we will be able to find a free vertex to 72 travel from to푥 . We link to , to , etc and link to . Thus

푗 푗 1 2 2 3 푚 푠 푡 푡 푠 푡 푠 푡 푡 474 introduction to theoretical computer science

a satisfying assignment would correspond to a path from to that goes through one path in each loop corresponding to the variables, and one path in each loop corresponding to the clauses. We푠 can푡 make the loop corresponding to the variables long enough so that we must take the entire path in each loop in order to have a fighting chance of getting a path as long as the one corresponds to a satisfying assign- ment. But if we do that, then the only way if we are able to reach is if the paths we took corresponded to a satisfying assignment, since otherwise we will have one clause where we cannot reach from푡 without using a vertex we already used before. 푗 푗 푗 푡 푠 ■

14.7.1 Summary of relations We have shown that there are a number of functions for which we can prove a statement of the form “If P then SAT P”. Hence coming up with a polynomial-time algorithm for even퐹 one of these problems will entail a polynomial-time퐹 algorithm ∈ 3 for SAT∈ (see for example Fig. 14.16). In Chapter 15 we will show the inverse direction 3 Figure 14.15: The result of applying the reduction of (“If SAT P then P”) for these functions, hence allowing us to SAT to LONGPATH to the formula conclude that they have equivalent complexity to SAT. . 3 ∈ 퐹 ∈ 3 (푥0 ∨ ¬푥3 ∨ 푥2) ∧ (¬푥0 ∨ 푥1 ∨ ¬푥2) ∧ (푥1 ∨ 푥2 ∨ ¬푥3) 3 Figure 14.16: So far we have shown that P EXP and that several problems we care about such as SAT and MAXCUT are in EXP but it is not⊆ known whether or not they are in EXP. However, since 3SAT MAXCUT we can rule out the possiblity that MAXCUT P but SAT P. The relation of 푝 3P poly to≤ the class EXP is not known. We know that EXP does not contain∈ P 3poly since∉ the latter even contains/ uncomputable functions, but we do not / know whether ot not EXP P poly (though it is believed that this is not the case and in particular that / both SAT and MAXCUT are⊆ not in P poly). Chapter Recap 3 / • The computational complexity of many seemingly ✓ unrelated computational problems can be related to one another through the use of reductions. • If then a polynomial-time algorithm for can be transformed into a polynomial-time 푝 algorithm퐹 ≤ for 퐺. • Equivalently,퐺 if and does not have a polynomial-time퐹 algorithm then neither does . 푝 • We’ve developed퐹 many ≤ techniques퐺 퐹 to show that SAT for interesting functions . Sometimes퐺 we can do so by using transitivity of reductions: if 푝 3SAT ≤ 퐹and then SAT 퐹 .

푝 푝 푝 3 ≤ 퐺 퐺 ≤ 퐹 3 ≤ 퐹 polynomial-time reductions 475

14.8 EXERCISES

14.9 BIBLIOGRAPHICAL NOTES

Several notions of reductions are defined in the literature. The notion defined in Definition 14.1 is often known as a mapping reduction, many to one reduction or a Karp reduction. The maximal (as opposed to maximum) independent set is the task of finding a “local maximum” of an independent set: an independent set such that one cannot add a vertex to it without losing the in- dependence property (such a set is known as a vertex cover). Unlike finding푆 a maximum independent set, finding a maximal independent set can be done efficiently by a greedy algorithm, but this local maxi- mum can be much smaller than the global maximum. Reduction of independent set to max cut taken from these notes. Image of Hamiltonian Path through Dodecahedron by Christoph Sommer. We have mentioned that the line between reductions used for algo- rithm design and showing hardness is sometimes blurry. An excellent example for this is the area of SAT Solvers (see [Gom+08]). In this field people use algorithms for SAT (that take exponential time inthe worst case but often are much faster on many instances in practice) together with reductions of the form SAT to derive algorithms for other functions of interest. 푝 퐹 ≤ 퐹

Learning Objectives: • Introduce the class NP capturing a great many important computational problems • NP-completeness: evidence that a problem might be intractable. • The P vs NP problem.

15 NP, NP completeness, and the Cook-Levin Theorem

“In this paper we give theorems that suggest, but do not imply, that these problems, as well as many others, will remain intractable perpetually”, Richard Karp, 1972

“Sad to say, but it will be many more years, if ever before we really understand the Mystical Power of Twoness… 2-SAT is easy, 3-SAT is hard, 2-dimensional matching is easy, 3-dimensional matching is hard. Why? oh, Why?” Eugene Lawler

So far we have shown that 3SAT is no harder than Quadratic Equa- tions, Independent Set, Maximum Cut, and Longest Path. But to show that these problems are computationally equivalent we need to give re- ductions in the other direction, reducing each one of these problems to 3SAT as well. It turns out we can reduce all three problems to 3SAT in one fell swoop. In fact, this result extends far beyond these particular problems. All of the problems we discussed in Chapter 14, and a great many other problems, share the same commonality: they are all search problems, where the goal is to decide, given an instance , whether there exists a solution that satisfies some condition that can be verified inpoly- nomial time. For example, in 3SAT, the instance푥 is a formula and the solution is푦 an assignment to the variable; in Max-Cut the instance is a graph and the solution is a cut in the graph; and so on and so forth. It turns out that every such search problem can be reduced to 3SAT.

15.1 THE CLASS NP To make the above precise, we will make the following mathematical definition. We define the class NP to contain all Boolean functions that correspond to a search problem of the form above. That is, a Boolean function is in NP if has the form that on input a string , if and only if there exists a “solution” string such that the pair satisfies퐹 some polynomial-time퐹 checkable condition.푥 Formally,퐹NP (푥)is = 1 defined as follows: 푤 (푥, 푤)

Compiled on 8.26.2020 18:10 478 introduction to theoretical computer science

Figure 15.1: Overview of the results of this chapter. We define NP to contain all decision problems for which a solution can be efficiently verified. The main result of this chapter is the Cook Levin Theorem (The- orem 15.6) which states that SAT has a polynomial- time algorithm if and only if every problem in NP has a polynomial-time algorithm.3 Another way to state this theorem is that SAT is NP complete. We will prove the Cook-Levin theorem by defining the two intermediate problems3 NANDSAT and NAND, proving that NANDSAT is NP complete, and then proving that NANDSAT NAND SAT3 .

≤푝 3 ≤푝 3

Definition 15.1 — NP. We say that is in NP if there exists some integer and ∗ such that P and for every , 퐹 ∶ {0, 1}∗ → {0, 1} 푎 >푛 0 푉 ∶ {0, 1} → {0, 1} 푉 ∈ 푥 ∈ {0, 1} s.t. (15.1) 푛푎 푤∈{0,1} 퐹 (푥) = 1 ⇔ ∃ 푉 (푥푤) = 1 . In other words, for to be in NP, there needs to exist some polynomial-time computable verification function , such that if Figure 15.2: The class NP corresponds to problems then there must퐹 exist (of length polynomial in ) such where solutions can be efficiently verified. That is, this 푉 is the class of functions such that if there that , and if then for every such , . is a “solution” of length polynomial in that can Since퐹 (푥) =the 1 existence of this string푤 certifies that , |푥|is often be verified by a polynomial-time퐹 퐹(푥)algorithm = 1. referred푉 (푥푤) to as = a 1certificate퐹, (푥)witness = 0, or proof that 푤 .푉 (푥푤) = 0 푤 |푥| 푉 See also Fig. 15.2 for an illustration푤 of Definition퐹 (푥) =15.1. 1 The푤 name NP stands for “nondeterministic polynomial time”퐹 (푥) and = 1 is used for historical reasons; see the bibiographical notes. The string in (15.1) is sometimes known as a solution, certificate, or witness for the instance . 푤 Solved Exercise 15.1 — Alternative definition of NP. Show that the condition 푥 that in Definition 15.1 can be replaced by the condition that 푎 for some polynomial . That is, prove that for every |푤| = |푥| , NP if and only if there is a polynomial- time|푤| Turing ≤∗ 푝(|푥|) machine and a polynomial푝 such that for 퐹every ∶ {0, 1} → {0, 1} 퐹 ∈ if and only if there exists with such∗ that푉 . 푝 ∶ ℕ → ℕ ∗ 푥 ∈ {0, 1} 퐹 (푥) = 1 푤 ∈ {0, 1} ■ |푤| ≤ 푝(|푥|) 푉 (푥, 푤) = 1 , np completeness, and the cook-levin theorem 479

Solution: The “only if” direction (namely that if NP then there is an algorithm and a polynomial as above) follows immediately from Definition 15.1 by letting 퐹 ∈ . For the “if” direc- tion, the idea푉 is that if a string 푝 is of size at most푎 for degree polynomial , then there is some푝(푛)such = that 푛 for all , . Hence we can encode푤 by a string푝(푛) of exactly length 0 0 푑 by padding푑+1푝 it with and an appropriate푛 number푛 of zeroes. > 푛 |푤|Hence푑+1 < if there 푛 is an algorithm and polynomial푤 as above, then 푛we can define an algorithm1 that does the following on input with and ′푉: 푝 ′ ′ 푉 푎 푥,• 푤 If |푥| =then 푛 |푤 | = 푛ignores and enumerates over all of length at most ′ and′ outputs if′ there exists such that 0 푛 ≤ 푛 . (Since푉 (푥, 푤 ) , this only푤 takes a constant number of푤 steps.) 푝(푛) 1 푤 0 푉 (푥, 푤) = 1 푛 < 푛 • If then “strips out” the padding by dropping all the rightmost zeroes′ from′ until it reaches out the first 0 (which푛 > it 푛drops as푉 well)(푥, 푤 and) obtains a string . If then outputs . 푤 1 ′ 푤 |푤| ≤ 푝(푛) Since푉 runs in polynomial푉 (푥, 푤) time, runs in polynomial time as well, and by definition for every , there′ exists 푎 such that푉 if and only if푉 there exists ′ with|푥| ′such′ that . 푥 푤 ∈ {0,∗ 1} 푉 (푥푤 ) = 1 푤 ∈ {0, 1} ■ |푤| ≤ 푝(|푥|) 푉 (푥푤) = 1 The definition of NP means that for every NP and string , if and only if there is a short and efficiently verifiable proof∗ of this fact. That is, we can think퐹 of ∈ the function in Definition푥 ∈ {0, 1} 퐹15.1 (푥)as a =verifier 1 algorithm, similar to what we’ve seen in Section 11.1. The verifier checks whether a given string 푉 is a valid proof for the statement “ ”. Essentially all proof systems∗ considered in mathematics involve line-by-line checks that푤 ∈ can {0, be 1} car- ried out in polynomial time. Thus퐹 (푥) the = heart1 of NP is asking for state- ments that have short (i.e., polynomial in the size of the statements) proof. Indeed, as we will see in Chapter 16, Kurt Gödel phrased the question of whether NP P as asking whether “the mental work of a mathematician [in proving theorems] could be completely replaced by a machine”. =

R Remark 15.2 — NP not (necessarily) closed under com- plement. Definition 15.1 is asymmetric in the sense that 480 introduction to theoretical computer science

there is a difference between an output of and an output of . You should make sure you understand why this definition does not guarantee that1 if NP then the function0 (i.e., the map ) is in NP as well. 퐹 ∈ In fact, it is believed1 − that 퐹 there do exist푥 functions ↦ 1 − 퐹 (푥) such that NP but NP. For example, as shown below, SAT NP, but the function SAT퐹 that on input a퐹 3CNF ∈ formula1 −outputs 퐹 ∉ if and only if is not satisfiable3 is not∈ known (nor believed)3 to bein NP. This is in contrast to the휑 class P 1which does satisfy휑 that if P then is in P as well.

퐹 ∈ 1 − 퐹 15.1.1 Examples of functions in NP We now present some examples of functions that are in the class NP. We start with the canonical example of the SAT function.

■ Example 15.3 — NP. SAT3is in NP since for every - variable formula , SAT if and only if there exists a satisfying assignment3푆퐴푇 ∈ 3such that , and weℓ can check this condition휑 3 in(휑) polynomial =ℓ time. 1 The above reasoning푥 explains ∈ {0, why 1} SAT is in휑(푥)NP, but = since 1 this is our first example, we will now belabor the point and expand out in full formality the precise representation3 of the witness and the algorithm that demonstrate that SAT is in NP. Since demon- strating that functions are in NP is fairly straightforward,푤 in future cases we will푉 not use as much detail,3 and the reader can also feel free to skip the rest of this example. Using Solved Exercise 15.1, it is OK if witness is of size at most polynomial in the input length , rather than of precisely size for some integer . Specifically, we can represent푎 a 3CNF formula with variables and푛 clauses as a string of length 푛 log푎, since > every 0 one of the clauses involves three variables휑 and their푘 negation, and푚 the identity of each variable can 푛be represented = 푂(푚 using푘) log . We assume that푚 every variable par- ticipates in some clause (as otherwise it can be ignored) and hence 2 that , which in⌈ particular푘⌉ means that the input length is at least as large as and . We푚 can ≥ represent 푘 an assignment to the variables using a 푛- length string .푚 The following푘 algorithm checks whether a given satisfies the formula : 푘 푘 푤 푤 휑 np, np completeness, and the cook-levin theorem 481

Algorithm 15.4 — Verifier for . Input: 3CNF formula on variables and with 3푆퐴푇 clauses, string Output: iff satisfies휑 푘푘 푚 1: for do 푤 ∈ {0, 1} 2: Let1 푤 be휑 the -th clause of 3: if푗 ∈ [푚]violates all three literals then 1 2 푗 4: returnℓ ∨ ℓ ∨ ℓ 푗 휑 5: end푤 if 6: end for 0 7: return Algorithm 15.4 takes time to enumerate over all clauses, 1 and will return if and only if satisfies all the clauses. 푂(푚) Here are some more1 examples푦 for problems in NP. For each one of these problems we merely sketch how the witness is represented and why it is efficiently checkable, but working out the details can bea good way to get more comfortable with Definition 15.1:

• QUADEQ is in NP since for every -variable instance of quadratic equations , QUADEQ if and only if there exists an assign- ment that satisfies . Weℓ can check the condition that satisfies 퐸 in polynomialℓ (퐸) time = 1 by enumerating over all the equa- tions in푥 ∈, {0, and 1} for each such equation퐸 , plug in the values of and verify푥 that퐸 is satisfied. 퐸 푒 푥 • ISET is in NP푒 since for every graph and integer , ISET if and only if there exists a set of vertices that contains no pair of neighbors in . We can check퐺 the condition푘 that (퐺,is an 푘) = 1independent set of size in polynomial푆 푘 time by first checking that and then퐺 enumerating over all edges 푆in , and for each such edge verify≥ that 푘 either or . |푆| ≥ 푘 {푢, 푣} 퐺 • LONGPATH is in NP since for every푢 graph ∉ 푆 푣and ∉ 푆 integer , LONGPATH if and only if there exists a simple path in that is of length at least . We can check퐺 the condition푘 that is a simple path(퐺, 푘)of length = 1 in polynomial time by checking that푃 it has퐺 the form where푘 each is a vertex in , no is푃 repeated, and for every 푘 , the edge is present in the 0 1 푘 푖 푖 graph. (푣 , 푣 , … , 푣 ) 푣 퐺 푣 푖 푖+1 푖 ∈ [푘] {푣 , 푣 } • MAXCUT is in NP since for every graph and integer , MAXCUT if and only if there exists a cut in that cuts at least edges. We can check that condition퐺 that 푘 is a cut of value(퐺, at 푘) least = 1 in polynomial time by checking(푆, that푆) 퐺is a 푘 (푆, 푆) 푘 푆 482 introduction to theoretical computer science

subset of ’s vertices and enumerating over all the edges of , counting those edges such that and or vice versa. 퐺 {푢, 푣} 15.1.2퐺 Basic facts about NP 푢 ∈ 푆 푣 ∉ 푆 The definition of NP is one of the most important definitions of this book, and is worth while taking the time to digest and internalize. The following solved exercises establish some basic properties of this class. As usual, I highly recommend that you try to work out the solutions yourself. Solved Exercise 15.2 — Verifying is no harder than solving. Prove that P NP. ■ ⊆ Solution: Suppose that P. Define the following function : iff and .( outputs on all other inputs.) Since푛 P we can clearly퐹 ∈ compute in polynomial time as푉 well.푉 (푥0 ) = 1 Let푛 = |푥| 퐹be (푥) some = 1 string.푉 If 0 then . On 퐹the ∈ other hand, if푛 then푉 for every , 푛 . Therefore,푥 ∈ setting{0, 1} , we see퐹 that(푥) = 1satisfies푉 (푥0푛 (15.1)), = and 1 es- tablishes that 퐹NP (푥). = 0 푤 ∈ {0, 1} 푉 (푥푤) = 0 푎 = 푏 = 1 푉 ■ 퐹 ∈

R Remark 15.5 — NP does not mean non-polynomial!. People sometimes think that NP stands for “non poly- nomial time”. As Solved Exercise 15.2 shows, this is far from the truth, and in fact every polynomial-time computable function is in NP as well. If is in NP it certainly does not mean that is hard to compute (though it does not, as far as we know, necessarily퐹 mean that it’s easy to compute either).퐹 Rather, it means that is easy to verify, in the technical sense of Definition 15.1. 퐹

Solved Exercise 15.3 — NP is in exponential time. Prove that NP EXP. ■ ⊆ Solution: Suppose that NP and let be the polynomial-time com- putable function that satisfies (15.1) and the corresponding constant. Then given퐹 ∈ every 푉 , we can check whether in time 푎 푛 by enumerating over 푎 푎+1 all the strings 푥푛 and ∈ checking {0, 1}푛 whether , 푎 푎 퐹in (푥) which =푛 case 1 we return푝표푙푦(푛). If ⋅푛 2 = 표(2 for) every such then we return 2. By construction,푤 ∈ {0, the 1} algorithm above will run푉 in (푥푤) time = at 1 1 푉 (푥푤) = 0 푤 0 np, np completeness, and the cook-levin theorem 483

most exponential in its input length and by the definition of NP it will return for every . ■ Solved Exercise퐹 (푥) 15.2 and Solved푥 Exercise 15.3 together imply that

P NP EXP (15.2)

⊆ ⊆ . The time hierarchy theorem (Theorem 13.9) implies that P EXP and hence at least one of the two inclusions P NP or NP EXP is strict. It is believed that both of them are in fact strict inclusions.⊊ That is, it is believed that there are functions in⊆NP that cannot⊆ be computed in polynomial time (this is the P NP conjecture) and that there are functions in EXP for which we cannot even effi- ciently certify that for a given input≠ . One function that is believed to lie in EXP퐹 NP is the function SAT defined as SAT SAT퐹 (푥) =for 1⧵ every 3CNF formula푥 . The conjecture퐹 that SAT NP is known as the “NP co NP3” conjecture. It 3implies(휑) the =P 1 −NP 3 conjecture(휑) (see Exercise 15.2). 휑 We3 have∉ previously informally equated≠ the− notion of with being “no harder≠ than ” and in particular have seen in Solved 푝 Exercise 14.1 that if P and , then P as퐹 well. ≤ The퐺 following퐹 exercise shows퐺 that if then it is also “no harder to 푝 verify” than . That퐺 is, ∈ regardless퐹 of≤ whether퐺 or퐹 not ∈ it is in P, if has 푝 the property that solutions to it can퐹 ≤ be퐺 efficiently verified, then so does . 퐺 퐺 Solved Exercise 15.4 — Reductions and NP. Let . 퐹 Show that if and NP then NP. ∗ 퐹 , 퐺 ∶ {0, 1} → {0, 1} ■ 푝 퐹 ≤ 퐺 퐺 ∈ 퐹 ∈ Solution: Suppose that is in NP and in particular there exists and P such that for every , . Suppose also that퐺 ∗ and so in particular|푦| there푎 푎 is푉 a ∈- 푤∈{0,1} time computable function푦 ∈ {0, 1}such퐺(푦) that = 1 ⇔ ∃ 푉 (푦푤)for = all 1푏 푝 . Define퐹to ≤ be a Turing퐺 Machine that on input a pair푛 computes∗ ′ 푅 and returns퐹 (푥)if and = only 퐺(푅(푥)) if 푥and ∈ {0, 1} . Then푉 runs in polynomial time, and for every푎 (푥, 푤) , 푦 = 푅(푥)iff′ there exists 1of size |푤|which = is |푦| at most푉 polynomial (푦푤)∗ = in1 such푉 that , hence demonstrating푎 푥that ∈ {0,NP 1} . 퐹 (푥) = 1 ′ 푤 |푅(푥)| |푥| 푉 (푥, 푤) = 1 ■ 퐹 ∈ 484 introduction to theoretical computer science

15.2 FROM NP TO 3SAT: THE COOK-LEVIN THEOREM We have seen several examples of problems for which we do not know if their best algorithm is polynomial or exponential, but we can show that they are in NP. That is, we don’t know if they are easy to solve, but we do know that it is easy to verify a given solution. There are many, many, many, more examples of interesting functions we would like to compute that are easily shown to be in NP. What is quite amazing is that if we can solve 3SAT then we can solve all of them! The following is one of the most fundamental theorems in Com- puter Science:

Theorem 15.6 — Cook-Levin Theorem. For every NP, SAT.

푝 We will soon show the proof of Theorem 15.6퐹 ∈, but note퐹 ≤ that3 it im- mediately implies that QUADEQ, LONGPATH, and MAXCUT all reduce to SAT. Combining it with the reductions we’ve seen in Chap- ter 14, it implies that all these problems are equivalent! For example, to reduce 3QUADEQ to LONGPATH, we can first reduce QUADEQ to SAT using Theorem 15.6 and use the reduction we’ve seen in Theo- rem 14.10 from SAT to LONGPATH. That is, since QUADEQ NP, Theorem3 15.6 implies that QUADEQ SAT, and Theorem 14.10 implies that SAT3 LONGPATH, which by the transitivity of∈ reduc- 푝 tions (Solved Exercise 14.2) means that≤ QUADEQ3 LONGPATH. 푝 Similarly, since3 LONGPATH≤ NP, we can use Theorem 15.6 and 푝 Theorem 14.4 to show that LONGPATH SAT ≤ QUADEQ, concluding that LONGPATH∈and QUADEQ are computationally 푝 푝 equivalent. ≤ 3 ≤ There is of course nothing special about QUADEQ and LONGPATH here: by combining (15.6) with the reductions we saw, we see that just like SAT, every NP reduces to LONGPATH, and the same is true for QUADEQ and MAXCUT. All these problems are in some sense “the3 hardest in NP퐹” ∈ since an efficient algorithm for any one of them would imply an efficient algorithm for all the problems in NP. This motivates the following definition:

Definition 15.7 — NP-hardness and NP-completeness. Let . We say that is NP hard if for every NP, . ∗ We say that is NP complete if is NP hard and퐺 ∶NP {0,. 1} → 푝 {0, 1} 퐺 퐹 ∈ 퐹 ≤ 퐺 The Cook-Levin퐺 Theorem (Theorem퐺 15.6) can be rephrased퐺 ∈ as saying that SAT is NP hard, and since it is also in NP, this means that SAT is NP complete. Together with the reductions of Chapter 14, Theorem 15.63 shows that despite their superficial differences, 3SAT, quadratic3 equations, longest path, independent set, and maximum np, np completeness, and the cook-levin theorem 485

cut, are all NP-complete. Many thousands of additional problems have been shown to be NP-complete, arising from all the sciences, mathematics, economics, engineering and many other fields. (For a few examples, see this Wikipedia page and this website.)

 Big Idea 22 If a single NP-complete has a polynomial-time algo- rithm, then there is such an algorithm for every decision problem that corresponds to the existence of an efficiently-verifiable solution.

15.2.1 What does this mean? As we’ve seen in Solved Exercise 15.2, P NP. The most famous con- jecture in Computer Science is that this containment is strict. That is, it is widely conjectured that P NP. One⊆ way to refute the conjec- Figure 15.3: The world if P NP (left) and P NP (right). In the former case the set of NP-complete ture that P NP is to give a polynomial-time algorithm for even a problems is disjoint from P≠and Ladner’s theorem= single one of the NP-complete≠ problems such as 3SAT, Max Cut, or shows that there exist problems that are neither in the thousands≠ of others that have been studied in all fields of human P nor are NP-complete. (There are remarkably few natural candidates for such problems, with some endeavors. The fact that these problems have been studied by so many prominent examples being decision variants of people, and yet not a single polynomial-time algorithm for any of problems such as integer factoring, lattice shortest vector, and finding Nash equilibria.) In the latter case them has been found, supports that conjecture that indeed P NP. In that P NP the notion of NP-completeness loses its fact, for many of these problems (including all the ones we mentioned meaning, as essentially all functions in P (save for the above), we don’t even know of a -time algorithm! However,≠ to the trivial constant= zero and constant one functions) are NP-complete. frustration of computer scientists,표(푛) we have not yet been able to prove that P NP or even rule out the2 existence of an -time algorithm for 3SAT. Resolving whether or not P NP is known as the P vs NP problem≠. A million-dollar prize has been offered푂(푛)for the solution of this problem, a popular book has been= written, and every year a new paper comes out claiming a proof of P NP or P NP, only to wither under scrutiny. One of the mysteries of computation= is that people≠ have observed a certain empirical “zero-one law” or “dichotomy” in the computational Figure 15.4: A rough illustration of the (conjectured) complexity of natural problems, in the sense that many natural prob- status of problems in exponential time. Darker colors lems are either in P (often in TIME or TIME ), or they correspond to higher running time, and the circle in the middle is the problems in P. NP is a (conjectured are are NP hard. This is related to the fact that for most2 natural prob- to be proper) superclass of P and the NP-complete lems, the best known algorithm is either(푂(푛)) exponential(푂(푛 or polynomial,)) problems (or NPC for short) are the “hardest” prob- with not too many examples where the best running time is some lems in NP, in the sense that a solution for one of log them implies a solution for all other problems in NP. strange intermediate complexity such as . However, it is be- √ 푛 It is conjectured that all the NP-complete problems lieved that there exist problems in NP that2 are neither in P nor are NP- require at least exp time to solve for a constant , and many require exp time. The per- complete, and in fact a result known as “Ladner’s2 Theorem” shows 휖 manent is not believed(푛 ) to be contained in NP though that if P NP then this is indeed the case (see also Exercise 15.1 and 휖it is>NP 0 -hard, which means that(Ω(푛)) a polynomial-time Fig. 15.3). algorithm for it implies that P NP. ≠ = 486 introduction to theoretical computer science

15.2.2 The Cook-Levin Theorem: Proof outline We will now prove the Cook-Levin Theorem, which is the underpin- ning to a great web of reductions from 3SAT to thousands of problems across many great fields. Some problems that have been shown tobe NP-complete include: minimum-energy protein folding, minimum surface-area foam configuration, map coloring, optimal Nash equi- librium, quantum state entanglement, minimum supersequence of a genome, minimum codeword problem, shortest vector in a lattice, minimum genus knots, positive Diophantine equations, integer pro- gramming, and many many more. The worst-case complexity of all these problems is (up to polynomial factors) equivalent to that of 3SAT, and through the Cook-Levin Theorem, to all problems in NP. To prove Theorem 15.6 we need to show that SAT for every NP. We will do so in three stages. We define two intermediate 푝 problems: NANDSAT and NAND. We will shortly퐹 ≤ show3 the def- 퐹initions ∈ of these two problems, but Theorem 15.6 will follow from combining the following three3 results:

1. NANDSAT is NP hard (Lemma 15.8).

2. NANDSAT NAND (Lemma 15.9).

푝 3. NAND ≤SAT3 (Lemma 15.10).

푝 By3 the transitivity≤ 3 of reductions, it will follow that for every NP, 퐹 ∈ NANDSAT NAND SAT (15.3)

hence establishing푝 Theorem 15.6.푝 푝 퐹 ≤ ≤ 3 ≤ 3 We will prove these three results Lemma 15.8, Lemma 15.9 and Lemma 15.10 one by one, providing the requisite definitions as we go along.

15.3 THE NANDSAT PROBLEM, AND WHY IT IS NP HARD The function NANDSAT is defined as follows: ∗ • The input to NANDSAT∶ {0,is 1}a string→ {0, 1}representing a NAND-CIRC program (or equivalently, a circuit with NAND gates). 푄 • The output of NANDSAT on input is if and only if there exists a string (where is the number of inputs to ) such that . 푛 푄 1 푤 ∈ {0, 1} 푛 푄 Solved푄(푤) Exercise = 1 15.5 — NP. Prove that NANDSAT NP. ■ 푁퐴푁퐷푆퐴푇 ∈ ∈ np, np completeness, and the cook-levin theorem 487

Solution: We have seen that the circuit (or straightline program) evalua- tion problem can be computed in polynomial time. Specifically, given a NAND-CIRC program of lines and inputs, and , we can evaluate on the input in time which is polynomial in 푛and hence verify푄 whether푠 or not푛 . 푤 ∈ {0, 1} 푄 푤 ■ 푠 푄(푤) = 1 We now prove that NANDSAT is NP hard. Lemma 15.8 NANDSAT is NP hard.

Proof Idea:

The proof closely follows the proof that P P poly (Theorem 13.12 , see also Section 13.6.2). Specifically, if NP then there is a poly- / nomial time Turing machine and positive integer⊆ such that for every , iff there is퐹 ∈some such that 푎 . The푛 proof that P푀 P poly gave us way (via푎 “unrolling푛 the loop”)푥 to ∈ come {0, 1} up퐹 in (푥) polynomial = 1 time with a푤 Boolean ∈ {0, 1} circuit on / 푀(푥푤)inputs that = 1 computes the function⊆ . We can then translate푎 into an equivalent NAND circuit (or NAND-CIRC program)퐶 .푛 We see that there is a string 푤 ↦such 푀(푥푤) that if and only if 푎 퐶there is such satisfying 푛 which (by definition) happens푄 if and only if . Hence푤 ∈{0, the 1} translation of 푄(푤)into =the 1 circuit is a reduction showing푤 NANDSAT푀(푥푤) = 1. 퐹 (푥) = 1 푥 푄 푝 퐹 ≤ ⋆ P The proof is a little bit technical but ultimately follows quite directly from the definition of NP, as well as the ability to “unroll the loop” of NAND-TM programs as discussed in Section 13.6.2. If you find it confusing, try to pause here and think how you would implement in your favorite programming language the function unroll which on input a NAND-TM program and numbers outputs an -input NAND-CIRC program of lines such that for every input푃 , if푇 , 푛halts on within푛 at most steps and outputs 푄, then푛 푂(|푇 |) . 푧 ∈ {0, 1} 푃 푧 푇 푦 푄(푧) = 푦

Proof of Lemma 15.8. Let NP. To prove Lemma 15.8 we need to give a polynomial-time computable function that will map every to a NAND-CIRC퐹 program ∈ such that NANDSAT ∗ . Let∗ be such a string and let be its length.푥 By∈ Definition{0, 1} ∗ 15.1 there∗ exists P and푄 positive 퐹 (푥)∗such = that (푄) if and only푥 ∈ if {0, there 1} exists satisfying푛 = |푥 | . ∗ 푎 푉 ∈ 푛 푎ℕ ∗ 퐹 (푥 ) = 1 푤 ∈ {0, 1} 푉 (푥 푤) = 1 488 introduction to theoretical computer science

Let . Since P there is some NAND-TM program that computes on푎 inputs of the form with and ∗ in at most푚 = 푛 time푉 ∈ for some constant . Using our푛 “unrolling푃 푚 the loop NAND-TM푉 푐 to NAND compiler”푥푤 of푥Theorem ∈ {0, 1} 13.14,푤 we ∈ can{0, 1} obtain a NAND-CIRC(푛 + 푚) program that has푐 inputs and at most lines such that ′ for every and 2푐 . ′ 푄 ∗ 푛 + 푚 푛 푂((푛We + can 푚) then) 푚 use a simple “hardwiring”푄 (푥푤) = 푃 (푥푤) technique, reminiscent푥 ∈ {0, 1} of Remark푤 ∈ 9.11 {0, 1}to map into a circuit/NAND-CIRC program on inputs such that ′ for every . CLAIM: There is a푄 polynomial-time′ ∗ algorithm that푚 on input푄 a 푚 NAND-CIRC program푄(푤) = 푄on(푥 푤) inputs푤 and ∈ {0, 1} , outputs a NAND-CIRC program ′ such that for every ∗ ,푛 . 푄 푛 + 푚 푥 ∈ {0, 1}푛 ′ PROOF∗ OF CLAIM: We푄 can do so by adding푤 a ∈ few {0, lines 1} 푄(푤) to ensure = 푄that(푥 the푤) variables zero and one are and respectively, and then simply replacing any reference in to an input with the corresponding value based on . See0′ Fig.1 15.5 for an implementation 푖 of this reduction in Python. ∗ 푄 푥 푖 ∈ [푛] 푖 Our final reduction maps푥 an input , into the NAND-CIRC pro- gram obtained above. By the above discussion,∗ this reduction runs in polynomial time. Since we know that푥 if and only if there exists 푄 such that , this∗ means that if and only if NANDSAT푚 ∗, which∗ is what퐹 (푥 ) we = 1wanted to prove.∗ 푤 ∈ {0, 1} 푃 (푥 푤) = 1 퐹 (푥 ) = 1 ■ (푄) = 1

Figure 15.5: Given an -line NAND-CIRC program that has inputs and some ,

we can transform into푇 a line∗ NAND-CIRC푛 푄program 푛that + 푚 computes the map푥 ∈ {0, 1} for ′ by푄 simply푇 adding + 3 code to compute∗ zero one the and푄 푚 constants, replacing푤 all ↦ references 푄(푥 푤) to X[ ]푤with ∈ {0, either 1} zero or one depending on the value of , and then replacing the remaining references X[ ] X[ ] to 푖 ∗ with . Above is Python code that implements푥푖 this transformation, as well as an example of its푗 execution푗 on − a 푛 simple program.

15.4 THE NAND PROBLEM The NAND problem is defined as follows: 3 • The3 input is a logical formula on a set of variables which is an AND of constraints of the form NAND . 0 푟−1 Ψ 푧 , … , 푧 푖 푗 푘 푧 = (푧 , 푧 ) np, np completeness, and the cook-levin theorem 489

• The output is if and only if there is an input that satisfies all of the constraints. 푟 1 푧 ∈ {0, 1} For example, the following is a NAND formula with variables and constraints: 3 5 3 NAND NAND NAND (15.4) 3 0 2 1 0 2 4 3 1 Ψ =In (푧 this= case NAND(푧 , 푧 ))∧(푧 =since the(푧 assignment, 푧 ))∧(푧 = (푧satisfies, 푧 )) . it. Given a NAND formula on variables and an assignment , we can3 check in(Ψ) polynomial = 1 time whether 푧 = 01010, and hence NAND푟 NP3 . We now proveΨ that푟 NAND is NP hard: 푧 ∈ {0, 1} Ψ(푧) = 1 Lemma 15.9 NANDSAT NAND. 3 ∈ 3 푝 Proof Idea: ≤ 3 To prove Lemma 15.9 we need to give a polynomial-time map from every NAND-CIRC program to a 3NAND formula such that there exists such that if and only if there exists satisfying . For every line of , we define푄 a corresponding variableΨ of . If the line푤 has the form푄(푤)foo = 1 = NAND(bar,blah) then we푧 will add theΨ 푖 clause NAND푖 푄 where and are the last lines푧 in whichΨ bar and blah푖 were written to. We will also set variables corresponding 푖 푗 푘 to the input푧 = variables,(푧 , as푧 ) well as add푗 a clause푘 to ensure that the final output is . The resulting reduction can be implemented in about a dozen lines of Python, see Fig. 15.6. 1

⋆ Figure 15.6: Python code to reduce an instance of NANDSAT to an instance of NAND. In the exam- ple above we transform the NAND-CIRC program푄 xor5 which has input variablesΨ 3 and lines, into a NAND formula that has variables and clauses. Since xor55 outputs on the input16 , there3 exists an assignmentΨ 24 to such20 that 1 and24 evaluates1, 0, 0, 1, to 1 true on . 푧 ∈ {0, 1} Ψ (푧0, 푧1, 푧2, 푧3, 푧4) = (1, 0, 0, 1, 1) Ψ 푧 490 introduction to theoretical computer science

Proof of Lemma 15.9. To prove Lemma 15.9 we need to give a reduction from NANDSAT to NAND. Let be a NAND-CIRC program with inputs, one output, and lines. We can assume without loss of generality that contains3 the variables푄 one and zero as usual. 푛 We map to a NAND푚formula as follows: 푄 • has 푄 variables3 Ψ .

0 푚+푛−1 • TheΨ first푚 +variables 푛 푧 , … , 푧 will corresponds to the inputs of . The next variables will correspond to the lines 0 푛−1 of . 푛 푧 , … , 푧 푄 푛 푛+푚−1 푚 푧 , … , 푧 푚 • For푄 every , if the -th line of the program is foo = NAND(bar,blah) then we add to the constraint NAND ℓ ∈ {푛,where 푛 + 1, … , 푛and + 푚} correspondℓ − 푛 to the last lines ℓ in푄 which the variables bar and blah (respectively)Ψ were written푧 = to. 푗 푘 If one or(푧 both, 푧 ) of bar and푗 − 푛blah was푘 − not 푛 written to before then we use instead of the corresponding value or in the constraint, where is the line in which zero is assigned a value. If one or ℓ0 푗 푘 both푧 of bar and blah is an input variable X[i]푧 then푧 we use in the 0 constraint.ℓ − 푛 푖 푧 • Let be the last line in which the output y_0 is assigned a value. Then∗ we add the constraint NAND where is as aboveℓ the last line in which zero∗ is assigned a value. Note that this ℓ ℓ0 ℓ0 0 is effectively the constraint 푧 =NAND (푧 , 푧 ). ℓ − 푛

∗ ℓ To complete the proof we need푧 = to show(0, that 0) there = 1 exists s.t. if and only if there exists that satisfies all푛 constraints in . We now show both sides of this푛+푚 equivalence.푤 ∈ {0, 1} Part푄(푤) I: Completeness.= 1 Suppose that there푧 ∈ {0, is 1} s.t. . Let Ψ be defined as follows: for , 푛 and for 푛+푚 equals the value푤 that ∈ {0, is 1} assigned푄(푤) in = 푖 푖 1the 푧 ∈-th {0, 1}line of when executed on . Then푖 ∈ by [푛] construction푧 = 푤 푖 satisfies푖 ∈ {푛, 푛all + of1, …the , 푛 +constraints 푚} 푧 of (including the constraint that (푖NAND − 푛) 푄since .) 푤 푧 ∗ Part II: Soundness. Suppose thatΨ there exists satisfy- ℓ 푧ing= . Soundness(0, 0) =will 1 follow푄(푤) by showing = 1 that 푛+푚 (and hence in particular there exists , namely푧 ∈ {0, 1} , 0 푛−1 suchΨ that ). To do this we will prove푛 푄(푧 the following, … , 푧 ) claim = 1 0 푛−1 : for every , equals푤 ∈ the {0, value 1} assigned푤 in = the 푧 ⋯-th 푧 step of the execution푄(푤) = of 1the program on . Note that because ℓ+푛 (∗)satisfies the ℓconstraints ∈ [푚] 푧 of , is sufficient to prove theℓ soundness 0 푛−1 condition since these constraints푄 imply푧 that, … , the푧 last value assigned to푧 the variable y_0 in the executionΨ (∗) of on is equal to . To prove suppose, towards a contradiction, that it is false, and let be 0 푛−1 푄 푧 ⋯ 푤 1 (∗) ℓ np, np completeness, and the cook-levin theorem 491

the smallest number such that z_{ℓ+n} is not equal to the value assigned in the ℓ-th step of the execution of Q on z_0, …, z_{n−1}. But since z satisfies the constraints of Ψ, we get that z_{ℓ+n} = NAND(z_i, z_j) where (by the assumption above that ℓ is smallest with this property) these values do correspond to the values last assigned to the variables on the righthand side of the assignment operator in the ℓ-th line of the program. But this means that the value assigned in the ℓ-th step is indeed simply the NAND of z_i and z_j, contradicting our assumption on the choice of ℓ. ■

15.5 FROM 3NAND TO 3SAT

The final step in the proof of Theorem 15.6 is the following:

Lemma 15.10 3NAND ≤_p 3SAT.

Proof Idea: To prove Lemma 15.10 we need to map a 3NAND formula φ into a 3SAT formula ψ such that φ is satisfiable if and only if ψ is. The idea is that we can transform every 3NAND constraint of the form a = NAND(b, c) into the AND of ORs involving the variables a, b, c and their negations, where each of the ORs contains at most three terms. The construction is fairly straightforward, and the details are given below.

Figure 15.7 (caption): A 3NAND instance that is obtained by taking a NAND-TM program for computing the AND function, unrolling it to obtain a NANDSAT instance, and then composing it with the reduction of Lemma 15.9.

⋆

P It is a good exercise for you to try to find a 3CNF formula ξ on three variables a, b, c such that ξ(a, b, c) is true if and only if a = NAND(b, c). Once you do so, try to see why this implies a reduction from 3NAND to 3SAT, and hence completes the proof of Lemma 15.10.

Figure 15.8 (caption): Code and example output for the reduction given in Lemma 15.10 of 3NAND to 3SAT.


Proof of Lemma 15.10. The constraint

z_i = NAND(z_j, z_k)   (15.5)

is satisfied if z_i = 1 whenever (z_j, z_k) ≠ (1, 1). By going through all cases, we can verify that (15.5) is equivalent to the constraint

(z̄_i ∨ z̄_j ∨ z̄_k) ∧ (z_i ∨ z_j) ∧ (z_i ∨ z_k).   (15.6)

Indeed if z_j = z_k = 1 then the first constraint of Eq. (15.6) is only true if z_i = 0. On the other hand, if either of z_j or z_k equals 0 then unless z_i = 1 either the second or the third constraint will fail. This means that, given any 3NAND formula φ over n variables z_0, …, z_{n−1}, we can obtain a 3SAT formula ψ over the same variables by replacing every 3NAND constraint of φ with three OR constraints as in Eq. (15.6).¹ Because of the equivalence of (15.5) and (15.6), the formula ψ satisfies that ψ(z_0, …, z_{n−1}) = φ(z_0, …, z_{n−1}) for every assignment z_0, …, z_{n−1} ∈ {0,1}. In particular ψ is satisfiable if and only if φ is, thus completing the proof. ■

¹ The resulting formula will have some of the ORs involving only two variables. If we wanted to insist on each formula involving three distinct variables we can always add a "dummy variable" z_{n+m} and include it in all the ORs involving only two variables, and add a constraint requiring this dummy variable to be zero.
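The case analysis above can also be checked mechanically; here is a short Python verification (my addition, not from the book) that the constraint (15.5) and the three clauses of Eq. (15.6) agree on all eight assignments:

# Exhaustive check that z_i = NAND(z_j, z_k) is equivalent to Eq. (15.6).
from itertools import product

for zi, zj, zk in product([False, True], repeat=3):
    nand_constraint = (zi == (not (zj and zk)))      # Eq. (15.5)
    cnf_constraint = ((not zi or not zj or not zk)   # first clause of Eq. (15.6)
                      and (zi or zj)                 # second clause
                      and (zi or zk))                # third clause
    assert nand_constraint == cnf_constraint
print("Eq. (15.5) and Eq. (15.6) agree on all 8 assignments")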

Figure 15.9 (caption): An instance of the independent set problem obtained by applying the reductions NANDSAT ≤_p 3NAND ≤_p 3SAT ≤_p ISET, starting with the xor5 NAND-CIRC program.

15.6 WRAPPING UP

We have shown that for every function F in NP, F ≤_p NANDSAT ≤_p 3NAND ≤_p 3SAT, and so 3SAT is NP-hard. Since in Chapter 14 we saw that 3SAT ≤_p QUADEQ, 3SAT ≤_p ISET, 3SAT ≤_p MAXCUT and 3SAT ≤_p LONGPATH, all these problems are NP-hard as well. Finally, since all the aforementioned problems are in NP, they are all in fact NP-complete and have equivalent complexity. There are thousands of other natural problems that are NP-complete as well. Finding a polynomial-time algorithm for any one of them will imply a polynomial-time algorithm for all of them.

Figure 15.10 (caption): We believe that P ≠ NP and all NP-complete problems lie outside of P, but we cannot rule out the possibility that P = NP. However, we can rule out the possibility that some NP-complete problems are in P and others are not, since we know that if even one NP-complete problem is in P then P = NP. The relation between P/poly and NP is not known, though it can be shown that if one NP-complete problem is in P/poly then NP ⊆ P/poly.

Chapter Recap

• Many of the problems for which we don't know polynomial-time algorithms are NP-complete, which means that finding a polynomial-time algorithm for one of them would imply a polynomial-time algorithm for all of them.

• It is conjectured that NP ≠ P, which means that we believe that polynomial-time algorithms for these problems are not merely unknown but nonexistent.

• While an NP-hardness result means, for example, that a full-fledged "textbook" solution to a problem such as MAX-CUT that is as clean and general as the algorithm for MIN-CUT probably does not exist, it does not mean that we need to give up whenever we see a MAX-CUT instance. Later in this course we will discuss several strategies to deal with NP-hardness, including average-case complexity and approximation algorithms.

15.7 EXERCISES

Exercise 15.1 — Poor man's Ladner's Theorem. Prove that if there is no n^{O(log² n)} time algorithm for 3SAT then there is some F ∈ NP such that F ∉ P and F is not NP complete.² ■

² Hint: Use the function F that on input a formula φ and a string of the form 1^t, outputs 1 if and only if φ is satisfiable and t = |φ|^{log |φ|}.

Exercise 15.2 — NP ≠ co-NP ⇒ NP ≠ P. Let 3SAT′ (the complement of 3SAT) be the function that on input a 3CNF formula φ returns 1 − 3SAT(φ). Prove that if 3SAT′ ∉ NP then P ≠ NP. See footnote for hint.³ ■

³ Hint: Prove and then use the fact that P is closed under complement.

Exercise 15.3 Define WSAT to be the following function: the input is a CNF formula φ where each clause is the OR of one to three variables (without negations), and a number k ∈ ℕ. For example, the following formula can be used for a valid input to WSAT: φ = (x_5 ∨ x_2 ∨ x_1) ∧ (x_1 ∨ x_3 ∨ x_0) ∧ (x_2 ∨ x_4 ∨ x_0). The output WSAT(φ, k) = 1 if and only if there exists a satisfying assignment to φ in which exactly k of the variables get the value 1. For example for the formula above

WSAT(φ, 2) = 1 since the assignment (1, 1, 0, 0, 0, 0) satisfies all the clauses. However WSAT(φ, 1) = 0 since there is no single variable appearing in all clauses.

Prove that WSAT is NP-complete. ■

Exercise 15.4 In the employee recruiting problem we are given a list of potential employees, each of which has some subset of m potential skills, and a number k. We need to assemble a team of k employees such that for every skill there would be one member of the team with this skill.

For example, if Alice has the skills "C programming", "NAND programming" and "Solving Differential Equations", Bob has the skills "C programming" and "Solving Differential Equations", and Charlie has the skills "NAND programming" and "Coffee Brewing", then if we want a team of two people that covers all the four skills, we would hire Alice and Charlie.

Define the function EMP s.t. on input the skills L of all potential employees (in the form of a sequence L of n lists L_1, …, L_n, each containing distinct numbers between 0 and m), and a number k, EMP(L, k) = 1 if and only if there is a subset S of k potential employees such that for every skill j in [m], there is an employee in S that has the skill j.

Prove that EMP is NP complete. ■

Exercise 15.5 — Balanced max cut. Prove that the "balanced variant" of the maximum cut problem is NP-complete, where this is defined as BMC : {0,1}* → {0,1} where for every graph G = (V, E) and k ∈ ℕ, BMC(G, k) = 1 if and only if there exists a cut S in G cutting at least k edges such that |S| = |V|/2. ■

Exercise 15.6 — Regular expression intersection. Let MANYREGS be the following function: On input a list of regular expressions exp_0, …, exp_m (represented as strings in some standard way), output 1 if and only if there is a single string x ∈ {0,1}* that matches all of them. Prove that MANYREGS is NP-hard. ■

15.8 BIBLIOGRAPHICAL NOTES

Aaronson's 120 page survey [Aar16] is a beautiful and extensive exposition of the P vs NP problem, its importance and status. See also Chapter 3 in Wigderson's excellent book [Wig19]. Johnson [Joh12] gives a survey of the historical development of the theory of NP completeness. The following web page keeps a catalog of failed

attempts at settling P vs NP. At the time of this writing, it lists about 110 papers claiming to resolve the question, of which about 60 claim to prove that P = NP and about 50 claim to prove that P ≠ NP. Eugene Lawler's quote on the "mystical power of twoness" was taken from the wonderful book "The Nature of Computation" by Moore and Mertens. See also this memorial essay on Lawler by Lenstra.

Learning Objectives:
• Explore the consequences of P = NP.
• Search-to-decision reduction: transform algorithms that solve the decision version to the search version for NP-complete problems.
• Optimization and learning problems.
• Quantifier elimination and solving problems in the polynomial hierarchy.
• What is the evidence for P = NP vs P ≠ NP?

16 What if P equals NP?

"You don't have to believe in God, but you should believe in The Book.", Paul Erdős, 1985.¹

"No more half measures, Walter", Mike Ehrmantraut in "Breaking Bad", 2010.

¹ Paul Erdős (1913-1996) was one of the most prolific mathematicians of all times. Though he was an atheist, Erdős often referred to "The Book" in which God keeps the most elegant proof of each mathematical theorem.

"The evidence in favor of [P ≠ NP] and [its algebraic counterpart] is so overwhelming, and the consequences of their failure are so grotesque, that their status may perhaps be compared to that of physical laws rather than that of ordinary mathematical conjectures.", Volker Strassen, laudation for Leslie Valiant, 1986.

"Suppose aliens invade the earth and threaten to obliterate it in a year's time unless human beings can find the [fifth Ramsey number]. We could marshal the world's best minds and fastest computers, and within a year we could probably calculate the value. If the aliens demanded the [sixth Ramsey number], however, we would have no choice but to launch a preemptive attack.", Paul Erdős, as quoted by Graham and Spencer, 1990.²

² The k-th Ramsey number, denoted as R(k, k), is the smallest number n such that for every graph G on n vertices, both G and its complement contain a k-sized independent set. If P = NP then we can compute R(k, k) in time polynomial in 2^k, while otherwise it can potentially take closer to 2^{2^{2k}} steps.

We have mentioned that the question of whether P = NP, which is equivalent to whether there is a polynomial-time algorithm for 3SAT, is the great open question of Computer Science. But why is it so important? In this chapter, we will try to figure out the implications of such an algorithm.

First, let us get one qualm out of the way. Sometimes people say, "What if P = NP but the best algorithm for 3SAT takes n^{1000} time?" Well, n^{1000} is much larger than, say, 2^{0.001√n} for any input smaller than 2^{50}, as large as a hard drive as you will encounter, and so another way to phrase this question is to say "what if the complexity of 3SAT is exponential for all inputs that we will ever encounter, but then grows much smaller than that?" To me this sounds like the computer science equivalent of asking, "what if the laws of physics change completely once they are out of the range of our telescopes?". Sure, this is a valid possibility, but wondering about it does not sound like the most productive use of our time.


So, as the saying goes, we’ll keep an open mind, but not so open that our brains fall out, and assume from now on that:

• There is a mathematical god,

and

• She does not "beat around the bush" or take "half measures".

What we mean by this is that we will consider two extreme scenarios:

• 3SAT is very easy: 3SAT has an O(n) or O(n²) time algorithm with a not too huge constant (say smaller than 10^6).

• 3SAT is very hard: 3SAT is exponentially hard and cannot be solved faster than 2^{εn} for some not too tiny ε > 0 (say at least 10^{−6}). We can even make the stronger assumption that for every sufficiently large n, the restriction of 3SAT to inputs of length n cannot be computed by a circuit of fewer than 2^{εn} gates.

At the time of writing, the fastest known algorithm for 3SAT requires more than 2^{0.35n} to solve n variable formulas, while we do not even know how to rule out the possibility that we can compute 3SAT using 10n gates. To put it in perspective, for the case n = 1000 our lower and upper bounds for the computational costs are apart by a factor of about 10^{100}. As far as we know, it could be the case that 1000-variable 3SAT can be solved in a millisecond on a first-generation iPhone, and it can also be the case that such instances require more than the age of the universe to solve on the world's fastest supercomputer. So far, most of our evidence points to the latter possibility of 3SAT being exponentially hard, but we have not ruled out the former possibility either. In this chapter we will explore some of the consequences of the "3SAT is easy" scenario.

16.1 SEARCH-TO-DECISION REDUCTION

A priori, having a fast algorithm for 3SAT might not seem so impressive. Sure, such an algorithm allows us to decide the satisfiability of not just 3CNF formulas but also of quadratic equations, as well as find out whether there is a long path in a graph, and solve many other decision problems. But this is not typically what we want to do. It's not enough to know if a formula is satisfiable: we want to discover the actual satisfying assignment. Similarly, it's not enough to find out if a graph has a long path: we want to actually find the path.

It turns out that if we can solve these decision problems, we can solve the corresponding search problems as well:

Theorem 16.1 — Search vs Decision. Suppose that P = NP. Then for every polynomial-time algorithm V and a, b ∈ ℕ, there is a polynomial-time algorithm FIND_V such that for every x ∈ {0,1}^n, if there exists y ∈ {0,1}^{an^b} satisfying V(xy) = 1, then FIND_V(x) finds some string y′ satisfying this condition.

P To understand what the statement of Theorem 16.1 means, let us look at the special case of the MAXCUT problem. It is not hard to see that there is a polynomial-time algorithm VERIFYCUT such that VERIFYCUT(G, k, S) = 1 if and only if S is a subset of G's vertices that cuts at least k edges. Theorem 16.1 implies that if P = NP then there is a polynomial-time algorithm FINDCUT that on input G, k outputs a set S such that VERIFYCUT(G, k, S) = 1 if such a set exists. This means that if P = NP, by trying all values of k we can find in polynomial time a maximum cut in any given graph. We can use a similar argument to show that if P = NP then we can find a satisfying assignment for every satisfiable 3CNF formula, find the longest path in a graph, solve integer programming, and so on and so forth.

Proof Idea: The idea behind the proof of Theorem 16.1 is simple; let us demonstrate it for the special case of 3SAT. (In fact, this case is not so "special": since 3SAT is NP-complete, we can reduce the task of solving the search problem for MAXCUT or any other problem in NP to the task of solving it for 3SAT.) Suppose that P = NP and we are given a satisfiable 3CNF formula φ, and we now want to find a satisfying assignment y for φ. Define 3SAT_0(φ) to output 1 if there is a satisfying assignment y for φ such that its first bit y_0 is 0, and similarly define 3SAT_1(φ) = 1 if there is a satisfying assignment y with y_0 = 1. The key observation is that both 3SAT_0 and 3SAT_1 are in NP, and so if P = NP then we can compute them in polynomial time as well. Thus we can use this to find the first bit of the satisfying assignment. We can continue in this way to find all the bits.

⋆

Proof of Theorem 16.1. Let V be some polynomial time algorithm and a, b ∈ ℕ some constants. Define the function STARTSWITH_V as follows: For every x ∈ {0,1}* and z ∈ {0,1}*, STARTSWITH_V(x, z) = 1 if and only if there exists some y ∈ {0,1}^{an^b − |z|} (where n = |x|) such that V(xzy) = 1. That is, STARTSWITH_V(x, z) outputs 1 if there is some string w of length a|x|^b such that V(x, w) = 1 and the first |z|

bits of w are z_0, …, z_{|z|−1}. Since, given x, y, z as above, we can check in polynomial time if V(xzy) = 1, the function STARTSWITH_V is in NP and hence if P = NP we can compute it in polynomial time. Now for every such polynomial-time V and a, b ∈ ℕ, we can implement FIND_V as follows:

Algorithm 16.2 — FIND_V: Search to decision reduction.

Input: x ∈ {0,1}^n
Output: z ∈ {0,1}^{an^b} s.t. V(xz) = 1, if such z exists. Otherwise output the empty string.
1: Initially z_0 = z_1 = ⋯ = z_{an^b−1} = 0.
2: for ℓ = 0, …, an^b − 1 do
3:   Let b_0 ← STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 0).
4:   Let b_1 ← STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 1).
5:   if b_0 = b_1 = 0 then
6:     return ""
7:     # Can't extend x z_0 … z_{ℓ−1} to an input V accepts
8:   end if
9:   if b_0 = 1 then
10:    z_ℓ ← 0
11:    # Can extend x z_0 … z_{ℓ−1} with 0 to accepting input
12:  else
13:    z_ℓ ← 1
14:    # Can extend x z_0 … z_{ℓ−1} with 1 to accepting input
15:  end if
16: end for
17: return z_0, …, z_{an^b−1}

To analyze Algorithm 16.2, note that it makes 2an^b invocations to STARTSWITH_V and hence if the latter is polynomial-time, then so is Algorithm 16.2. Now suppose that x is such that there exists some y ∈ {0,1}^{an^b} satisfying V(xy) = 1. We claim that at every step ℓ = 0, …, an^b − 1, we maintain the invariant that there exists y ∈ {0,1}^{an^b} whose first ℓ bits are z s.t. V(xy) = 1. Note that this claim implies the theorem, since in particular it means that for ℓ = an^b − 1, z satisfies V(xz) = 1.

We prove the claim by induction. For ℓ = 0, this holds vacuously. Now for every ℓ > 0, if the call STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 0) returns 1, then we are guaranteed the invariant by definition of STARTSWITH_V. Now under our inductive hypothesis, there are y_ℓ, …, y_{an^b−1} such that V(x z_0 ⋯ z_{ℓ−1} y_ℓ ⋯ y_{an^b−1}) = 1. If the call to STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 0) returns 0 then it must be the case that y_ℓ = 1, and hence when we set z_ℓ = 1 we maintain the invariant. ■
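Here is a Python rendering of Algorithm 16.2 (my sketch, added for concreteness; not from the book). The hypothetical polynomial-time algorithm STARTSWITH_V is passed in as a parameter, and since we do not actually have it unless P = NP, the demo instantiates it by exponential-time brute force:

# Sketch of Algorithm 16.2: search-to-decision reduction.
from itertools import product

def find_v(V, x, length, startswith_v):
    """Return z of the given length with V(x + z) == 1, found bit by bit,
    or the empty string if no such z exists."""
    z = ""
    for _ in range(length):
        if startswith_v(V, x, length, z + "0"):
            z += "0"      # can extend with 0 to an accepting input
        elif startswith_v(V, x, length, z + "1"):
            z += "1"      # can extend with 1 to an accepting input
        else:
            return ""     # can't extend z at all to an input V accepts
    return z

def brute_startswith(V, x, length, z):
    # exponential-time stand-in for the NP problem STARTSWITH_V
    return any(V(x + z + "".join(y))
               for y in product("01", repeat=length - len(z)))

V = lambda s: s.count("1") % 2 == 1          # toy verifier: odd number of 1s
print(find_v(V, "1", 4, brute_startswith))   # prints "0000"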

16.2 OPTIMIZATION

Theorem 16.1 allows us to find solutions for NP problems if P = NP, but it is not immediately clear that we can find the optimal solution. For example, suppose that P = NP, and you are given a graph G. Can you find the longest simple path in G in polynomial time?

P This is actually an excellent question for you to attempt on your own. That is, assuming P = NP, give a polynomial-time algorithm that on input a graph G, outputs a maximally long simple path in the graph G.

The answer is Yes. The idea is simple: if P = NP then we can find out in polynomial time if an n-vertex graph G contains a simple path of length n, and moreover, by Theorem 16.1, if G does contain such a path, then we can find it. (Can you see why?) If G does not contain a simple path of length n, then we will check if it contains a simple path of length n − 1, and continue in this way to find the largest k such that G contains a simple path of length k.

The above reasoning was not specifically tailored to finding paths in graphs. In fact, it can be vastly generalized to proving the following result:

Theorem 16.3 — Optimization from P = NP. Suppose that P = NP. Then for every polynomial-time computable function f : {0,1}* → ℕ (identifying {0,1}* with natural numbers via the binary representation) there is a polynomial-time algorithm OPT such that on input x ∈ {0,1}*, m ∈ ℕ,

OPT(x, 1^m) = max_{y∈{0,1}^m} f(x, y).   (16.1)

Moreover under the same assumption, there is a polynomial-time algorithm FINDOPT such that for every x ∈ {0,1}*, m ∈ ℕ, FINDOPT(x, 1^m) outputs y* ∈ {0,1}^m such that f(x, y*) = max_{y∈{0,1}^m} f(x, y).

P The statement of Theorem 16.3 is a bit cumbersome. To understand it, think how it would subsume the example above of a polynomial time algorithm for finding the maximum length path in a graph. In this case the function f would be the map that on input a pair x, y outputs 0 if the pair (x, y) does not represent some graph and a simple path inside the graph respectively; otherwise f(x, y) would equal the length of the path y in the graph x. Since a path in an n vertex graph can be represented by at most

n log n bits, for every x representing a graph of n vertices, finding max_{y∈{0,1}^{n log n}} f(x, y) corresponds to

finding the length of the maximum simple path in the graph corresponding to x, and finding the string y* that achieves this maximum corresponds to actually finding the path.

Proof Idea: The proof follows by generalizing our ideas from the longest path example above. Let f be as in the theorem statement. If P = NP then for every string x ∈ {0,1}* and number k, we can test in poly(|x|, m) time whether there exists y ∈ {0,1}^m such that f(x, y) ≥ k, or in other words test whether max_{y∈{0,1}^m} f(x, y) ≥ k. If f(x, y) is an integer between 0 and poly(|x| + |y|) (as is the case in the example of longest path) then we can just try out all possibilities for k to find the maximum number k for which max_y f(x, y) ≥ k. Otherwise, we can use binary search to hone down on the right value. Once we do so, we can use search-to-decision to actually find the string y* that achieves the maximum.

⋆

Proof of Theorem 16.3. For every f as in the theorem statement, we can define the Boolean function F : {0,1}* → {0,1} as follows:

F(x, 1^m, k) = 1 if ∃ y ∈ {0,1}^m s.t. f(x, y) ≥ k, and F(x, 1^m, k) = 0 otherwise.   (16.2)

Since f is computable in polynomial time, F is in NP, and so under our assumption that P = NP, F itself can be computed in polynomial time. Now, for every x and m, we can compute the largest k such that F(x, 1^m, k) = 1 by a binary search. Specifically, we will do this as follows:

1. We maintain two numbers a, b such that we are guaranteed that a ≤ max_{y∈{0,1}^m} f(x, y) < b.

2. Initially we set a = 0 and b = 2^{T(n)} where T(n) is the running time of f. (A function with running time T(n) can't output more than T(n) bits and so can't output a number larger than 2^{T(n)}.)

3. At each point in time, we compute the midpoint c = ⌊(a + b)/2⌋ and let y = F(x, 1^m, c).
   a. If y = 1 then we set a = c and leave b as it is.
   b. If y = 0 then we set b = c and leave a as it is.

4. We then go back to step 3, until b ≤ a + 1.

Since b − a shrinks by a factor of 2 at each step, within log_2 2^{T(n)} = T(n) steps, we will get to the point at which b ≤ a + 1, and then we can simply output a. Once we find the maximum value k such that F(x, 1^m, k) = 1, we can use the search to decision reduction of Theorem 16.1 to obtain the actual value y* ∈ {0,1}^m such that f(x, y*) = k. ■
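The binary search in this proof takes only a few lines of Python (my sketch; the function F stands in for the hypothetical polynomial-time algorithm for Eq. (16.2) that exists if P = NP, and is instantiated by brute force below just to make the demo runnable):

# Sketch of the binary search in the proof of Theorem 16.3.
from itertools import product

def opt(F, x, m, T):
    """Largest k with F(x, m, k) == 1, assuming 0 <= max_y f(x, y) < 2**T."""
    a, b = 0, 2 ** T            # invariant: a <= max_y f(x, y) < b
    while b > a + 1:
        c = (a + b) // 2        # midpoint
        if F(x, m, c):
            a = c               # the maximum is at least c
        else:
            b = c               # the maximum is below c
    return a

# brute-force stand-in: f(x, y) = number of coordinates where x and y agree
f = lambda x, y: sum(xc == yc for xc, yc in zip(x, y))
F = lambda x, m, k: any(f(x, "".join(y)) >= k
                        for y in product("01", repeat=m))
print(opt(F, "1011", 4, 3))     # prints 4, achieved by y = "1011"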

Example 16.4 — Integer programming. One application for Theorem 16.3 is in solving optimization problems. For example, the task of linear programming is to find y ∈ ℝ^n that maximizes some linear objective ∑_{i=0}^{n−1} c_i y_i subject to the constraint that y satisfies linear inequalities of the form ∑_{i=0}^{n−1} a_i y_i ≤ c. As we discussed in Section 12.1.3, there is a known polynomial-time algorithm for linear programming. However, if we want to place additional constraints on y, such as requiring the coordinates of y to be integer or 0/1 valued, then the best-known algorithms run in exponential time in the worst case. However, if P = NP then Theorem 16.3 tells us that we would be able to solve all problems of this form in polynomial time. For every string x that describes a set of constraints and objective, we will define a function f such that if y satisfies the constraints of x then f(x, y) is the value of the objective, and otherwise we set f(x, y) = −M where M is some large number. We can then use Theorem 16.3 to compute the y that maximizes f(x, y) and that will give us the assignment for the variables that satisfies our constraints and maximizes the objective. (If the computation results in y such that f(x, y) = −M then we can double M and try again; if the true maximum objective is achieved by some string y*, then eventually M would be large enough so that −M would be smaller than the objective achieved by y*, and hence when we run the procedure of Theorem 16.3 we would get a value larger than −M.)

R Remark 16.5 — Need for binary search. In many examples, such as the case of finding the longest path, we don't need to use the binary search step in Theorem 16.3, and can simply enumerate over all possible values for k until we find the correct one. One example where we do need to use this binary search step is the problem of finding a maximum length path in a weighted graph. This is the problem where G is a weighted graph, and every edge of G is given a weight which is a number between 0 and 2^k. Theorem 16.3 shows that we can find the maximum-weight simple path in G (i.e., the simple path maximizing the sum of


the weights of its edges) in time polynomial in the number of vertices and in k.

Beyond just this example there is a vast field of mathematical optimization that studies problems of the same form as in Theorem 16.3. In the context of optimization, typically x denotes a set of constraints over some variables (that can be Boolean, integer, or real valued), y encodes an assignment to these variables, and f(x, y) is the value of some objective function that we want to maximize. Given that we don't know efficient algorithms for NP complete problems, researchers in optimization study special cases of functions f (such as linear programming and semidefinite programming) where it is possible to optimize the value efficiently. Optimization is widely used in a great many scientific areas including machine learning, engineering, economics and operations research.

16.2.1 Example: Supervised learning

One classical optimization task is supervised learning. In supervised learning we are given a list of examples x_0, x_1, …, x_{m−1} (where we can think of each x_i as a string in {0,1}^n for some n) and the labels y_0, …, y_{m−1} for them (which we will think of simply as bits, i.e., y_i ∈ {0,1}). For example, we can think of the x_i's as images of either dogs or cats, for which y_i = 1 in the former case and y_i = 0 in the latter case. Our goal is to come up with a hypothesis or predictor h : {0,1}^n → {0,1} such that if we are given a new example x that has an (unknown to us) label y, then with high probability h will predict the label. That is, with high probability it will hold that h(x) = y. The idea in supervised learning is to use the Occam's Razor principle: the simplest hypothesis that explains the data is likely to be correct. There are several ways to model this, but one popular approach is to pick some fairly simple function H : {0,1}^{k+n} → {0,1}. We think of the first k inputs as the parameters and the last n inputs as the example data. (For example, we can think of the first k inputs of H as specifying the weights and connections for some neural network that will then be applied on the latter n inputs.) We can then phrase the supervised learning problem as finding, given a set of labeled examples S = {(x_0, y_0), …, (x_{m−1}, y_{m−1})}, the set of parameters θ_0, …, θ_{k−1} ∈ {0,1} that minimizes the number of errors made by the predictor x ↦ H(θ, x).³

³ This is often known as Empirical Risk Minimization.

In other words, we can define for every set S as above the function F_S : {0,1}^k → [m] such that F_S(θ) = ∑_{(x,y)∈S} |H(θ, x) − y|. Now, finding the value θ that minimizes F_S(θ) is equivalent to solving the supervised learning problem with respect to H. For every polynomial-

time computable H : {0,1}^{k+n} → {0,1}, the task of minimizing F_S(θ) can be "massaged" to fit the form of Theorem 16.3 and hence if P = NP, then we can solve the supervised learning problem in great generality. In fact, this observation extends to essentially any learning model, and allows for finding the optimal predictors given the minimum number of examples. (This is in contrast to many current learning algorithms, which often rely on having access to an extremely large number of examples, far beyond the minimum needed, and in particular far beyond the number of examples humans use for the same tasks.)
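To make the "massaging" concrete, here is a sketch (my illustration, not the book's) of the empirical-risk objective F_S for a toy hypothesis class in which θ selects a subset of coordinates and H(θ, x) outputs the parity of the selected bits of x:

# Sketch of the objective F_S(theta) = sum over (x,y) in S of |H(theta,x) - y|.
from itertools import product

def H(theta, x):
    # theta and x are equal-length 0/1 tuples; predict parity of masked bits
    return sum(t * xi for t, xi in zip(theta, x)) % 2

def F_S(theta, S):
    """Number of examples in S mislabeled by the predictor x -> H(theta, x)."""
    return sum(abs(H(theta, x) - y) for x, y in S)

S = [((0, 1, 1), 0), ((1, 1, 0), 1), ((1, 0, 1), 0)]
# Theorem 16.3 would find a minimizer in polynomial time if P = NP;
# here we brute force over all 2^3 parameter vectors for the tiny demo.
best = min(product((0, 1), repeat=3), key=lambda th: F_S(th, S))
print(best, F_S(best, S))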

16.2.2 Example: Breaking cryptosystems

We will discuss cryptography later in this course, but it turns out that if P = NP then almost every cryptosystem can be efficiently broken. One approach is to treat finding an encryption key as an instance of a supervised learning problem. If there is an encryption scheme that maps a "plaintext" message p and a key θ to a "ciphertext" c, then given examples of ciphertext/plaintext pairs of the form (c_0, p_0), …, (c_{m−1}, p_{m−1}), our goal is to find the key θ such that E(θ, p_i) = c_i where E is the encryption algorithm. While you might think getting such "labeled examples" is unrealistic, it turns out (as many amateur home-brew crypto designers learn the hard way) that this is actually quite common in real-life scenarios, and that it is also possible to relax the assumption to having more minimal prior information about the plaintext (e.g., that it is English text). We defer a more formal treatment to Chapter 21.

16.3 FINDING MATHEMATICAL PROOFS

In the context of Gödel's Theorem, we discussed the notion of a proof system (see Section 11.1). Generally speaking, a proof system can be thought of as an algorithm V : {0,1}* → {0,1} (known as the verifier) such that given a statement x ∈ {0,1}* and a candidate proof w ∈ {0,1}*, V(x, w) = 1 if and only if w encodes a valid proof for the statement x. Any type of proof system that is used in mathematics for geometry, number theory, analysis, etc., is an instance of this form. In fact, standard mathematical proof systems have an even simpler form where the proof w encodes a sequence of lines w^0, …, w^m (each of which is itself a binary string) such that each line w^i is either an axiom or follows from some prior lines through an application of some inference rule. For example, Peano's axioms encode a set of axioms and rules for the natural numbers, and one can use them to formalize proofs in number theory. Also, there are some even stronger axiomatic systems, the most popular one being Zermelo–Fraenkel with the Axiom of Choice or ZFC for short. Thus, although mathematicians typically

write their papers in natural language, proofs of number theorists can typically be translated to ZFC or similar systems, and so in particular the existence of an n-page proof for a statement x implies that there exists a string w of length poly(n) (in fact often O(n) or O(n²)) that encodes the proof in such a system. Moreover, because verifying a proof simply involves going over each line and checking that it does indeed follow from the prior lines, it is fairly easy to do that in O(|w|) or O(|w|²) time (where as usual |w| denotes the length of the proof w). This means that for every reasonable proof system V, the following function SHORTPROOF_V : {0,1}* → {0,1} is in NP, where for every input of the form (x, 1^m), SHORTPROOF_V(x, 1^m) = 1 if and only if there exists w ∈ {0,1}* with |w| ≤ m s.t. V(xw) = 1. That is, SHORTPROOF_V(x, 1^m) = 1 if there is a proof (in the system V) of length at most m bits that x is true. Thus, if P = NP, then despite Gödel's Incompleteness Theorems, we can still automate mathematics in the sense of finding proofs that are not too long for every statement that has one. (Frankly speaking, if the shortest proof for some statement requires a terabyte, then human mathematicians won't ever find this proof either.) For this reason, Gödel himself felt that the question of whether SHORTPROOF_V has a polynomial time algorithm is of great interest. As Gödel wrote in a letter to John von Neumann in 1956 (before the concept of NP or even "polynomial time" was formally defined):

One can obviously easily construct a Turing machine, which for every formula F in first order predicate logic and every natural number n, allows one to decide if there is a proof of F of length n (length = number of symbols). Let ψ(F, n) be the number of steps the machine requires for this and let φ(n) = max_F ψ(F, n). The question is how fast φ(n) grows for an optimal machine. One can show that φ(n) ≥ k ⋅ n [for some constant k > 0]. If there really were a machine with φ(n) ∼ k ⋅ n (or even ∼ k ⋅ n²), this would have consequences of the greatest importance. Namely, it would obviously mean that in spite of the undecidability of the Entscheidungsproblem,⁴ the mental work of a mathematician concerning Yes-or-No questions could be completely replaced by a machine. After all, one would simply have to choose the natural number n so large that when the machine does not deliver a result, it makes no sense to think more about the problem.

⁴ The undecidability of the Entscheidungsproblem refers to the uncomputability of the function that maps a statement in first order logic to 1 if and only if that statement has a proof.

For many reasonable proof systems (including the one that Gödel referred to), SHORTPROOF_V is in fact NP-complete, and so Gödel can be thought of as the first person to formulate the P vs NP question. Unfortunately, the letter was only discovered in 1988.
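To see concretely why SHORTPROOF_V is in NP, note that the certificate is simply the candidate proof itself; here is a two-line sketch (mine, not the book's) of the corresponding NP verifier:

# NP verifier for SHORTPROOF_V: given (x, 1^m) and a candidate proof w,
# accept iff w is short enough and V accepts it as a proof of x.
def shortproof_verifier(V, x, m, w):
    return len(w) <= m and V(x, w) == 1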

16.4 QUANTIFIER ELIMINATION (ADVANCED)

If P = NP then we can solve all NP search and optimization problems in polynomial time. But can we do more? It turns out that the answer is Yes, we can! An NP decision problem can be thought of as the task of deciding, given some string x ∈ {0,1}*, the truth of a statement of the form

∃_{y∈{0,1}^{p(|x|)}} V(xy) = 1   (16.3)

for some polynomial-time algorithm V and polynomial p : ℕ → ℕ. That is, we are trying to determine, given some string x, whether there exists a string y such that x and y satisfy some polynomial-time checkable condition V. For example, in the independent set problem, the string x represents a graph G and a number k, the string y represents some subset S of G's vertices, and the condition that we check is whether |S| ≥ k and there is no edge {u, v} in G such that both u ∈ S and v ∈ S.

We can consider more general statements such as checking, given a string x ∈ {0,1}*, the truth of a statement of the form

∃_{y∈{0,1}^{p_0(|x|)}} ∀_{z∈{0,1}^{p_1(|x|)}} V(xyz) = 1   (16.4)

which in words corresponds to checking, given some string x, whether there exists a string y such that for every string z, the triple (x, y, z) satisfies some polynomial-time checkable condition. We can also consider more levels of quantifiers such as checking the truth of the statement

∃_{y∈{0,1}^{p_0(|x|)}} ∀_{z∈{0,1}^{p_1(|x|)}} ∃_{w∈{0,1}^{p_2(|x|)}} V(xyzw) = 1   (16.5)

and so on and so forth.

For example, given an n-input NAND-CIRC program P, we might want to find the smallest NAND-CIRC program P′ that computes the same function as P. The question of whether there is such a P′ that can be described by a string of at most s bits can be phrased as

∃_{P′∈{0,1}^s} ∀_{x∈{0,1}^n} P′(x) = P(x)   (16.6)

which has the form (16.4).⁵

⁵ Since NAND-CIRC programs are equivalent to Boolean circuits, the search problem corresponding to (16.6) is known as the circuit minimization problem and is widely studied in Engineering. You can skip ahead to Section 16.4.1 to see a particularly compelling application of this.

Another example of a statement involving a levels of quantifiers would be to check, given a chess position x, whether there is a strategy that guarantees that White wins within a steps. For example if a = 3 we would want to check if given the board position x, there exists a move y for White such that for every move z for Black there exists a move w for White that ends in a checkmate.

It turns out that if P = NP then we can solve these kinds of problems as well:

Theorem 16.6 — Polynomial hierarchy collapse. If P = NP then for every a ∈ ℕ, polynomial p : ℕ → ℕ and polynomial-time algorithm V, there is a polynomial-time algorithm SOLVE_{V,a} that on input x ∈ {0,1}^n returns 1 if and only if

∃_{y_0∈{0,1}^m} ∀_{y_1∈{0,1}^m} ⋯ 풬_{y_{a−1}∈{0,1}^m} V(x y_0 y_1 ⋯ y_{a−1}) = 1   (16.7)

where m = p(n) and 풬 is either ∃ or ∀ depending on whether a is odd or even, respectively.⁶

⁶ For the ease of notation, we assume that all the strings we quantify over have the same length m = p(n), but using simple padding one can show that this captures the general case of strings of different polynomial lengths.

Proof Idea: To understand the idea behind the proof, consider the special case where we want to decide, given x ∈ {0,1}^n, whether for every y ∈ {0,1}^n there exists z ∈ {0,1}^n such that V(xyz) = 1. Consider the function F such that F(xy) = 1 if there exists z ∈ {0,1}^n such that V(xyz) = 1. Since V runs in polynomial-time, F ∈ NP and hence if P = NP, then there is an algorithm V′ that on input x, y outputs 1 if and only if there exists z ∈ {0,1}^n such that V(xyz) = 1. Now we can see that the original statement we consider is true if and only if for every y ∈ {0,1}^n, V′(xy) = 1, which means it is false if and only if the following condition (∗) holds: there exists some y ∈ {0,1}^n such that V′(xy) = 0. But for every x ∈ {0,1}^n, the question of whether the condition (∗) holds is itself in NP (as we assumed V′ can be computed in polynomial time) and hence under the assumption that P = NP we can determine in polynomial time whether the condition (∗), and hence our original statement, is true.

⋆

Proof of Theorem 16.6. We prove the theorem by induction. We assume that there is a polynomial-time algorithm SOLVE_{V,a−1} that can solve the problem (16.7) for a − 1 quantifiers and use that to solve the problem for a quantifiers. For a = 1, SOLVE_{V,a−1}(x) = 1 iff V(x) = 1, which is a polynomial-time computation since V runs in polynomial time. For every x, y_0, define the statement φ_{x,y_0} to be the following:

φ_{x,y_0} = ∀_{y_1∈{0,1}^m} ∃_{y_2∈{0,1}^m} ⋯ 풬_{y_{a−1}∈{0,1}^m} V(x y_0 y_1 ⋯ y_{a−1}) = 1   (16.8)

By the definition of SOLVE_{V,a}, for every x ∈ {0,1}^n, our goal is that SOLVE_{V,a}(x) = 1 if and only if there exists y_0 ∈ {0,1}^m such that φ_{x,y_0} is true. The negation of φ_{x,y_0} is the statement

φ′_{x,y_0} = ∃_{y_1∈{0,1}^m} ∀_{y_2∈{0,1}^m} ⋯ 풬′_{y_{a−1}∈{0,1}^m} V(x y_0 y_1 ⋯ y_{a−1}) = 0   (16.9)

where 풬′ is ∃ if 풬 was ∀ and 풬′ is ∀ otherwise. (Please stop and verify that you understand why this is true; this is a generalization of the fact that if Ψ is some logical condition then the negation of ∃_y ∀_z Ψ(y, z) is ∀_y ∃_z ¬Ψ(y, z).) The crucial observation is that φ′_{x,y_0} is exactly a statement of the form we consider with a − 1 quantifiers instead of a, and hence by our inductive hypothesis there is some polynomial time algorithm S′ that on input x y_0 outputs 1 if and only if φ′_{x,y_0} is true. If we let S be the algorithm that on input x, y_0 outputs 1 − S′(x y_0) then we see that S outputs 1 if and only if φ_{x,y_0} is true. Hence we can rephrase the original statement (16.7) as follows:

∃_{y_0∈{0,1}^m} S(x y_0) = 1   (16.10)

but since S is a polynomial-time algorithm, Eq. (16.10) is clearly a statement in NP and hence under our assumption that P = NP there is a polynomial time algorithm that on input x ∈ {0,1}^n, will determine if (16.10) is true and so also if the original statement (16.7) is true. ■
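The structure of statements of the form (16.7) can be made concrete with a small brute-force evaluator (my sketch; if P = NP, the proof above replaces this exponential recursion with a polynomial-time NP oracle call at each of the a levels):

# Brute-force evaluation of: Exists y_0 Forall y_1 ... V(x, y_0, ..., y_{a-1}) = 1.
from itertools import product

def solve(V, x, a, m, prefix=()):
    if a == 0:
        return V(x, *prefix)
    # quantifiers alternate: Exists at even depth, Forall at odd depth
    quantifier = any if len(prefix) % 2 == 0 else all
    return quantifier(solve(V, x, a - 1, m, prefix + (y,))
                      for y in product((0, 1), repeat=m))

# toy check: Exists y0 Forall y1 such that y0[0] OR y1[0] = 1
V = lambda x, y0, y1: (y0[0] | y1[0]) == 1
print(solve(V, None, 2, 2))   # prints True (pick any y0 with y0[0] = 1)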

The algorithm of Theorem 16.6 can also solve the search problem as well: find the value y_0 that certifies the truth of (16.7). We note that while this algorithm is in polynomial time, the exponent of this polynomial blows up quite fast. If the original NANDSAT algorithm required Ω(n²) time, solving a levels of quantifiers would require time Ω(n^{2^a}).⁷

⁷ We do not know whether such loss is inherent. As far as we can tell, it's possible that the quantified boolean formula problem has a linear-time algorithm. We will, however, see later in this course that it satisfies a notion known as PSPACE-hardness that is even stronger than NP-hardness.

16.4.1 Application: self improving algorithm for 3SAT

Suppose that we found a polynomial-time algorithm A for 3SAT that is "good but not great". For example, maybe our algorithm runs in time cn² for some not too small constant c. However, it's possible that the best possible 3SAT algorithm is actually much more efficient than that. Perhaps, as we guessed before, there is a circuit C* of at most 10^6 · n gates that computes 3SAT on n variables, and we simply haven't discovered it yet. We can use Theorem 16.6 to "bootstrap" our original "good but not great" 3SAT algorithm to discover the optimal one. The idea is that we can phrase the question of whether there exists a size s circuit that computes 3SAT for all length n inputs as follows: there exists a size ≤ s circuit C such that for every formula φ described by a string of length at most n, if C(φ) = 1 then there exists an assignment x to the variables of φ that satisfies it. One can see that this is a statement of the form (16.5) and hence if P = NP we can solve it in polynomial time as well. We can therefore imagine investing huge computational resources in running A one time to discover the circuit C* and then using C* for all further computation.

16.5 APPROXIMATING COUNTING PROBLEMS AND POSTERIOR SAMPLING (ADVANCED, OPTIONAL)

Given a Boolean circuit C, if P = NP then we can find an input x (if one exists) such that C(x) = 1. But what if there is more than one x like that? Clearly we can't efficiently output all such x's; there might be exponentially many. But we can get an arbitrarily good multiplicative approximation (i.e., a 1 ± ε factor for arbitrarily small ε > 0) for the number of such x's, as well as output a (nearly) uniform member of this set. The details are beyond the scope of this book, but this result is formally stated in the following theorem (whose proof is omitted).

Theorem 16.7 — Approximate counting if P = NP. Let V : {0,1}* → {0,1} be some polynomial-time algorithm, and suppose that P = NP. Then there exists an algorithm COUNT_V that on input x, 1^m, ε, runs in time polynomial in |x|, m, 1/ε and outputs a number in [2^m + 1] satisfying

(1−ε) COUNT_V(x, m, ε) ≤ |{y ∈ {0,1}^m : V(xy) = 1}| ≤ (1+ε) COUNT_V(x, m, ε).   (16.11)

In other words, the algorithm COUNT_V gives an approximation up to a factor of 1 ± ε for the number of witnesses for x with respect to the verifying algorithm V. Once again, to understand this theorem it can be useful to see how it implies that if P = NP then there is a polynomial-time algorithm that given a graph G and a number k, can compute a number K that is within a 1 ± 0.01 factor equal to the number of simple paths in G of length k. (That is, K is between 0.99 to 1.01 times the number of such paths.)

Posterior sampling and probabilistic programming. The algorithm for counting can also be extended to sampling from a given posterior distribution. That is, if C : {0,1}^n → {0,1}^m is a Boolean circuit and y ∈ {0,1}^m, then if P = NP we can sample from (a close approximation of) the distribution of uniform x ∈ {0,1}^n conditioned on C(x) = y. This task is known as posterior sampling and is crucial for Bayesian data analysis. These days it is known how to achieve posterior sampling only for circuits of very special form, and even in these cases more often than not we do not have guarantees on the quality of the sampling algorithm. The field of making inferences by sampling from posterior distributions specified by circuits or programs is known as probabilistic programming.
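For contrast, the exact quantity that COUNT_V approximates can of course be computed by brute force in exponential time (a baseline sketch, my addition):

# Exact witness counting: |{y in {0,1}^m : V(xy) = 1}|. If P = NP,
# COUNT_V approximates this up to a (1 +- epsilon) factor in polynomial time.
from itertools import product

def count_witnesses(V, x, m):
    return sum(1 for y in product("01", repeat=m) if V(x + "".join(y)))

V = lambda s: s.count("1") % 2 == 1    # toy verifier: odd number of 1s
print(count_witnesses(V, "", 3))       # prints 4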

16.6 WHAT DOES ALL OF THIS IMPLY?

So, what will happen if we have a 10^6 · n time algorithm for 3SAT? We have mentioned that NP-hard problems arise in many contexts, and indeed scientists, engineers, programmers and others routinely encounter such problems in their daily work. A better 3SAT algorithm will probably make their lives easier, but that is the wrong place to look for the most foundational consequences. Indeed, while the invention of electronic computers did of course make it easier to do calculations that people were already doing with mechanical devices and pen and paper, the main applications computers are used for today were not even imagined before their invention.

An exponentially faster algorithm for all NP problems would be no less radical an improvement (and indeed, in some sense would be more) than the computer itself, and it is as hard for us to imagine what it would imply as it was for Babbage to envision today's world. For starters, such an algorithm would completely change the way we program computers. Since we could automatically find the "best" (in any measure we chose) program that achieves a certain task, we would not need to define how to achieve a task, but only specify tests as to what would be a good solution, and could also ensure that a program satisfies an exponential number of tests without actually running them.

The possibility that P = NP is often described as "automating creativity". There is something to that analogy, as we often think of a creative solution as one that is hard to discover but that, once the "spark" hits, is easy to verify. But there is also an element of hubris to that statement, implying that the most impressive consequence of such an algorithmic breakthrough will be that computers would succeed in doing something that humans already do today. Nevertheless, artificial intelligence, like many other fields, will clearly be greatly impacted by an efficient 3SAT algorithm. For example, it is clearly much easier to find a better Chess-playing algorithm when, given any algorithm P, you can find the smallest algorithm P′ that plays Chess better than P. Moreover, as we mentioned above, much of machine learning (and statistical reasoning in general) is about finding "simple" concepts that explain the observed data, and if NP = P, we could search for such concepts automatically for any notion of "simplicity" we see fit. In fact, we could even "skip the middle man" and do an automatic search for the learning algorithm with smallest generalization error. Ultimately the field of Artificial Intelligence is about trying to "shortcut" billions of years of evolution to obtain artificial programs that match (or beat) the performance of natural ones, and a fast algorithm for NP would provide the ultimate shortcut.⁸

⁸ One interesting theory is that P = NP and evolution has already discovered this algorithm, which we are already using without realizing it. At the moment, there seems to be very little evidence for such a scenario. In fact, we have some partial results in the other direction showing that, regardless of whether P = NP, many types of "local search" or "evolutionary" algorithms require exponential time to solve 3SAT and other NP-hard problems.

More generally, a faster algorithm for NP problems would be immensely useful in any field where one is faced with computational or quantitative problems, which is basically all fields of science, math, and engineering. This will not only help with concrete problems such as designing a better bridge, or finding a better drug, but also with addressing basic mysteries such as trying to find scientific theories or "laws of nature". In a fascinating talk, Nima Arkani-Hamed discusses the effort of finding scientific theories in much the same language as one would describe solving an NP problem, for which the solution is easy to verify or seems "inevitable", once found, but that requires searching through a huge landscape of possibilities to reach, and that often can get "stuck" at local optima:

“the laws of nature have this amazing feeling of inevitability… which is associ- ated with local perfection.”

“The classical picture of the world is the top of a local mountain in the space of ideas. And you go up to the top and it looks amazing up there and absolutely incredible. And you learn that there is a taller mountain out there. Find it, Mount Quantum…. they’re not smoothly connected … you’ve got to make a jump to go from classical to quantum … This also tells you why we have such major challenges in trying to extend our understanding of physics. We don’t have these knobs, and little wheels, and twiddles that we can turn. We have to learn how to make these jumps. And it is a tall order. And that’s why things are difficult.”

Finding an efficient algorithm for NP amounts to always being able to search through an exponential space and find not just the "local" mountain, but the tallest peak.

But perhaps more than any computational speedups, a fast algorithm for NP problems would bring about a new type of understanding. In many of the areas where NP-completeness arises, it is not as much a barrier for solving computational problems as it is a barrier for obtaining "closed-form formulas" or other types of more constructive descriptions of the behavior of natural, biological, social and other systems. A better algorithm for NP, even if it is "merely" 2^{√n}-time, seems to require obtaining a new way to understand these types of systems, whether it is characterizing Nash equilibria, spin-glass configurations, entangled quantum states, or any of the other questions where NP is currently a barrier for analytical understanding. Such new insights would be very fruitful regardless of their computational utility.

Big Idea 23: If P = NP, we can efficiently solve a fantastic number of decision, search, optimization, counting, and sampling problems from all areas of human endeavors.

16.7 CAN P ≠ NP BE NEITHER TRUE NOR FALSE?

The Continuum Hypothesis is a conjecture made by Georg Cantor in 1878, positing the non-existence of a certain type of infinite cardinality. (One way to phrase it is that for every infinite subset S of the real numbers ℝ, either there is a one-to-one and onto function f : S → ℝ or there is a one-to-one and onto function f : S → ℕ.) This was considered one of the most important open problems in set theory, and settling its truth or falseness was the first problem put forward by Hilbert in the 1900 address we mentioned before. However, using the theories developed by Gödel and Turing, in 1963 Paul Cohen proved that both the Continuum Hypothesis and its negation are consistent with the standard axioms of set theory (i.e., the Zermelo-Fraenkel axioms + the Axiom of Choice, or "ZFC" for short). Formally, what he proved is that if ZFC is consistent, then so is ZFC when we assume either the continuum hypothesis or its negation. Today, many (though not all) mathematicians interpret this result as saying that the Continuum Hypothesis is neither true nor false, but rather is an axiomatic choice that we are free to make one way or the other. Could the same hold for P ≠ NP?

In short, the answer is No. For example, suppose that we are trying to decide between the "3SAT is easy" conjecture (there is a 10^6 · n time algorithm for 3SAT) and the "3SAT is hard" conjecture (for every n, any NAND-CIRC program that solves n variable 3SAT takes 2^{10^{−6} n} lines). Then, since for n = 10^8, 2^{10^{−6} n} > 10^6 · n, this boils down to the finite question of deciding whether or not there is a 10^{13}-line NAND-CIRC program deciding 3SAT on formulas with 10^8 variables. If there is such a program then there is a finite proof of its existence, namely the approximately 1TB file describing the program, and for which the verification is the (finite in principle though infeasible in practice) process of checking that it succeeds on all inputs.⁹ If there isn't such a program, then there is also a finite proof of that, though any such proof would take longer since we would need to enumerate over all programs as well. Ultimately, since it boils down to a finite statement about bits and numbers, either the statement or its negation must follow from the standard axioms of arithmetic in a finite number of arithmetic steps. Thus, we cannot justify our ignorance in distinguishing between the "3SAT easy" and "3SAT hard" cases by claiming that this might be an inherently ill-defined question. Similar reasoning (with different numbers) applies to other variants of the P vs NP question. We note that in the case that 3SAT is hard, it may well be that there is no short proof of this fact using the standard axioms, and this is a question that people have been studying in various restricted forms of proof systems.

⁹ This inefficiency is not necessarily inherent. Later in this course we may discuss results in program-checking, interactive proofs, and average-case complexity, that can be used for efficient verification of proofs of related statements. In contrast, the inefficiency of verifying failure of all programs could well be inherent.

16.8 IS P = NP "IN PRACTICE"?

The fact that a problem is NP-hard means that we believe there is no efficient algorithm that solves it in the worst case. It does not, however, mean that every single instance of the problem is hard. For example, if all the clauses in a 3SAT instance φ contain the same variable x_i (possibly in negated form), then by guessing a value to x_i we can reduce φ to a 2SAT instance which can then be efficiently solved. Generalizations of this simple idea are used in "SAT solvers", which are algorithms that have solved certain specific interesting SAT formulas with thousands of variables, despite the fact that we believe SAT to be exponentially hard in the worst case. Similarly, a lot of problems arising in economics and machine learning are NP-hard.¹⁰ And yet vendors and customers manage to figure out market-clearing prices (as economists like to point out, there is milk on the shelves) and mice succeed in distinguishing cats from dogs. Hence people (and machines) seem to regularly succeed in solving interesting instances of NP-hard problems, typically by using some combination of guessing while making local improvements.

¹⁰ Actually, the computational difficulty of problems in economics such as finding optimal (or any) equilibria is quite subtle. Some variants of such problems are NP-hard, while others have a certain "intermediate" complexity.

It is also true that there are many interesting instances of NP-hard problems that we do not currently know how to solve. Across all application areas, whether it is scientific computing, optimization, control or more, people often encounter hard instances of NP problems on which our current algorithms fail. In fact, as we will see, all of our digital security infrastructure relies on the fact that some concrete and easy-to-generate instances of, say, 3SAT (or, equivalently, any other NP-hard problem) are exponentially hard to solve.

Thus it would be wrong to say that NP is easy "in practice", nor would it be correct to take NP-hardness as the "final word" on the complexity of a problem, particularly when we have more information about how any given instance is generated. Understanding both the "typical complexity" of NP problems, as well as the power and limitations of certain heuristics (such as various local-search based algorithms) is a very active area of research. We will see more on these topics later in this course.¹¹

¹¹ Talk more about coping with NP hardness. Main two approaches are heuristics such as SAT solvers that succeed on some instances, and proxy measures such as mathematical relaxations that instead of solving problem X (e.g., an integer program) solve program X′ (e.g., a linear program) that is related to that. Maybe give compressed sensing as an example, and least square minimization as a proxy for maximum a posteriori probability.

16.9 WHAT IF P ≠ NP?

So, P = NP would give us all kinds of fantastical outcomes. But we strongly suspect that P ≠ NP, and moreover that there is no much-better-than-brute-force algorithm for 3SAT. If indeed that is the case, is it all bad news?

One might think that impossibility results, telling you that you cannot do something, is the kind of cloud that does not have a silver

lining. But in fact, as we already alluded to before, it does. A hard (in a sufficiently strong sense) problem in NP can be used to create a code that cannot be broken, a task that for thousands of years has been the dream of not just spies but of many scientists and mathematicians over the generations. But the complexity viewpoint turned out to yield much more than simple codes, achieving tasks that people had previously not even dared to dream of. These include the notion of public key cryptography, allowing two people to communicate securely without ever having exchanged a secret key; electronic cash, allowing private and secure transactions without a central authority; and secure multiparty computation, enabling parties to compute a joint function on private inputs without revealing any extra information about it. Also, as we will see, computational hardness can be used to replace the role of randomness in many settings.

Furthermore, while it is often convenient to pretend that computational problems are simply handed to us, and that our job as computer scientists is to find the most efficient algorithm for them, this is not how things work in most computing applications. Typically even formulating the problem to solve is a highly non-trivial task. When we discover that the problem we want to solve is NP-hard, this might be a useful sign that we used the wrong formulation for it.

Beyond all these, the quest to understand computational hardness, including the discoveries of lower bounds for restricted computational models, as well as new types of reductions (such as those arising from "probabilistically checkable proofs"), has already had surprising positive applications to problems in algorithm design, as well as in coding for both communication and storage. This is not surprising since, as we mentioned before, from group theory to the theory of relativity, the pursuit of impossibility results has often been one of the most fruitful enterprises of mankind.

Chapter Recap

• The question of whether P = NP is one of the most important and fascinating questions of computer science and science at large, touching on all fields of the natural and social sciences, as well as mathematics and engineering.
• Our current evidence and understanding supports the "SAT hard" scenario that there is no much-better-than-brute-force algorithm for 3SAT or many other NP-hard problems.
• We are very far from proving this, however. Researchers have studied proving lower bounds on the number of gates to compute explicit functions in restricted forms of circuits, and have made some

advances in this effort, along the way generating mathematical tools that have found other uses. However, we have made essentially no headway in proving lower bounds for general models of computation such as Boolean circuits and Turing machines. Indeed, we currently do not even know how to rule out the possibility that for every $n \in \mathbb{N}$, SAT restricted to $n$-length inputs has a Boolean circuit of less than $10n$ gates (even though there exist $n$-input functions that require at least $2^n/(10n)$ gates to compute).
• Understanding how to cope with this computational intractability, and even benefit from it, comprises much of the research in theoretical computer science.

16.10 EXERCISES

16.11 BIBLIOGRAPHICAL NOTES

As mentioned before, Aaronson's survey [Aar16] is a great exposition of the P vs NP problem. Another recommended survey by Aaronson is [Aar05], which discusses the question of whether NP complete problems could be computed by any physical means. The paper [BU11] discusses some results about problems in the polynomial hierarchy.

17 Space bounded computation

PLAN: Examples of space bounded algorithms, importance of preserving space. The classes L and PSPACE, space hierarchy theorem, PSPACE=NPSPACE, constant space = regular languages.

17.1 EXERCISES

17.2 BIBLIOGRAPHICAL NOTES


IV RANDOMIZED COMPUTATION

Learning Objectives:
• Review the basic notions of probability theory that we will use.
• Sample spaces, and in particular the space $\{0,1\}^n$.

• Events, and the operations of unions and intersections.
• Random variables and their expectation, variance, and standard deviation.
• Independence and correlation for both events and random variables.
• Markov, Chebyshev and Chernoff tail bounds (bounding the probability that a random variable will deviate from its expectation).

18 Probability Theory 101

“God doesn’t play dice with the universe”, Albert Einstein

“Einstein was doubly wrong … not only does God definitely play dice, but He sometimes confuses us by throwing them where they can’t be seen.”, Stephen Hawking

"The probability of winning a battle has no place in our theory because it does not belong to any [random experiment]. Probability cannot be applied to this problem any more than the physical concept of work can be applied to the 'work' done by an actor reciting his part.", Richard Von Mises, 1928 (paraphrased)

“I am unable to see why ‘objectivity’ requires us to interpret every probability as a frequency in some random experiment; particularly when in most problems probabilities are frequencies only in an imaginary universe invented just for the purpose of allowing a frequency interpretation.”, E.T. Jaynes, 1976

Before we show how to use randomness in algorithms, let us do a quick review of some basic notions in probability theory. This is not meant to replace a course on probability theory, and if you have not seen this material before, I highly recommend you look at additional resources to get up to speed. Fortunately, we will not need many of the advanced notions of probability theory, but, as we will see, even the so-called "simple" setting of tossing $n$ coins can lead to very subtle and interesting issues.

18.1 RANDOM COINS

The nature of randomness and probability is a topic of great philo- sophical, scientific and mathematical depth. Is there actual random- ness in the world, or does it proceed in a deterministic clockwork fash- ion from some initial conditions set at the beginning of time? Does probability refer to our uncertainty of beliefs, or to the frequency of occurrences in repeated experiments? How can we define probability over infinite sets?


These are all important questions that have been studied and debated by scientists, mathematicians, statisticians and philosophers. Fortunately, we will not need to deal directly with these questions here. We will be mostly interested in the setting of tossing $n$ random, unbiased and independent coins. Below we define the basic probabilistic objects of events and random variables when restricted to this setting. These can be defined for much more general probabilistic experiments or sample spaces, and later on we will briefly discuss how this can be done. However, the $n$-coin case is sufficient for almost everything we'll need in this course.

If instead of "heads" and "tails" we encode the sides of each coin by "zero" and "one", we can encode the result of tossing $n$ coins as a string in $\{0,1\}^n$. Each particular outcome $x \in \{0,1\}^n$ is obtained with probability $2^{-n}$. For example, if we toss three coins, then we obtain each of the 8 outcomes $000, 001, 010, 011, 100, 101, 110, 111$ with probability $2^{-3} = 1/8$ (see also Fig. 18.1). We can describe the experiment of tossing $n$ coins as choosing a string $x$ uniformly at random from $\{0,1\}^n$, and hence we'll use the shorthand $x \sim \{0,1\}^n$ for $x$ that is chosen according to this experiment.

Figure 18.1: The probabilistic experiment of tossing three coins corresponds to making $2 \times 2 \times 2 = 8$ choices, each with equal probability. In this example, the blue set corresponds to the event $A = \{x \in \{0,1\}^3 \mid x_0 = 0\}$ where the first coin toss is equal to $0$, and the pink set corresponds to the event $B = \{x \in \{0,1\}^3 \mid x_1 = 1\}$ where the second coin toss is equal to $1$ (with their intersection having a purplish color). As we can see, each of these events contains 4 elements (out of the 8 total) and so has probability $1/2$. The intersection of $A$ and $B$ contains two elements, and so the probability that both of these events occur is $2/8 = 1/4$.

An event is simply a subset $A$ of $\{0,1\}^n$. The probability of $A$, denoted by $\Pr_{x \sim \{0,1\}^n}[A]$ (or $\Pr[A]$ for short, when the sample space is understood from the context), is the probability that an $x$ chosen uniformly at random will be contained in $A$. Note that this is the same as $|A|/2^n$ (where as usual $|A|$ denotes the number of elements in the set $A$). For example, the probability that $x$ has an even number of ones is $\Pr[A]$ where $A = \{x : \sum_{i=0}^{n-1} x_i = 0 \mod 2\}$. In the case $n = 3$, $A = \{000, 011, 101, 110\}$, and hence $\Pr[A] = \frac{4}{8} = \frac{1}{2}$ (see Fig. 18.2). It turns out this is true for every $n$:

Lemma 18.1 For every $n > 0$,

$$\Pr_{x \sim \{0,1\}^n}\left[\textstyle\sum_{i=0}^{n-1} x_i \text{ is even}\right] = 1/2 \qquad (18.1)$$

Figure 18.2: The event that if we toss three coins $x_0, x_1, x_2 \in \{0,1\}$ then the sum of the $x_i$'s is even has probability $1/2$ since it corresponds to exactly $4$ out of the $8$ possible strings of length $3$.

P To test your intuition on probability, try to stop here and prove the lemma on your own.

Proof of Lemma 18.1. We prove the lemma by induction on $n$. For the case $n = 1$ it is clear since $x = 0$ is even and $x = 1$ is odd, and hence the probability that $x \in \{0,1\}$ is even is $1/2$. Let $n > 1$. We assume by induction that the lemma is true for $n - 1$ and we will prove it for $n$. We split the set $\{0,1\}^n$ into four disjoint sets $E_0, E_1, O_0, O_1$, where for $b \in \{0,1\}$, $E_b$ is defined as the set of $x \in \{0,1\}^n$ such that $x_0 \cdots x_{n-2}$ has an even number of ones and $x_{n-1} = b$, and similarly $O_b$ is the set of $x \in \{0,1\}^n$ such that $x_0 \cdots x_{n-2}$ has an odd number of ones and $x_{n-1} = b$. Since $E_0$ is obtained by simply extending an $(n-1)$-length string with an even number of ones by the digit $0$, the size of $E_0$ is simply the number of such $(n-1)$-length strings, which by the induction hypothesis is $2^{n-1}/2 = 2^{n-2}$. The same reasoning applies for $E_1$, $O_0$, and $O_1$. Hence each one of the four sets $E_0, E_1, O_0, O_1$ is of size $2^{n-2}$. Since $x \in \{0,1\}^n$ has an even number of ones if and only if $x \in E_0 \cup O_1$ (i.e., either the first $n-1$ coordinates sum up to an even number and the final coordinate is $0$, or the first $n-1$ coordinates sum up to an odd number and the final coordinate is $1$), we get that the probability that $x$ satisfies this property is

$$\frac{|E_0 \cup O_1|}{2^n} = \frac{2^{n-2} + 2^{n-2}}{2^n} = \frac{1}{2}, \qquad (18.2)$$

using the fact that $E_0$ and $O_1$ are disjoint and hence $|E_0 \cup O_1| = |E_0| + |O_1|$. ■

We can also use the intersection ($\cap$) and union ($\cup$) operators to talk about the probability of both event $A$ and event $B$ happening, or the probability of event $A$ or event $B$ happening. For example, the probability $p$ that $x$ has an even number of ones and $x_0 = 1$ is the same as $\Pr[A \cap B]$ where $A = \{x \in \{0,1\}^n : \sum_{i=0}^{n-1} x_i = 0 \mod 2\}$ and $B = \{x \in \{0,1\}^n : x_0 = 1\}$. This probability is equal to $1/4$ for $n > 1$. (It is a great exercise for you to pause here and verify that you understand why this is the case.)

Because intersection corresponds to considering the logical AND of the conditions that two events happen, while union corresponds to considering the logical OR, we will sometimes use the $\wedge$ and $\vee$ operators instead of $\cap$ and $\cup$, and so write the probability $p = \Pr[A \cap B]$ defined above also as

$$\Pr_{x \sim \{0,1\}^n}\left[\textstyle\sum_i x_i = 0 \mod 2 \ \wedge\ x_0 = 1\right]. \qquad (18.3)$$
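Since the sample space $\{0,1\}^n$ is finite, all of these probabilities can be checked by direct enumeration. The following Python sketch (our own illustration, not part of the book's code) confirms Lemma 18.1 and the $1/4$ probability computed above for small values of $n$.

```python
from itertools import product

def prob(event, n):
    """Probability of `event` (a predicate on tuples in {0,1}^n)
    under the uniform distribution, computed by direct enumeration."""
    outcomes = list(product([0, 1], repeat=n))
    return sum(1 for x in outcomes if event(x)) / len(outcomes)

# Lemma 18.1: the sum of the coins is even with probability exactly 1/2.
for n in range(1, 8):
    assert prob(lambda x: sum(x) % 2 == 0, n) == 1 / 2

# Eq. (18.3): Pr[sum is even AND x_0 = 1] = 1/4 for n > 1.
print(prob(lambda x: sum(x) % 2 == 0 and x[0] == 1, n=3))  # prints 0.25
```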

푛 푖 0 푥∼{0,1} [∑ 푥 = 0 2 ∧ 푥 = 1] . If is an event,푖 then corresponds to the event that does푛 not happen. Since 푛 ⧵ , we get that 퐴 ⊆ {0, 1} 퐴 = {0, 1}푛 퐴 Pr Pr (18.4) 퐴 푛 |퐴| = 2 − |퐴| |퐴|푛 2 −|퐴|푛 |퐴|푛 This makes sense:[퐴] since = 2 =happens2 = if 1and − 2 only= if 1 − does[퐴]not happen, the probability of should be one minus the probability of . 퐴 퐴 퐴 퐴 R 524 introduction to theoretical computer science

Remark 18.2 — Remember the sample space. While the above definition might seem very simple and almost trivial, the human mind seems not to have evolved for probabilistic reasoning, and it is surprising how often people can get even the simplest settings of probability wrong. One way to make sure you don't get confused when trying to calculate probability statements is to always ask yourself the following two questions: (1) Do I understand what is the sample space that this probability is taken over?, and (2) Do I understand what is the definition of the event that we are analyzing?

For example, suppose that I were to randomize seating in my course, and then it turned out that students sitting in row 7 performed better on the final: how surprising should we find this? If we started out with the hypothesis that there is something special about the number 7 and chose it ahead of time, then the event that we are discussing is the event $A$ that students sitting in row number 7 had better performance on the final, and we might find it surprising. However, if we first looked at the results and then chose the row whose average performance is best, then the event we are discussing is the event $B$ that there exists some row where the performance is higher than the overall average. $B$ is a superset of $A$, and its probability (even if there is no correlation between sitting and performance) can be quite significant.

18.1.1 Random variables
Events correspond to Yes/No questions, but often we want to analyze finer questions. For example, if we make a bet at the roulette wheel, we don't want to just analyze whether we won or lost, but also how much we've gained. A (real valued) random variable is simply a way to associate a number with the result of a probabilistic experiment. Formally, a random variable is a function $X: \{0,1\}^n \to \mathbb{R}$ that maps every outcome $x \in \{0,1\}^n$ to an element $X(x) \in \mathbb{R}$. For example, the function $SUM: \{0,1\}^n \to \mathbb{R}$ that maps $x$ to the sum of its coordinates (i.e., to $\sum_{i=0}^{n-1} x_i$) is a random variable.

The expectation of a random variable $X$, denoted by $\mathbb{E}[X]$, is the average value that this number takes, taken over all draws from the probabilistic experiment. In other words, the expectation of $X$ is defined as follows:

$$\mathbb{E}[X] = \sum_{x \in \{0,1\}^n} 2^{-n} X(x). \qquad (18.5)$$

If $X$ and $Y$ are random variables, then we can define $X + Y$ as simply the random variable that maps a point $x \in \{0,1\}^n$ to $X(x) + Y(x)$. One basic and very useful property of the expectation is that it is linear:

Lemma 18.3 — Linearity of expectation.

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \qquad (18.6)$$

Proof.

$$\mathbb{E}[X + Y] = \sum_{x \in \{0,1\}^n} 2^{-n}\left(X(x) + Y(x)\right) = \sum_{x \in \{0,1\}^n} 2^{-n} X(x) + \sum_{x \in \{0,1\}^n} 2^{-n} Y(x) = \mathbb{E}[X] + \mathbb{E}[Y]. \qquad (18.7)$$
■

Similarly, $\mathbb{E}[kX] = k\,\mathbb{E}[X]$ for every $k \in \mathbb{R}$.

Solved Exercise 18.1 — Expectation of sum. Let $X: \{0,1\}^n \to \mathbb{R}$ be the random variable that maps $x \in \{0,1\}^n$ to $x_0 + x_1 + \ldots + x_{n-1}$. Prove that $\mathbb{E}[X] = n/2$. ■

Solution:
We can solve this using the linearity of expectation. We can define random variables $X_0, X_1, \ldots, X_{n-1}$ such that $X_i(x) = x_i$. Since each $x_i$ equals $1$ with probability $1/2$ and $0$ with probability $1/2$, $\mathbb{E}[X_i] = 1/2$. Since $X = \sum_{i=0}^{n-1} X_i$, by the linearity of expectation

$$\mathbb{E}[X] = \mathbb{E}[X_0] + \mathbb{E}[X_1] + \cdots + \mathbb{E}[X_{n-1}] = \frac{n}{2}. \qquad (18.8)$$
■

P If you have not seen discrete probability before, please go over this argument again until you are sure you follow it; it is a prototypical simple example of the type of reasoning we will employ again and again in this course.
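As a quick sanity check, the following sketch (again ours, reusing the enumeration idea from the previous snippet) computes the expectation of the sum of $n$ coins by brute force and confirms the $n/2$ answer of Solved Exercise 18.1.

```python
from itertools import product

def expectation(X, n):
    """E[X] under the uniform distribution on {0,1}^n, by enumeration."""
    outcomes = list(product([0, 1], repeat=n))
    return sum(X(x) for x in outcomes) / len(outcomes)

# Solved Exercise 18.1: E[x_0 + ... + x_{n-1}] = n/2.
for n in range(1, 8):
    assert expectation(sum, n) == n / 2
```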

If $A$ is an event, then $1_A$ is the random variable such that $1_A(x)$ equals $1$ if $x \in A$, and $1_A(x) = 0$ otherwise. Note that $\Pr[A] = \mathbb{E}[1_A]$ (can you see why?). Using this and the linearity of expectation, we can show one of the most useful bounds in probability theory:

Lemma 18.4 — Union bound. For every two events $A, B$, $\Pr[A \cup B] \le \Pr[A] + \Pr[B]$.

P Before looking at the proof, try to see why the union bound makes intuitive sense. We can also prove it directly from the definition of probabilities and the cardinality of sets, together with the equation $|A \cup B| \le |A| + |B|$. Can you see why the latter equation is true? (See also Fig. 18.3.)

Proof of Lemma 18.4. For every $x$, the variable $1_{A \cup B}(x) \le 1_A(x) + 1_B(x)$. Hence, $\Pr[A \cup B] = \mathbb{E}[1_{A \cup B}] \le \mathbb{E}[1_A + 1_B] = \mathbb{E}[1_A] + \mathbb{E}[1_B] = \Pr[A] + \Pr[B]$. ■

The way we often use this in theoretical computer science is to argue that, for example, if there is a list of 100 bad events that can happen, and each one of them happens with probability at most $1/10000$, then with probability at least $1 - 100/10000 = 0.99$, no bad event happens.

Figure 18.3: The union bound tells us that the probability of $A$ or $B$ happening is at most the sum of the individual probabilities. We can see it by noting that for every two sets $|A \cup B| \le |A| + |B|$ (with equality only if $A$ and $B$ have no intersection).

18.1.2 Distributions over strings
While most of the time we think of random variables as having as output a real number, we sometimes consider random variables whose output is a string. That is, we can think of a map $Y: \{0,1\}^n \to \{0,1\}^*$ and consider the "random variable" $Y$ such that for every $y \in \{0,1\}^*$, the probability that $Y$ outputs $y$ is equal to $\frac{1}{2^n}\left|\{x \in \{0,1\}^n \mid Y(x) = y\}\right|$. To avoid confusion, we will typically refer to such string-valued random variables as distributions over strings. So, a distribution $Y$ over strings $\{0,1\}^*$ can be thought of as a finite collection of strings $y_0, \ldots, y_{M-1} \in \{0,1\}^*$ and probabilities $p_0, \ldots, p_{M-1}$ (which are non-negative numbers summing up to one), so that $\Pr[Y = y_i] = p_i$.

Two distributions $Y$ and $Y'$ are identical if they assign the same probability to every string. For example, consider the following two functions $Y, Y': \{0,1\}^2 \to \{0,1\}^2$. For every $x \in \{0,1\}^2$, we define $Y(x) = x$ and $Y'(x) = x_0(x_0 \oplus x_1)$ where $\oplus$ is the XOR operation. Although these are two different functions, they induce the same distribution over $\{0,1\}^2$ when invoked on a uniform input. The distribution of $Y(x)$ for $x \sim \{0,1\}^2$ is of course the uniform distribution over $\{0,1\}^2$. On the other hand $Y'$ is simply the map $00 \mapsto 00$, $01 \mapsto 01$, $10 \mapsto 11$, $11 \mapsto 10$, which is a permutation of $\{0,1\}^2$.
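The claim that $Y$ and $Y'$ induce identical distributions can also be verified mechanically. Here is a small sketch of ours that tabulates the output frequencies of both maps over all uniform inputs.

```python
from collections import Counter
from itertools import product

def Y(x):
    # the identity map on {0,1}^2
    return x

def Y_prime(x):
    # maps (x0, x1) to (x0, x0 XOR x1)
    return (x[0], x[0] ^ x[1])

inputs = list(product([0, 1], repeat=2))
assert Counter(map(Y, inputs)) == Counter(map(Y_prime, inputs))
# Both maps induce the uniform distribution over {0,1}^2.
```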

18.1.3 More general sample spaces
While throughout most of this book we assume that the underlying probabilistic experiment corresponds to tossing $n$ independent coins, all the claims we make easily generalize to sampling $x$ from a more general finite or countable set $S$ (and not-so-easily generalize to uncountable sets as well). A probability distribution over a finite set $S$ is simply a function $\mu: S \to [0,1]$ such that $\sum_{x \in S} \mu(x) = 1$. We think of this as the experiment where we obtain every $x \in S$ with probability $\mu(x)$, and sometimes denote this as $x \sim \mu$. In particular, tossing $n$ random coins corresponds to the probability distribution $\mu: \{0,1\}^n \to [0,1]$ defined as $\mu(x) = 2^{-n}$ for every $x \in \{0,1\}^n$. An event $A$ is a subset of $S$, and the probability of $A$, which we denote by $\Pr_\mu[A]$, is $\sum_{x \in A} \mu(x)$. A random variable is a function $X: S \to \mathbb{R}$, where the probability that $X = y$ is equal to $\sum_{x \in S \text{ s.t. } X(x) = y} \mu(x)$.

18.2 CORRELATIONS AND INDEPENDENCE

One of the most delicate but important concepts in probability is the notion of independence (and the opposing notion of correlations). Subtle correlations are often behind surprises and errors in probability and statistical analysis, and several mistaken predictions have been blamed on miscalculating the correlations between, say, housing prices in Florida and Arizona, or voter preferences in Ohio and Michigan. See also Joe Blitzstein's aptly named talk "Conditioning is the Soul of Statistics". (Another thorny issue is of course the difference between correlation and causation. Luckily, this is another point we don't need to worry about in our clean setting of tossing $n$ coins.)

Two events $A$ and $B$ are independent if the fact that $A$ happens makes $B$ neither more nor less likely to happen. For example, if we think of the experiment of tossing $3$ random coins $x \in \{0,1\}^3$, and we let $A$ be the event that $x_0 = 1$ and $B$ the event that $x_0 + x_1 + x_2 \ge 2$, then if $A$ happens it is more likely that $B$ happens, and hence these events are not independent. On the other hand, if we let $C$ be the event that $x_1 = 1$, then because the second coin toss is not affected by the result of the first one, the events $A$ and $C$ are independent.

The formal definition is that events $A$ and $B$ are independent if $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. If $\Pr[A \cap B] > \Pr[A] \cdot \Pr[B]$ then we say that $A$ and $B$ are positively correlated, while if $\Pr[A \cap B] < \Pr[A] \cdot \Pr[B]$ then we say that $A$ and $B$ are negatively correlated (see Fig. 18.4).

If we consider the above examples on the experiment of choosing $x \in \{0,1\}^3$ then we can see that

$$\Pr[x_0 = 1] = \frac{1}{2}, \quad \Pr[x_0 + x_1 + x_2 \ge 2] = \Pr[\{011, 101, 110, 111\}] = \frac{4}{8} = \frac{1}{2}, \qquad (18.9)$$

but

$$\Pr[x_0 = 1 \wedge x_0 + x_1 + x_2 \ge 2] = \Pr[\{101, 110, 111\}] = \frac{3}{8} > \frac{1}{2} \cdot \frac{1}{2}, \qquad (18.10)$$

and hence, as we already observed, the events $\{x_0 = 1\}$ and $\{x_0 + x_1 + x_2 \ge 2\}$ are not independent and in fact are positively correlated. On the other hand, $\Pr[x_0 = 1 \wedge x_1 = 1] = \Pr[\{110, 111\}] = \frac{2}{8} = \frac{1}{2} \cdot \frac{1}{2}$, and hence the events $\{x_0 = 1\}$ and $\{x_1 = 1\}$ are indeed independent.

Figure 18.4: Two events $A$ and $B$ are independent if $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. In the two figures above, the empty $x \times x$ square is the sample space, and $A$ and $B$ are two events in this sample space. In the left figure, $A$ and $B$ are independent, while in the right figure they are negatively correlated, since $B$ is less likely to occur if we condition on $A$ (and vice versa). Mathematically, one can see this by noticing that in the left figure the areas of $A$ and $B$ respectively are $a \cdot x$ and $b \cdot x$, and so their probabilities are $\frac{a \cdot x}{x^2} = \frac{a}{x}$ and $\frac{b \cdot x}{x^2} = \frac{b}{x}$ respectively, while the area of $A \cap B$ is $a \cdot b$, which corresponds to the probability $\frac{a \cdot b}{x^2}$. In the right figure, the area of the triangle $B$ is $\frac{b \cdot x}{2}$, which corresponds to a probability of $\frac{b}{2x}$, but the area of $A \cap B$ is $\frac{b' \cdot a}{2}$ for some $b' < b$. This means that the probability of $A \cap B$ is $\frac{b' \cdot a}{2x^2} < \frac{b}{2x} \cdot \frac{a}{x}$, or in other words $\Pr[A \cap B] < \Pr[A] \cdot \Pr[B]$.

Remark 18.5 — Disjointness vs independence. People sometimes confuse the notions of disjointness and independence, but these are actually quite different. Two events $A$ and $B$ are disjoint if $A \cap B = \emptyset$, which means that if $A$ happens then $B$ definitely does not happen. They are independent if $\Pr[A \cap B] = \Pr[A]\Pr[B]$, which means that knowing that $A$ happens gives us no information about whether $B$ happened or not. If $A$ and $B$ have nonzero probability, then being disjoint implies that they are not independent, since in particular it means that they are negatively correlated.

Conditional probability: If $A$ and $B$ are events, and $A$ happens with nonzero probability, then we define the probability that $B$ happens conditioned on $A$ to be $\Pr[B|A] = \Pr[A \cap B]/\Pr[A]$. This corresponds to calculating the probability that $B$ happens if we already know that $A$ happened. Note that $A$ and $B$ are independent if and only if $\Pr[B|A] = \Pr[B]$.

More than two events: We can generalize this definition to more than two events. We say that events $A_1, \ldots, A_k$ are mutually independent if knowing that any set of them occurred or didn't occur does not change the probability that an event outside the set occurs. Formally, the condition is that for every subset $I \subseteq [k]$,

$$\Pr\left[\wedge_{i \in I} A_i\right] = \prod_{i \in I} \Pr[A_i]. \qquad (18.11)$$

For example, if $x \sim \{0,1\}^3$, then the events $\{x_0 = 1\}$, $\{x_1 = 1\}$ and $\{x_2 = 1\}$ are mutually independent. On the other hand, the events $\{x_0 = 1\}$, $\{x_1 = 1\}$ and $\{x_0 + x_1 = 0 \mod 2\}$ are not mutually independent, even though every pair of these events is independent (can you see why? see also Fig. 18.5).

Figure 18.5: Consider the sample space $\{0,1\}^3$ and the events $A, B, C, D, E$ corresponding to $A$: $x_0 = 1$, $B$: $x_1 = 1$, $C$: $x_0 + x_1 + x_2 \ge 2$, $D$: $x_0 + x_1 + x_2 = 0 \mod 2$, $E$: $x_0 + x_1 = 0 \mod 2$. We can see that $A$ and $B$ are independent, $C$ is positively correlated with $A$ and positively correlated with $B$, the three events $A, B, D$ are mutually independent, and while every pair out of $A, B, E$ is independent, the three events $A, B, E$ are not mutually independent since their intersection has probability $\frac{2}{8} = \frac{1}{4}$ instead of $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$.
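To see concretely that these three events are pairwise independent but not mutually independent, the following sketch (ours) checks the defining products by enumerating $\{0,1\}^3$.

```python
from itertools import product

outcomes = list(product([0, 1], repeat=3))

def pr(event):
    return sum(1 for x in outcomes if event(x)) / len(outcomes)

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
E = lambda x: (x[0] + x[1]) % 2 == 0

# Every pair is independent:
assert pr(lambda x: A(x) and B(x)) == pr(A) * pr(B)
assert pr(lambda x: A(x) and E(x)) == pr(A) * pr(E)
assert pr(lambda x: B(x) and E(x)) == pr(B) * pr(E)

# ...but the three events are not mutually independent:
assert pr(lambda x: A(x) and B(x) and E(x)) != pr(A) * pr(B) * pr(E)
```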

18.2.1 Independent random variables
We say that two random variables $X: \{0,1\}^n \to \mathbb{R}$ and $Y: \{0,1\}^n \to \mathbb{R}$ are independent if for every $u, v \in \mathbb{R}$, the events $\{X = u\}$ and $\{Y = v\}$ are independent. (We use $\{X = u\}$ as shorthand for $\{x \mid X(x) = u\}$.) In other words, $X$ and $Y$ are independent if $\Pr[X = u \wedge Y = v] = \Pr[X = u]\Pr[Y = v]$ for every $u, v \in \mathbb{R}$. For example, if two random variables depend on the result of tossing different coins then they are independent:

Lemma 18.6 Suppose that $S = \{s_0, \ldots, s_{k-1}\}$ and $T = \{t_0, \ldots, t_{m-1}\}$ are disjoint subsets of $\{0, \ldots, n-1\}$ and let $X, Y: \{0,1\}^n \to \mathbb{R}$ be random variables such that $X = F(x_{s_0}, \ldots, x_{s_{k-1}})$ and $Y = G(x_{t_0}, \ldots, x_{t_{m-1}})$ for some functions $F: \{0,1\}^k \to \mathbb{R}$ and $G: \{0,1\}^m \to \mathbb{R}$. Then $X$ and $Y$ are independent.

P The notation in the lemma's statement is a bit cumbersome, but at the end of the day, it simply says that if $X$ and $Y$ are random variables that depend on two disjoint sets $S$ and $T$ of coins (for example, $X$ might be the sum of the first $n/2$ coins, and $Y$ might be the largest consecutive stretch of zeroes in the second $n/2$ coins), then they are independent.

Proof of Lemma 18.6. Let $a, b \in \mathbb{R}$, and let $A = \{x \in \{0,1\}^k : F(x) = a\}$ and $B = \{x \in \{0,1\}^m : G(x) = b\}$. Since $S$ and $T$ are disjoint, we can reorder the indices so that $S = \{0, \ldots, k-1\}$ and $T = \{k, \ldots, k+m-1\}$ without affecting any of the probabilities. Hence we can write $\Pr[X = a \wedge Y = b] = |C|/2^n$ where $C = \{x_0, \ldots, x_{n-1} : (x_0, \ldots, x_{k-1}) \in A \wedge (x_k, \ldots, x_{k+m-1}) \in B\}$. Another way to write this using string concatenation is that $C = \{xyz : x \in A, y \in B, z \in \{0,1\}^{n-k-m}\}$, and hence $|C| = |A||B|2^{n-k-m}$, which means that

$$\frac{|C|}{2^n} = \frac{|A|}{2^k} \cdot \frac{|B|}{2^m} \cdot \frac{2^{n-k-m}}{2^{n-k-m}} = \Pr[X = a]\Pr[Y = b]. \qquad (18.12)$$
■

If $X$ and $Y$ are independent random variables then (letting $S_X, S_Y$ denote the sets of all numbers that have positive probability of being the output of $X$ and $Y$, respectively):

$$\mathbb{E}[XY] = \sum_{a \in S_X, b \in S_Y} \Pr[X = a \wedge Y = b] \cdot ab \overset{(1)}{=} \sum_{a \in S_X, b \in S_Y} \Pr[X = a]\Pr[Y = b] \cdot ab \overset{(2)}{=} \left(\sum_{a \in S_X} \Pr[X = a] \cdot a\right)\left(\sum_{b \in S_Y} \Pr[Y = b] \cdot b\right) \overset{(3)}{=} \mathbb{E}[X]\,\mathbb{E}[Y], \qquad (18.13)$$

where the first equality (1) follows from the independence of $X$ and $Y$, the second equality (2) follows by "opening the parentheses" of the righthand side, and the third equality (3) follows from the definition of expectation. (This is not an "if and only if"; see Exercise 18.3.)

Another useful fact is that if $X$ and $Y$ are independent random variables, then so are $F(X)$ and $G(Y)$ for all functions $F, G: \mathbb{R} \to \mathbb{R}$. This is intuitively true since learning $F(X)$ can only provide us with less information than does learning $X$ itself. Hence, if learning $X$ does not teach us anything about $Y$ (and so also about $F(Y)$) then

neither will learning $F(X)$. Indeed, to prove this we can write for every $a, b \in \mathbb{R}$:

$$\Pr[F(X) = a \wedge G(Y) = b] = \sum_{x \text{ s.t. } F(x)=a,\ y \text{ s.t. } G(y)=b} \Pr[X = x \wedge Y = y] = \sum_{x \text{ s.t. } F(x)=a,\ y \text{ s.t. } G(y)=b} \Pr[X = x]\Pr[Y = y] = \left(\sum_{x \text{ s.t. } F(x)=a} \Pr[X = x]\right) \cdot \left(\sum_{y \text{ s.t. } G(y)=b} \Pr[Y = y]\right) = \Pr[F(X) = a]\Pr[G(Y) = b]. \qquad (18.14)$$

18.2.2 Collections of independent random variables
We can extend the notions of independence to more than two random variables: we say that the random variables $X_0, \ldots, X_{n-1}$ are mutually independent if for every $a_0, \ldots, a_{n-1} \in \mathbb{R}$,

$$\Pr[X_0 = a_0 \wedge \cdots \wedge X_{n-1} = a_{n-1}] = \Pr[X_0 = a_0] \cdots \Pr[X_{n-1} = a_{n-1}]. \qquad (18.15)$$

And similarly, we have that

Lemma 18.7 — Expectation of product of independent random variables. If $X_0, \ldots, X_{n-1}$ are mutually independent then

$$\mathbb{E}\left[\prod_{i=0}^{n-1} X_i\right] = \prod_{i=0}^{n-1} \mathbb{E}[X_i]. \qquad (18.16)$$

Lemma 18.8 — Functions preserve independence. If $X_0, \ldots, X_{n-1}$ are mutually independent, and $Y_0, \ldots, Y_{n-1}$ are defined as $Y_i = F_i(X_i)$ for some functions $F_0, \ldots, F_{n-1}: \mathbb{R} \to \mathbb{R}$, then $Y_0, \ldots, Y_{n-1}$ are mutually independent as well.

P We leave proving Lemma 18.7 and Lemma 18.8 as Exercise 18.6 and Exercise 18.7. It is a good idea for you to stop now and do these exercises to make sure you are comfortable with the notion of independence, as we will use it heavily later on in this course.

18.3 CONCENTRATION AND TAIL BOUNDS

The name "expectation" is somewhat misleading. For example, suppose that you and I place a bet on the outcome of 10 coin tosses, where if they all come out to be $1$'s then I pay you 100,000 dollars and otherwise you pay me 10 dollars. If we let $X: \{0,1\}^{10} \to \mathbb{R}$ be the random variable denoting your gain, then we see that

$$\mathbb{E}[X] = 2^{-10} \cdot 100000 - (1 - 2^{-10}) \cdot 10 \sim 90. \qquad (18.17)$$

But we don't really "expect" the result of this experiment to be for you to gain 90 dollars. Rather, 99.9% of the time you will pay me 10 dollars, and you will hit the jackpot about 0.1% of the times. However, if we repeat this experiment again and again (with fresh and hence independent coins), then in the long run we do expect your average earning to be close to 90 dollars, which is the reason why casinos can make money in a predictable way even though every individual bet is random. For example, if we toss $n$ independent and unbiased coins, then as $n$ grows, the number of coins that come up ones will be more and more concentrated around $n/2$ according to the famous "bell curve" (see Fig. 18.6).

Figure 18.6: The probabilities that we obtain a particular sum when we toss $n = 10, 20, 100, 1000$ coins converge quickly to the Gaussian/normal distribution.

Much of probability theory is concerned with so called concentration or tail bounds, which are upper bounds on the probability that a random variable $X$ deviates too much from its expectation. The first and simplest one of them is Markov's inequality:

Theorem 18.9 — Markov's inequality. If $X$ is a non-negative random variable then for every $k > 1$, $\Pr[X \ge k\,\mathbb{E}[X]] \le 1/k$.

P Markov's Inequality is actually a very natural statement (see also Fig. 18.7). For example, if you know that the average (not the median!) household income in the US is 70,000 dollars, then in particular you can deduce that at most 25 percent of households make more than 280,000 dollars, since otherwise, even if the remaining 75 percent had zero income, the top 25 percent alone would cause the average income to be larger than 70,000 dollars. From this example you can already see that in many situations, Markov's inequality will not be tight and the probability of deviating from expectation will be much smaller: see the Chebyshev and Chernoff inequalities below.

Proof of Theorem 18.9. Let $\mu = \mathbb{E}[X]$ and define $Y = 1_{X \ge k\mu}$. That is, $Y(x) = 1$ if $X(x) \ge k\mu$ and $Y(x) = 0$ otherwise. Note that by definition, for every $x$, $Y(x) \le X(x)/(k\mu)$. We need to show $\mathbb{E}[Y] \le 1/k$. But this follows since $\mathbb{E}[Y] \le \mathbb{E}[X/(k\mu)] = \mathbb{E}[X]/(k\mu) = \mu/(k\mu) = 1/k$. ■

Figure 18.7: Markov's Inequality tells us that a non-negative random variable $X$ cannot be much larger than its expectation, with high probability. For example, if the expectation of $X$ is $\mu$, then the probability that $X > 4\mu$ must be at most $1/4$, as otherwise just the contribution from this part of the sample space will be too large.
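Markov's inequality is easy to observe numerically. The sketch below (our own; the exponential distribution is just a convenient skewed example) estimates $\Pr[X \ge k\,\mathbb{E}[X]]$ by sampling and compares it to the $1/k$ bound.

```python
import random

random.seed(0)
# A skewed non-negative random variable: exponential with mean 1.
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)

for k in [2, 4, 10]:
    frac = sum(1 for s in samples if s >= k * mean) / len(samples)
    print(f"Pr[X >= {k} E[X]] ~ {frac:.4f}   (Markov bound: {1/k:.4f})")
```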

The averaging principle. While the expectation of a random variable $X$ is hardly always the "typical value", we can show that $X$ is guaranteed to achieve a value that is at least its expectation with positive probability. For example, if the average grade in an exam is $87$ points, at least one student got a grade of $87$ or more on the exam. This is known as the averaging principle, and despite its simplicity it is surprisingly useful.

Lemma 18.10 Let $X$ be a random variable. Then $\Pr[X \ge \mathbb{E}[X]] > 0$.

Proof. Suppose towards the sake of contradiction that $\Pr[X < \mathbb{E}[X]] = 1$. Then the random variable $Y = \mathbb{E}[X] - X$ is always positive. By linearity of expectation $\mathbb{E}[Y] = \mathbb{E}[X] - \mathbb{E}[X] = 0$. Yet by Markov, a non-negative random variable $Y$ with $\mathbb{E}[Y] = 0$ must equal $0$ with probability $1$, since the probability that $Y > k \cdot 0 = 0$ is at most $1/k$ for every $k > 1$. Hence we get a contradiction to the assumption that $Y$ is always positive. ■

18.3.1 Chebyshev's Inequality
Markov's inequality says that a (non-negative) random variable $X$ can't go too crazy and be, say, a million times its expectation, with significant probability. But ideally we would like to say that with high probability, $X$ should be very close to its expectation, e.g., in the range $[0.99\mu, 1.01\mu]$ where $\mu = \mathbb{E}[X]$. In such a case we say that $X$ is concentrated, and hence its expectation (i.e., mean) will be close to its median and other ways of measuring $X$'s "typical value". Chebyshev's inequality can be thought of as saying that $X$ is concentrated if it has a small standard deviation.

A standard way to measure the deviation of a random variable from its expectation is by using its standard deviation. For a random variable $X$, we define the variance of $X$ as $\text{Var}[X] = \mathbb{E}[(X - \mu)^2]$ where $\mu = \mathbb{E}[X]$; i.e., the variance is the average squared distance of $X$ from its expectation. The standard deviation of $X$ is defined as $\sigma[X] = \sqrt{\text{Var}[X]}$. (This is well-defined since the variance, being an average of a square, is always a non-negative number.) Using Chebyshev's inequality, we can control the probability that a random variable is too many standard deviations away from its expectation.

Theorem 18.11 — Chebyshev's inequality. Suppose that $\mu = \mathbb{E}[X]$ and $\sigma^2 = \text{Var}[X]$. Then for every $k > 0$, $\Pr[|X - \mu| \ge k\sigma] \le 1/k^2$.

Proof. The proof follows from Markov's inequality. We define the random variable $Y = (X - \mu)^2$. Then $\mathbb{E}[Y] = \text{Var}[X] = \sigma^2$, and hence by Markov the probability that $Y > k^2\sigma^2$ is at most $1/k^2$. But clearly $(X - \mu)^2 \ge k^2\sigma^2$ if and only if $|X - \mu| \ge k\sigma$. ■

One example of how to use Chebyshev's inequality is the setting when $X = X_1 + \cdots + X_n$ where the $X_i$'s are independent and identically distributed (i.i.d for short) variables with values in $[0,1]$ where each has expectation $1/2$. Since $\mathbb{E}[X] = \sum_i \mathbb{E}[X_i] = n/2$, we would like to say that $X$ is very likely to be in, say, the interval $[0.499n, 0.501n]$. Using Markov's inequality directly will not help us, since it will only tell us that $X$ is very likely to be at most $100n$ (which we already knew, since it always lies between $0$ and $n$). However, since $X_1, \ldots, X_n$ are independent,

$$\text{Var}[X_1 + \cdots + X_n] = \text{Var}[X_1] + \cdots + \text{Var}[X_n]. \qquad (18.18)$$

(We leave showing this to the reader as Exercise 18.8.) For every random variable $X_i$ in $[0,1]$, $\text{Var}[X_i] \le 1$ (if the variable is always in $[0,1]$, it can't be more than $1$ away from its expectation), and hence (18.18) implies that $\text{Var}[X] \le n$ and hence $\sigma[X] \le \sqrt{n}$. For large $n$, $\sqrt{n} \ll 0.001n$, and in particular if $\sqrt{n} \le 0.001n/k$, we can use Chebyshev's inequality to bound the probability that $X$ is not in $[0.499n, 0.501n]$ by $1/k^2$.

18.3.2 The Chernoff bound
Chebyshev's inequality already shows a connection between independence and concentration, but in many cases we can hope for a quantitatively much stronger result. If, as in the example above, $X = X_1 + \ldots + X_n$ where the $X_i$'s are bounded i.i.d random variables of mean $1/2$, then as $n$ grows, the distribution of $X$ would be roughly the normal or Gaussian distribution, that is, distributed according to the bell curve (see Fig. 18.6 and Fig. 18.8). This distribution has the property of being very concentrated, in the sense that the probability of deviating $k$ standard deviations from the mean is not merely $1/k^2$ as is guaranteed by Chebyshev, but rather is roughly $e^{-k^2}$. Specifically, for a normal random variable $X$ of expectation $\mu$ and standard deviation $\sigma$, the probability that $|X - \mu| \ge k\sigma$ is at most $2e^{-k^2/2}$. That is, we have an exponential decay of the probability of deviation.

Figure 18.8: In the normal distribution or the Bell curve, the probability of deviating $k$ standard deviations from the expectation shrinks exponentially in $k^2$, and specifically with probability at least $1 - 2e^{-k^2/2}$, a random variable $X$ of expectation $\mu$ and standard deviation $\sigma$ satisfies $\mu - k\sigma \le X \le \mu + k\sigma$. This figure gives more precise bounds for $k = 1, 2, 3, 4, 5, 6$. (Image credit: Imran Baghirov)

The following extremely useful theorem shows that such exponential decay occurs every time we have a sum of independent and bounded variables. This theorem is known under many names in different communities, though it is mostly called the Chernoff bound in the computer science literature:

Theorem 18.12 — Chernoff/Hoeffding bound. If $X_0, \ldots, X_{n-1}$ are i.i.d random variables such that $X_i \in [0,1]$ and $\mathbb{E}[X_i] = p$ for every $i$, then for every $\epsilon > 0$

$$\Pr\left[\left|\sum_{i=0}^{n-1} X_i - pn\right| > \epsilon n\right] \le 2 \cdot e^{-2\epsilon^2 n}. \qquad (18.19)$$

We omit the proof, which appears in many texts, and uses Markov's inequality on i.i.d random variables $Y_0, \ldots, Y_n$ that are of the form $Y_i = e^{\lambda X_i}$ for some carefully chosen parameter $\lambda$. See Exercise 18.11 for a proof of the simple (but highly useful and representative) case where each $X_i$ is $\{0,1\}$ valued and $p = 1/2$. (See also Exercise 18.12 for a generalization.)

Remark 18.13 — Slight simplification of Chernoff. Since $e$ is roughly $2.7$ (and in particular larger than $2$), (18.19) would still be true if we replaced its righthand side with $e^{-2\epsilon^2 n + 1}$. For $n > 1/\epsilon^2$, the equation will still be true if we replaced the righthand side with the simpler $e^{-\epsilon^2 n}$. Hence we will sometimes use the Chernoff bound as stating that for $X_0, \ldots, X_{n-1}$ and $p$ as above, if $n > 1/\epsilon^2$ then

$$\Pr\left[\left|\sum_{i=0}^{n-1} X_i - pn\right| > \epsilon n\right] \le e^{-\epsilon^2 n}. \qquad (18.20)$$
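The exponential decay promised by the Chernoff bound is also easy to observe empirically. The sketch below (ours, with arbitrarily chosen parameters) tosses $n$ fair coins many times and compares the empirical deviation probability with the righthand side of (18.19).

```python
import math
import random

random.seed(0)
n, trials, eps = 1000, 10_000, 0.05

deviations = 0
for _ in range(trials):
    s = sum(random.randint(0, 1) for _ in range(n))  # sum of n fair coins
    if abs(s - n / 2) > eps * n:
        deviations += 1

empirical = deviations / trials
bound = 2 * math.exp(-2 * eps**2 * n)  # righthand side of (18.19) with p = 1/2
print(f"empirical: {empirical:.5f}, Chernoff bound: {bound:.5f}")
```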

18.3.3 Application: Supervised learning and empirical risk minimization
Here is a nice application of the Chernoff bound. Consider the task of supervised learning. You are given a set $S$ of $n$ samples of the form $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ drawn from some unknown distribution $D$ over pairs $(x, y)$. For simplicity we will assume that $x_i \in \{0,1\}^m$ and $y_i \in \{0,1\}$. (We use here the concept of a general distribution over the finite set $\{0,1\}^{m+1}$ as discussed in Section 18.1.3.) The goal is to find a classifier $h: \{0,1\}^m \to \{0,1\}$ that will minimize the test error, which is the probability that $h(x) \ne y$ where $(x,y)$ is drawn from the distribution $D$. That is, $L(h) = \Pr_{(x,y) \sim D}[h(x) \ne y]$.

One way to find such a classifier is to consider a collection $\mathcal{C}$ of potential classifiers and look at the classifier $h$ in $\mathcal{C}$ that does best on the training set $S$. The classifier $h$ is known as the empirical risk minimizer (see also Section 12.1.6). The Chernoff bound can be used to show that as long as the number $n$ of samples is sufficiently larger than the logarithm of $|\mathcal{C}|$, the test error $L(h)$ will be close to its training error $\hat{L}_S(h)$, which is defined as the fraction of pairs $(x_i, y_i) \in S$ that it fails to classify. (Equivalently, $\hat{L}_S(h) = \frac{1}{n}\sum_{i \in [n]} |h(x_i) - y_i|$.)

Theorem 18.14 — Generalization of ERM. Let $D$ be any distribution over pairs $(x,y) \in \{0,1\}^{m+1}$ and $\mathcal{C}$ be any set of functions mapping $\{0,1\}^m$ to $\{0,1\}$. Then for every $\epsilon, \delta > 0$, if $n > \frac{\log|\mathcal{C}|\log(1/\delta)}{\epsilon^2}$ and $S = (x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ is a set of samples that are drawn independently from $D$ then

$$\Pr_S\left[\forall_{h \in \mathcal{C}}\ |L(h) - \hat{L}_S(h)| \le \epsilon\right] > 1 - \delta, \qquad (18.21)$$

where the probability is taken over the choice of the set of samples $S$. In particular if $|\mathcal{C}| \le 2^k$ and $n > k\log(1/\delta)/\epsilon^2$ then with probability at least $1 - \delta$, the classifier $h^* \in \mathcal{C}$ that minimizes the empirical error $\hat{L}_S(h^*)$ satisfies $L(h^*) \le \hat{L}_S(h^*) + \epsilon$, and hence its test error is at most $\epsilon$ worse than its training error.

Proof Idea:
The idea is to combine the Chernoff bound with the union bound. Let $k = \log|\mathcal{C}|$. We first use the Chernoff bound to show that for every fixed $h \in \mathcal{C}$, if we choose $S$ at random then the probability that $|L(h) - \hat{L}_S(h)| > \epsilon$ will be smaller than $\delta/2^k$. We can then use the union bound over all the $2^k$ members of $\mathcal{C}$ to show that this will be the case for every $h$. ⋆

Proof of Theorem 18.14. Set $k = \log|\mathcal{C}|$ and so $n > k\log(1/\delta)/\epsilon^2$. We start by making the following claim:

CLAIM: For every $h \in \mathcal{C}$, the probability over $S$ that $|L(h) - \hat{L}_S(h)| \ge \epsilon$ is smaller than $\delta/2^k$.

We prove the claim using the Chernoff bound. Specifically, for every such $h$, let us define a collection of random variables $X_0, \ldots, X_{n-1}$ as follows:

$$X_i = \begin{cases} 1 & h(x_i) \ne y_i \\ 0 & \text{otherwise} \end{cases}. \qquad (18.22)$$

Since the samples $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ are drawn independently from the same distribution $D$, the random variables $X_0, \ldots, X_{n-1}$ are independently and identically distributed. Moreover, for every $i$, $\mathbb{E}[X_i] = L(h)$. Hence by the Chernoff bound (see (18.20)), the probability that $|\sum_{i=0}^{n-1} X_i - n \cdot L(h)| \ge \epsilon n$ is at most $e^{-\epsilon^2 n} < e^{-k\log(1/\delta)} < \delta/2^k$ (using the fact that $e > 2$). Since $\hat{L}_S(h) = \frac{1}{n}\sum_{i \in [n]} X_i$, this completes the proof of the claim.

Given the claim, the theorem follows from the union bound. Indeed, for every $h \in \mathcal{C}$, define the "bad event" $B_h$ to be the event (over the choice of $S$) that $|L(h) - \hat{L}_S(h)| > \epsilon$. By the claim $\Pr[B_h] < \delta/2^k$, and hence by the union bound the probability that the union of $B_h$ for all $h \in \mathcal{C}$ happens is smaller than $|\mathcal{C}|\delta/2^k = \delta$. If for every $h \in \mathcal{C}$, $B_h$ does not happen, it means that for every $h \in \mathcal{C}$, $|L(h) - \hat{L}_S(h)| \le \epsilon$, and so the probability of the latter event is larger than $1 - \delta$, which is what we wanted to prove. ■
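To get a feel for Theorem 18.14, here is a toy simulation (entirely our own construction: the distribution $D$ and the small class of "dictator" classifiers are hypothetical choices, not from the book). With $n$ much larger than $\log|\mathcal{C}|$, the gap between training and held-out error is small for every classifier in the class simultaneously.

```python
import random

random.seed(0)
m, n = 8, 5000  # input length and number of samples

def draw():
    """A hypothetical distribution D: x is uniform in {0,1}^m and the
    label is x_0 flipped with probability 0.1."""
    x = tuple(random.randint(0, 1) for _ in range(m))
    y = x[0] ^ (random.random() < 0.1)
    return x, y

# A small class C of "dictator" classifiers h_j(x) = x_j, so log|C| = 3.
classifiers = [lambda x, j=j: x[j] for j in range(m)]

train = [draw() for _ in range(n)]
held_out = [draw() for _ in range(n)]  # fresh samples, approximating L(h)

def err(h, data):
    return sum(1 for x, y in data if h(x) != y) / len(data)

for j, h in enumerate(classifiers):
    print(f"h_{j}: train {err(h, train):.3f}, held-out {err(h, held_out):.3f}")
```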

Chapter Recap

• A basic probabilistic experiment corresponds to tossing $n$ coins or choosing $x$ uniformly at random from $\{0,1\}^n$.
• Random variables assign a real number to every result of a coin toss. The expectation of a random variable $X$ is its average value.
• There are several concentration results, also known as tail bounds, showing that under certain conditions, random variables deviate significantly from their expectation only with small probability.

18.4 EXERCISES

Exercise 18.1 Suppose that we toss three independent fair coins $a, b, c \in \{0,1\}$. What is the probability that the XOR of $a$, $b$, and $c$ is equal to $1$? What is the probability that the AND of these three values is equal to $1$? Are these two events independent? ■

Exercise 18.2 Give an example of random variables $X, Y: \{0,1\}^3 \to \mathbb{R}$ such that $\mathbb{E}[XY] \ne \mathbb{E}[X]\,\mathbb{E}[Y]$. ■

Exercise 18.3 Give an example of random variables $X, Y: \{0,1\}^3 \to \mathbb{R}$ such that $X$ and $Y$ are not independent but $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$. ■

Exercise 18.4 Let $n$ be an odd number, and let $X: \{0,1\}^n \to \mathbb{R}$ be the random variable defined as follows: for every $x \in \{0,1\}^n$, $X(x) = 1$ if $\sum_{i=0}^{n-1} x_i > n/2$ and $X(x) = 0$ otherwise. Prove that $\mathbb{E}[X] = 1/2$. ■

Exercise 18.5 — standard deviation.
1. Give an example for a random variable $X$ such that $X$'s standard deviation is equal to $\mathbb{E}[|X - \mathbb{E}[X]|]$.
2. Give an example for a random variable $X$ such that $X$'s standard deviation is not equal to $\mathbb{E}[|X - \mathbb{E}[X]|]$. ■

Exercise 18.6 — Product of expectations. Prove Lemma 18.7. ■

Exercise 18.7 — Transformations preserve independence. Prove Lemma 18.8 ■

Exercise 18.8 — Variance of independent random variables. Prove that if $X_0, \ldots, X_{n-1}$ are independent random variables then $\text{Var}[X_0 + \cdots + X_{n-1}] = \sum_{i=0}^{n-1} \text{Var}[X_i]$. ■

Exercise 18.9 — Entropy (challenge). Recall the definition of a distribution $\mu$ over some finite set $S$. Shannon defined the entropy of a distribution $\mu$, denoted by $H(\mu)$, to be $\sum_{x \in S} \mu(x)\log(1/\mu(x))$. The idea is that if $\mu$ is a distribution of entropy $k$, then encoding members of $\mu$ will require $k$ bits, in an amortized sense. In this exercise we justify this definition. Let $\mu$ be such that $H(\mu) = k$.
1. Prove that for every one-to-one function $F: S \to \{0,1\}^*$, $\mathbb{E}_{x \sim \mu}|F(x)| \ge k$.
2. Prove that for every $\epsilon$, there is some $n$ and a one-to-one function $F: S^n \to \{0,1\}^*$, such that $\mathbb{E}_{x \sim \mu^n}|F(x)| \le n(k + \epsilon)$, where $x \sim \mu^n$ denotes the experiment of choosing $x_0, \ldots, x_{n-1}$ each independently from $S$ using the distribution $\mu$. ■

Exercise 18.10 — Entropy approximation to binomial. Let $H(p) = p\log(1/p) + (1-p)\log(1/(1-p))$. (While you don't need this to solve this exercise, this is the function that maps $p$ to the entropy, as defined in Exercise 18.9, of the $p$-biased coin distribution over $\{0,1\}$, which is the function $\mu: \{0,1\} \to [0,1]$ s.t. $\mu(0) = 1-p$ and $\mu(1) = p$.) Prove that for every $p \in (0,1)$ and $\epsilon > 0$, if $n$ is large enough then

$$2^{(H(p)-\epsilon)n} \le \binom{n}{pn} \le 2^{(H(p)+\epsilon)n}, \qquad (18.23)$$

where $\binom{n}{k}$ is the binomial coefficient $\frac{n!}{k!(n-k)!}$, which is equal to the number of $k$-size subsets of $\{0, \ldots, n-1\}$. (Hint: Use Stirling's formula for approximating the factorial function.) ■

Exercise 18.11 — Chernoff using Stirling.
1. Prove that $\Pr_{x \sim \{0,1\}^n}[\sum_i x_i = k] = \binom{n}{k}2^{-n}$.
2. Use this and Exercise 18.10 to prove (an approximate version of) the Chernoff bound for the case that $X_0, \ldots, X_{n-1}$ are i.i.d. random variables over $\{0,1\}$ each equaling $0$ and $1$ with probability $1/2$. That is, prove that for every $\epsilon > 0$, and $X_0, \ldots, X_{n-1}$ as above, $\Pr\left[\left|\sum_{i=0}^{n-1} X_i - \frac{n}{2}\right| > \epsilon n\right] < 2^{-0.1 \cdot \epsilon^2 n}$. ■

Exercise 18.12 — Poor man's Chernoff. Exercise 18.11 establishes the Chernoff bound for the case that $X_0, \ldots, X_{n-1}$ are i.i.d variables over $\{0,1\}$ with expectation $1/2$. In this exercise we use a slightly different method (bounding the moments of the random variables) to establish a version of Chernoff where the random variables range over $[0,1]$ and their expectation is some number $p \in [0,1]$ that may be different than $1/2$. Let $X_0, \ldots, X_{n-1}$ be i.i.d random variables with $\mathbb{E}[X_i] = p$ and $\Pr[0 \le X_i \le 1] = 1$. Define $Y_i = X_i - p$.
1. Prove that for every $j_0, \ldots, j_{n-1} \in \mathbb{N}$, if there exists one $i$ such that $j_i$ is odd then $\mathbb{E}[\prod_{i=0}^{n-1} Y_i^{j_i}] = 0$.
2. Prove that for every $k$, $\mathbb{E}[(\sum_{i=0}^{n-1} Y_i)^k] \le (10kn)^{k/2}$. (Hint: Bound the number of tuples $j_0, \ldots, j_{n-1}$ such that every $j_i$ is even and $\sum_i j_i = k$ using the binomial coefficient and the fact that in any such tuple there are at most $k/2$ distinct indices.)
3. Prove that for every $\epsilon > 0$, $\Pr[|\sum_i Y_i| \ge \epsilon n] \le 2^{-\epsilon^2 n/(10000\log(1/\epsilon))}$. (Hint: Set $k = 2\lceil \epsilon^2 n/1000 \rceil$ and then show that if the event $|\sum_i Y_i| \ge \epsilon n$ happens then the random variable $(\sum_i Y_i)^k$ is a factor of $\epsilon^{-k}$ larger than its expectation.) ■

Exercise 18.13 — Sampling. Suppose that a country has 300,000,000 citizens, 52 percent of which prefer the color "green" and 48 percent of which prefer the color "orange". Suppose we sample $n$ random citizens and ask them their favorite color (assume they will answer truthfully). What is the smallest value of $n$ among the following choices so that the probability that the majority of the sample does not answer "green" is at most $0.05$?
a. 1,000
b. 10,000
c. 100,000
d. 1,000,000 ■

Exercise 18.14 Would the answer to Exercise 18.13 change if the country had 300,000,000,000 citizens? ■

Exercise 18.15 — Sampling (2). Under the same assumptions as Exercise 18.13, what is the smallest value of $n$ among the following choices so that the probability that the majority of the sample does not answer "green" is at most $2^{-100}$?
a. 1,000
b. 10,000
c. 100,000

d. 1,000,000
e. It is impossible to get such a low probability since there are fewer than $2^{100}$ citizens. ■

18.5 BIBLIOGRAPHICAL NOTES

There are many sources for more information on discrete probability, including the texts referenced in Section 1.9. One particularly recom- mended source for probability is Harvard’s STAT 110 class, whose lectures are available on youtube and whose book is available online. The version of the Chernoff bound that we stated in Theorem 18.12 is sometimes known as Hoeffding’s Inequality. Other variants of the Chernoff bound are known as well, but all of them are equally good for the applications of this book.

Learning Objectives: • See examples of randomized algorithms • Get more comfort with analyzing probabilistic processes and tail bounds • Success amplification using tail bounds

19 Probabilistic computation

"in 1946 .. (I asked myself) what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method … might not be to lay it out say one hundred times and simply observe and count", Stanislaw Ulam, 1983

“The salient features of our method are that it is probabilistic … and with a controllable miniscule probability of error.”, Michael Rabin, 1977

In early computer systems, much effort was taken to drive out randomness and noise. Hardware components were prone to non-deterministic behavior from a number of causes, whether it was vacuum tubes overheating or actual physical bugs causing short circuits (see Fig. 19.1). This motivated John von Neumann, one of the early computing pioneers, to write a paper on how to error correct computation, introducing the notion of redundancy.

Figure 19.1: A 1947 entry in the log book of the Harvard MARK II computer containing an actual bug that caused a hardware malfunction. Courtesy of the Naval Surface Warfare Center.

So it is quite surprising that randomness turned out to be not just a hindrance but also a resource for computation, enabling us to achieve tasks much more efficiently than previously known. One of the first applications involved the very same John von Neumann. While he was sick in bed and playing cards, Stan Ulam came up with the observation that calculating statistics of a system could be done much faster by running several randomized simulations. He mentioned this idea to von Neumann, who became very excited about it; indeed, it turned out to be crucial for the neutron transport calculations that were needed for development of the Atom bomb and later on the hydrogen bomb. Because this project was highly classified, Ulam, von Neumann and their collaborators came up with the codeword "Monte Carlo" for this approach (based on the famous casinos where Ulam's uncle gambled). The name stuck, and randomized algorithms are known as Monte Carlo algorithms to this day. (Some texts also talk about "Las Vegas algorithms" that always return the right answer but whose running time is only polynomial on the average. Since this Monte Carlo vs Las Vegas terminology is confusing, we will not use these terms anymore, and simply talk about randomized algorithms.)

In this chapter, we will see some examples of randomized algorithms that use randomness to compute a quantity in a faster or


simpler way than was known otherwise. We will describe the algorithms in an informal / "pseudo-code" way, rather than as NAND-TM or NAND-RAM programs or Turing machines. In Chapter 20 we will discuss how to augment the computational models we saw before to incorporate the ability to "toss coins".

19.1 FINDING APPROXIMATELY GOOD MAXIMUM CUTS

We start with the following example. Recall the maximum cut problem of finding, given a graph $G = (V, E)$, the cut that maximizes the number of edges. This problem is NP-hard, which means that we do not know of any efficient algorithm that can solve it, but randomization enables a simple algorithm that can cut at least half of the edges:

Theorem 19.1 — Approximating max cut. There is an efficient probabilistic algorithm that on input an $n$-vertex $m$-edge graph $G$, outputs a cut $(S, \bar{S})$ that cuts at least $m/2$ of the edges of $G$ in expectation.

Proof Idea:
We simply choose a random cut: we choose a subset $S$ of vertices by choosing every vertex $v$ to be a member of $S$ with probability $1/2$ independently. It's not hard to see that each edge is cut with probability $1/2$ and so the expected number of cut edges is $m/2$. ⋆

Proof of Theorem 19.1. The algorithm is extremely simple:

Algorithm Random Cut:

Input: Graph $G = (V, E)$ with $n$ vertices and $m$ edges. Denote $V = \{v_0, v_1, \ldots, v_{n-1}\}$.
Operation:
1. Pick $x$ uniformly at random in $\{0,1\}^n$.
2. Let $S \subseteq V$ be the set $\{v_i : x_i = 1, i \in [n]\}$ that includes all vertices corresponding to coordinates of $x$ where $x_i = 1$.
3. Output the cut $(S, \bar{S})$.

We claim that the expected number of edges cut by the algorithm is $m/2$. Indeed, for every edge $e \in E$, let $X_e$ be the random variable such that $X_e(x) = 1$ if the edge $e$ is cut by $x$, and $X_e(x) = 0$ otherwise. For every such edge $e = \{i, j\}$, $X_e(x) = 1$ if and only if $x_i \ne x_j$. Since the pair $(x_i, x_j)$ obtains each of the values $00, 01, 10, 11$ with probability $1/4$, the probability that $x_i \ne x_j$ is $1/2$ and hence $\mathbb{E}[X_e] = 1/2$. If we let $X$ be the random variable corresponding to the total number of edges cut by $S$, then $X = \sum_{e \in E} X_e$ and hence by linearity of expectation

$$\mathbb{E}[X] = \sum_{e \in E} \mathbb{E}[X_e] = m \cdot (1/2) = m/2. \qquad (19.1)$$
■

Randomized algorithms work in the worst case. It is tempting to think of a randomized algorithm such as the one of Theorem 19.1 as an algorithm that works for a "random input graph", but it is actually much better than that. The expectation in this theorem is not taken over the choice of the graph, but rather only over the random choices of the algorithm. In particular, for every graph $G$, the algorithm is guaranteed to cut half of the edges of the input graph in expectation. That is,

Big Idea 24: A randomized algorithm outputs the correct value with good probability on every possible input.

We will define more formally what “good probability” means in Chapter 20 but the crucial point is that this probability is always only taken over the random choices of the algorithm, while the input is not chosen at random.
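Here is a direct Python rendering of Algorithm Random Cut (a minimal sketch of the pseudocode above; representing the graph as a list of edges over vertices $0, \ldots, n-1$ is our own choice).

```python
import random

def random_cut(n, edges):
    """Pick a uniformly random subset S of the n vertices and return
    it together with the number of edges that cross the cut."""
    x = [random.randint(0, 1) for _ in range(n)]
    S = {i for i in range(n) if x[i] == 1}
    cut = sum(1 for (i, j) in edges if x[i] != x[j])
    return S, cut

# Example: a 4-cycle has m = 4 edges, so the expected cut size is m/2 = 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(random_cut(4, edges))
```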

19.1.1 Amplifying the success of randomized algorithms
Theorem 19.1 gives us an algorithm that cuts $m/2$ edges in expectation. But, as we saw before, expectation does not immediately imply concentration, and so a priori, it may be the case that when we run the algorithm, most of the time we don't get a cut matching the expectation. Luckily, we can amplify the probability of success by repeating the process several times and outputting the best cut we find. We start by arguing that the probability the algorithm above succeeds in cutting at least $m/2$ edges is not too tiny.

Lemma 19.2 The probability that a random cut in an $m$-edge graph cuts at least $m/2$ edges is at least $1/(2m)$.

Proof Idea:
To see the idea behind the proof, think of the case that $m = 1000$. In this case one can show that we will cut at least $500$ edges with probability at least $0.001$ (and so in particular larger than $1/(2m) = 1/2000$). Specifically, if we assume otherwise, then this means that with probability more than $0.999$ the algorithm cuts $499$ or fewer edges. But since we can never cut more than the total of $1000$ edges, given this assumption, the highest value of the expected number of edges cut is if we cut exactly $499$ edges with probability $0.999$ and cut $1000$ edges with probability $0.001$. Yet even in this case the expected number of edges will

be $0.999 \cdot 499 + 0.001 \cdot 1000 < 500$, which contradicts the fact that we've calculated the expectation to be at least $500$ in Theorem 19.1. ⋆

Proof of Lemma 19.2. Let $p$ be the probability that we cut at least $m/2$ edges and suppose, towards a contradiction, that $p < 1/(2m)$. Since the number of edges cut is an integer, and $m/2$ is a multiple of $0.5$, by definition of $p$, with probability $1 - p$ we cut at most $m/2 - 0.5$ edges. Moreover, since we can never cut more than $m$ edges, under our assumption that $p < 1/(2m)$, we can bound the expected number of edges cut by

$$pm + (1 - p)(m/2 - 0.5) \le pm + m/2 - 0.5. \qquad (19.2)$$

But if $p < 1/(2m)$ then $pm < 0.5$ and so the righthand side is smaller than $m/2$, which contradicts the fact that (as proven in Theorem 19.1) the expected number of edges cut is at least $m/2$. ■

19.1.2 Success amplification
Lemma 19.2 shows that our algorithm succeeds at least some of the time, but we'd like to succeed almost all of the time. The approach to do that is to simply repeat our algorithm many times, with fresh randomness each time, and output the best cut we get in one of these repetitions. It turns out that with extremely high probability we will get a cut of size at least $m/2$. For example, if we repeat this experiment $2000m$ times, then (using the inequality $(1 - 1/k)^k \le 1/e \le 1/2$) we can show that the probability that we will never cut at least $m/2$ edges is at most

$$(1 - 1/(2m))^{2000m} \le 2^{-1000}. \qquad (19.3)$$

More generally, the same calculations can be used to show the following lemma:

Lemma 19.3 There is an algorithm that on input a graph $G = (V, E)$ and a number $k$, runs in time polynomial in $|V|$ and $k$ and outputs a cut $(S, \bar{S})$ such that

$$\Pr[\text{number of edges cut by } (S, \bar{S}) \ge |E|/2] \ge 1 - 2^{-k}. \qquad (19.4)$$

Proof of Lemma 19.3. The algorithm will work as follows:

Algorithm AMPLIFY RANDOM CUT:
Input: Graph $G = (V, E)$ with $n$ vertices and $m$ edges. Denote $V = \{v_0, v_1, \ldots, v_{n-1}\}$. Number $k > 0$.
Operation:
1. Repeat the following $200km$ times:
   a. Pick $x$ uniformly at random in $\{0,1\}^n$.
   b. Let $S \subseteq V$ be the set $\{v_i : x_i = 1, i \in [n]\}$ that includes all vertices corresponding to coordinates of $x$ where $x_i = 1$.
   c. If $(S, \bar{S})$ cuts at least $m/2$ edges then halt and output $(S, \bar{S})$.
2. Output "failed".

We leave completing the analysis as an exercise to the reader (see Exercise 19.1). ■
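In code, the amplification step is just a loop around the random_cut sketch given earlier (this is our rendering of the pseudocode above, under the same edge-list representation):

```python
def amplify_random_cut(n, edges, k):
    """Repeat Random Cut 200*k*m times and output the first cut of size
    at least m/2, or "failed" if none is found (cf. Lemma 19.3)."""
    m = len(edges)
    for _ in range(200 * k * m):
        S, cut = random_cut(n, edges)
        if cut >= m / 2:
            return S, cut
    return "failed"
```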

19.1.3 Two-sided amplification
The analysis above relied on the fact that the maximum cut algorithm has one-sided error. By this we mean that if we get a cut of size at least $m/2$ then we know we have succeeded. This is common for randomized algorithms, but is not the only case. In particular, consider the task of computing some Boolean function $F: \{0,1\}^* \to \{0,1\}$. A randomized algorithm $A$ for computing $F$, given input $x$, might toss coins and succeed in outputting $F(x)$ with probability, say, $0.9$. We say that $A$ has two-sided errors if there is positive probability that $A(x)$ outputs $1$ when $F(x) = 0$, and positive probability that $A(x)$ outputs $0$ when $F(x) = 1$.

In such a case, to amplify $A$'s success, we cannot simply repeat it $k$ times and output $1$ if a single one of those repetitions resulted in $1$, nor can we output $0$ if a single one of the repetitions resulted in $0$. But we can output the majority value of these repetitions. By the Chernoff bound (Theorem 18.12), with probability exponentially close to $1$ (i.e., $1 - 2^{-\Omega(k)}$), the fraction of the repetitions where $A$ will output $F(x)$ will be at least, say, $0.89$, and in such cases we will of course output the correct answer.

The above translates into the following theorem:

Theorem 19.4 — Two-sided amplification. If $F: \{0,1\}^* \to \{0,1\}$ is a function such that there is a polynomial-time algorithm $A$ satisfying

$$\Pr[A(x) = F(x)] \ge 0.51 \qquad (19.5)$$

for every $x \in \{0,1\}^*$, then there is a polynomial-time algorithm $B$ satisfying

$$\Pr[B(x) = F(x)] \ge 1 - 2^{-|x|} \qquad (19.6)$$

for every $x \in \{0,1\}^*$.

We omit the proof of Theorem 19.4, since we will prove a more general result later on in Theorem 20.5.
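The majority-vote amplification behind Theorem 19.4 also translates directly into code. In the sketch below (ours), A stands for any randomized algorithm with success probability at least $0.51$ on every input; by the Chernoff bound, the majority of $k$ independent runs is wrong with probability $2^{-\Omega(k)}$. (Use an odd $k$ to avoid ties.)

```python
from collections import Counter

def amplified(A, x, k):
    """Run the two-sided-error randomized algorithm A on input x for k
    independent repetitions and output the most common answer."""
    votes = Counter(A(x) for _ in range(k))
    return votes.most_common(1)[0][0]
```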

19.1.4 What does this mean?

We have shown a probabilistic algorithm that on any $m$-edge graph $G$, will output a cut of at least $m/2$ edges with probability at least $1 - 2^{-1000}$. Does it mean that we can consider this problem as "easy"? Should we be somewhat wary of using a probabilistic algorithm, since it can sometimes fail?

First of all, it is important to emphasize that this is still a worst case guarantee. That is, we are not assuming anything about the input graph: the probability is only due to the internal randomness of the algorithm. While a probabilistic algorithm might not seem as nice as a deterministic algorithm that is guaranteed to give an output, to get a sense of what a failure probability of $2^{-1000}$ means, note that:

• The chance of winning the Massachusetts Mega Million lottery is one over $(75)^5 \cdot 15$, which is roughly $2^{-35}$. So $2^{-1000}$ corresponds to winning the lottery about $30$ times in a row, at which point you might not care so much about your algorithm failing.

• The chance for a U.S. resident to be struck by lightning is about $1/700000$, which corresponds to about a $2^{-45}$ chance that you'll be struck by lightning the very second that you're reading this sentence (after which again you might not care so much about the algorithm's performance).

• Since the earth is about 5 billion years old, we can estimate the chance that an asteroid of the magnitude that caused the dinosaurs' extinction will hit us this very second to be about $2^{-60}$. It is quite likely that even a deterministic algorithm will fail if this happens.

So, in practical terms, a probabilistic algorithm is just as good as a deterministic one. But it is still a theoretically fascinating question whether randomized algorithms actually yield more power, or whether it is the case that for any computational problem that can be solved by a probabilistic algorithm, there is a deterministic algorithm with nearly the same performance. For example, we will see in Exercise 19.2 that there is in fact a deterministic algorithm that can cut at least $m/2$ edges in an $m$-edge graph. We will discuss this question in generality in Chapter 20. For now, let us see a couple of examples where randomization leads to algorithms that are better in some sense than the known deterministic algorithms.

[Footnote 2: This question does have some significance to practice, since hardware that generates high quality randomness at speed is nontrivial to construct.]

19.1.5 Solving SAT through randomization

The 3SAT problem is NP hard, and so it is unlikely that it has a polynomial (or even subexponential) time algorithm. But this does not mean that we can't do at least somewhat better than the trivial $2^n$ algorithm for $n$-variable 3SAT. The best known worst-case algorithms

for 3SAT are randomized, and are related to the following simple algorithm, variants of which are also used in practice:

Algorithm WalkSAT:

Input: An $n$ variable 3CNF formula $\varphi$.

Parameters: $T, S \in \mathbb{N}$

Operation:

1. Repeat the following $T$ steps:

 a. Choose a random assignment $x \in \{0,1\}^n$ and repeat the following for $S$ steps:

  1. If $x$ satisfies $\varphi$ then output $x$.

  2. Otherwise, choose a random clause $(\ell_i \vee \ell_j \vee \ell_k)$ that $x$ does not satisfy, choose a random literal in $\ell_i, \ell_j, \ell_k$ and modify $x$ to satisfy this literal.

2. If all the $T \cdot S$ repetitions above did not result in a satisfying assignment then output Unsatisfiable.

The running time of this algorithm is $S \cdot T \cdot poly(n)$, and so the key question is how small we can make $S$ and $T$ so that the probability that WalkSAT outputs Unsatisfiable on a satisfiable formula $\varphi$ is small. It is known that we can do so with $ST = \tilde{O}((4/3)^n) = \tilde{O}(1.333\ldots^n)$ (see Exercise 19.4 for a weaker result), but we'll show below a simpler analysis yielding $ST = \tilde{O}(\sqrt{3}^n) = \tilde{O}(1.74^n)$, which is still much better than the trivial $2^n$ bound.

[Footnote 3: At the time of this writing, the best known randomized algorithms for 3SAT run in time roughly $O(1.308^n)$, and the best known deterministic algorithms run in time $O(1.3303^n)$ in the worst case.]

Theorem 19.5 — WalkSAT simple analysis. If we set $T = 100 \cdot \sqrt{3}^n$ and $S = n/2$, then the probability we output Unsatisfiable for a satisfiable $\varphi$ is at most $1/2$.

Proof. Suppose that $\varphi$ is a satisfiable formula and let $x^*$ be a satisfying assignment for it. For every $x \in \{0,1\}^n$, denote by $\Delta(x, x^*)$ the number of coordinates that differ between $x$ and $x^*$. The heart of the proof is the following claim:

Claim I: For every $x, x^*$ as above, in every local improvement step, the value $\Delta(x, x^*)$ is decreased by one with probability at least $1/3$.

Proof of Claim I: Since $x^*$ is a satisfying assignment, if $C$ is a clause that $x$ does not satisfy, then at least one of the variables involved in $C$ must get different values in $x$ and $x^*$. Thus when we change $x$ by one of the three literals in the clause, we have probability at least $1/3$ of decreasing the distance.

The second claim is that our starting point is not that bad:

Claim II: With probability at least $1/2$ over a random $x \in \{0,1\}^n$, $\Delta(x, x^*) \leq n/2$.

Proof of Claim II: Consider the map FLIP $: \{0,1\}^n \to \{0,1\}^n$ that simply "flips" all the bits of its input from $0$ to $1$ and vice versa. That is, FLIP$(x_0, \ldots, x_{n-1}) = (1 - x_0, \ldots, 1 - x_{n-1})$. Clearly FLIP is one to one. Moreover, if $x$ is of distance $k$ to $x^*$, then FLIP$(x)$ is of distance $n - k$ to $x^*$. Now let $B$ be the "bad event" in which $x$ is of distance $> n/2$ from $x^*$. Then the set $A =$ FLIP$(B) = \{$FLIP$(x) : x \in B\}$ satisfies $|A| = |B|$ and that if $x \in A$ then $x$ is of distance $< n/2$ from $x^*$. Since $A$ and $B$ are disjoint events, $\Pr[A] + \Pr[B] \leq 1$. Since they have the same cardinality, they have the same probability and so we get that $2\Pr[B] \leq 1$ or $\Pr[B] \leq 1/2$. (See also Fig. 19.2.)

Claims I and II imply that each of the $T$ iterations of the outer loop succeeds with probability at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$. Indeed, by Claim II, the original guess $x$ will satisfy $\Delta(x, x^*) \leq n/2$ with probability $\Pr[\Delta(x, x^*) \leq n/2] \geq 1/2$. By Claim I, even conditioned on all the history so far, for each of the $S = n/2$ steps of the inner loop we have probability at least $1/3$ of being "lucky" and decreasing the distance (i.e. the output of $\Delta$) by one. The chance we will be lucky in all $n/2$ steps is hence at least $(1/3)^{n/2} = \sqrt{3}^{-n}$.

Since any single iteration of the outer loop succeeds with probability at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$, the probability that we never do so in $T = 100\sqrt{3}^n$ repetitions is at most $(1 - \frac{1}{2\sqrt{3}^n})^{100 \cdot \sqrt{3}^n} \leq (1/e)^{50}$. ■
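The loop structure above translates into a short Python sketch. The clause encoding (a positive literal $v$ stands for variable $v-1$, a negative one for its negation, in the style of DIMACS) and the function name are our own illustrative choices:

    import random

    def walksat(clauses, n, T, S):
        def satisfies(x, clause):
            # literal v > 0 is satisfied when x[v-1] = 1; v < 0 when x[|v|-1] = 0
            return any((lit > 0) == bool(x[abs(lit) - 1]) for lit in clause)

        for _ in range(T):
            x = [random.randint(0, 1) for _ in range(n)]       # random restart
            for _ in range(S):
                unsat = [c for c in clauses if not satisfies(x, c)]
                if not unsat:
                    return x                                   # satisfying assignment
                lit = random.choice(random.choice(unsat))      # random clause, then literal
                x[abs(lit) - 1] = 1 if lit > 0 else 0          # modify x to satisfy it
        return "Unsatisfiable"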

19.1.6 Bipartite matching

The matching problem is one of the canonical optimization problems, arising in all kinds of applications: matching residents with hospitals, kidney donors with patients, flights with crews, and many others. One prototypical variant is bipartite perfect matching. In this problem, we are given a bipartite graph $G = (L \cup R, E)$ which has $2n$ vertices partitioned into $n$-sized sets $L$ and $R$, where all edges have one endpoint in $L$ and the other in $R$. The goal is to determine whether there is a perfect matching, a subset $M \subseteq E$ of $n$ disjoint edges. That is, $M$ matches every vertex in $L$ to a unique vertex in $R$.

[Figure 19.2: For every $x^* \in \{0,1\}^n$, we can sort all strings in $\{0,1\}^n$ according to their distance from $x^*$ (top to bottom in the figure), where we let $A = \{x \in \{0,1\}^n : dist(x, x^*) \leq n/2\}$ be the "top half" of strings. If we define FLIP $: \{0,1\}^n \to \{0,1\}^n$ to be the map that "flips" the bits of a given string $x$ then it maps every $x \in \overline{A}$ to an output FLIP$(x) \in A$ in a one-to-one way, and so it demonstrates that $|\overline{A}| \leq |A|$, which implies that $\Pr[A] \geq \Pr[\overline{A}]$ and hence $\Pr[A] \geq 1/2$.]

The bipartite matching problem turns out to have a polynomial-time algorithm, since we can reduce finding a matching in $G$ to finding a maximum flow (or equivalently, minimum cut) in a related graph $G'$ (see Fig. 19.3). However, we will see a different probabilistic algorithm to determine whether a graph contains such a matching.

[Figure 19.3: The bipartite matching problem in the graph $G = (L \cup R, E)$ can be reduced to the minimum cut problem in the graph $G'$ obtained by adding vertices $s, t$ to $G$, connecting $s$ with $L$ and connecting $t$ with $R$.]

Let us label $G$'s vertices as $L = \{\ell_0, \ldots, \ell_{n-1}\}$ and $R = \{r_0, \ldots, r_{n-1}\}$. A matching $M$ corresponds to a permutation $\pi \in S_n$ (i.e., a one-to-one and onto function $\pi : [n] \to [n]$) where for every $i \in [n]$, we define $\pi(i)$ to be the unique $j$ such that $M$ contains the edge $\{\ell_i, r_j\}$. Define an $n \times n$ matrix $A = A(G)$ where $A_{i,j} = 1$ if and only if the edge $\{\ell_i, r_j\}$ is present and $A_{i,j} = 0$ otherwise. The correspondence between matchings and permutations implies the following claim:

Lemma 19.6 — Matching polynomial. Define $P = P(G)$ to be the polynomial mapping $\mathbb{R}^{n^2}$ to $\mathbb{R}$ where

$$P(x_{0,0}, \ldots, x_{n-1,n-1}) = \sum_{\pi \in S_n} \left( \prod_{i=0}^{n-1} sign(\pi) A_{i,\pi(i)} \right) \prod_{i=0}^{n-1} x_{i,\pi(i)} \;\;\; (19.7)$$

Then $G$ has a perfect matching if and only if $P$ is not identically zero. That is, $G$ has a perfect matching if and only if there exists some assignment $x = (x_{i,j})_{i,j \in [n]} \in \mathbb{R}^{n^2}$ such that $P(x) \neq 0$.

[Footnote 4: The sign of a permutation $\pi : [n] \to [n]$, denoted by $sign(\pi)$, can be defined in several equivalent ways, one of which is that $sign(\pi) = (-1)^{INV(\pi)}$ where $INV(\pi)$ is the number of pairs of elements that are inverted by $\pi$ (i.e., $INV(\pi) = |\{(x, y) \in [n]^2 : x < y \wedge \pi(x) > \pi(y)\}|$). The importance of the term $sign(\pi)$ is that it makes $P$ equal to the determinant of the matrix $(x_{i,j})$ and hence efficiently computable.]

Proof. If $G$ has a perfect matching $M^*$, then let $\pi^*$ be the permutation corresponding to $M^*$ and let $x^* \in \mathbb{Z}^{n^2}$ be defined as follows: $x^*_{i,j} = 1$ if $j = \pi^*(i)$ and $x^*_{i,j} = 0$ otherwise. Note that for every $\pi \neq \pi^*$, $\prod_{i=0}^{n-1} x^*_{i,\pi(i)} = 0$ but $\prod_{i=0}^{n-1} x^*_{i,\pi^*(i)} = 1$. Hence $P(x^*)$ will equal $sign(\pi^*) \prod_{i=0}^{n-1} A_{i,\pi^*(i)}$. But since $M^*$ is a perfect matching in $G$, $\prod_{i=0}^{n-1} A_{i,\pi^*(i)} = 1$ and hence $P(x^*) = sign(\pi^*) \neq 0$.

On the other hand, suppose that $P$ is not identically zero. By (19.7), this means that at least one of the terms $\prod_{i=0}^{n-1} A_{i,\pi(i)}$ is not equal to zero. But then this permutation $\pi$ must be a perfect matching in $G$. ■

As we've seen before, for every $x \in \mathbb{R}^{n^2}$, we can compute $P(x)$ by simply computing the determinant of the matrix $A(x)$, which is obtained by replacing $A_{i,j}$ with $A_{i,j} x_{i,j}$. This reduces testing perfect matching to the zero testing problem for polynomials: given some polynomial $P(\cdot)$, test whether $P$ is identically zero or not. The intuition behind our randomized algorithm for zero testing is the following:

If a polynomial is not identically zero, then it can't have "too many" roots.

This intuition sort of makes sense. For one variable polynomials, we know that a nonzero linear function has at most one root, a quadratic function (e.g., a parabola) has at most two roots, and generally a degree $d$ equation has at most $d$ roots. While in more than one variable there can be an infinite number of roots (e.g., the polynomial $x_0 + y_0$ vanishes on the line $y = -x$) it is still the case that the set of roots is very "small" compared to the set of all inputs. For example, the roots of a bivariate polynomial form a curve, the roots of a three-variable polynomial form a surface, and more generally the roots of an $n$-variable polynomial are a space of dimension $n - 1$.

[Figure 19.4: A degree $d$ curve in one variable can have at most $d$ roots. In higher dimensions, an $n$-variate degree-$d$ polynomial can have an infinite number of roots, though the set of roots will be an $n - 1$ dimensional surface. Over a finite field $\mathbb{F}$, an $n$-variate degree $d$ polynomial has at most $d|\mathbb{F}|^{n-1}$ roots.]

This intuition leads to the following simple randomized algorithm:

To decide if $P$ is identically zero, choose a "random" input $x$ and check if $P(x) \neq 0$.

This makes sense: if there are only "few" roots, then we expect that with high probability the random input $x$ is not going to be one of those roots. However, to transform this into an actual algorithm, we

need to make both the intuition and the notion of a “random” input precise. Choosing a random real number is quite problematic, espe- cially when you have only a finite number of coins at your disposal, and so we start by reducing the task to a finite setting. We will use the following result:

Theorem 19.7 — Schwartz–Zippel lemma. For every integer $q$, and polynomial $P : \mathbb{R}^n \to \mathbb{R}$ with integer coefficients: if $P$ has degree at most $d$ and is not identically zero, then it has at most $dq^{n-1}$ roots in the set $[q]^n = \{(x_0, \ldots, x_{n-1}) : x_i \in \{0, \ldots, q-1\}\}$.

We omit the (not too complicated) proof of Theorem 19.7. We remark that it holds not just over the real numbers but over any field as well. Since the matching polynomial $P$ of Lemma 19.6 has degree at most $n$, Theorem 19.7 leads directly to a simple algorithm for testing if it is nonzero:

Algorithm Perfect-Matching:

Input: Bipartite graph $G$ on $2n$ vertices $\{\ell_0, \ldots, \ell_{n-1}, r_0, \ldots, r_{n-1}\}$.

Operation:

1. For every $i, j \in [n]$, choose $x_{i,j}$ independently at random from $[2n] = \{0, \ldots, 2n - 1\}$.

2. Compute the determinant of the matrix $A(x)$ whose $(i,j)^{th}$ entry equals $x_{i,j}$ if the edge $\{\ell_i, r_j\}$ is present and $0$ otherwise.

3. Output no perfect matching if this determinant is zero, and output perfect matching otherwise.
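Here is a Python sketch of this algorithm. We compute the determinant exactly over the rationals, since a floating-point determinant could misreport zero; the helper names and the edge-set representation are our own choices:

    import random
    from fractions import Fraction

    def determinant(M):
        # Exact determinant via Gaussian elimination over the rationals,
        # so that "determinant equals zero" is tested without rounding error.
        n, sign = len(M), 1
        M = [[Fraction(v) for v in row] for row in M]
        for c in range(n):
            piv = next((r for r in range(c, n) if M[r][c] != 0), None)
            if piv is None:
                return Fraction(0)
            if piv != c:
                M[c], M[piv] = M[piv], M[c]
                sign = -sign
            for r in range(c + 1, n):
                f = M[r][c] / M[c][c]
                for j in range(c, n):
                    M[r][j] -= f * M[c][j]
        out = Fraction(sign)
        for i in range(n):
            out *= M[i][i]
        return out

    def probably_has_perfect_matching(n, edges):
        # edges: set of pairs (i, j) meaning the edge {ell_i, r_j} is present.
        # Substitute an independent random value in [2n] for each present edge.
        A = [[random.randrange(2 * n) if (i, j) in edges else 0
              for j in range(n)] for i in range(n)]
        return determinant(A) != 0

By Theorem 19.7, when a perfect matching exists the determinant is nonzero with probability at least $1/2$ over the random substitution (degree $n$, range size $2n$), and this success probability can be amplified by repetition as usual.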

This algorithm can be improved further (e.g., see Exercise 19.5). While it is not necessarily faster than the cut-based algorithms for perfect matching, it does have some advantages. In particular, it is more amenable to parallelization. (However, it also has the significant disadvantage that it does not produce a matching but only states that one exists.) The Schwartz–Zippel Lemma, and the associated zero testing algorithm for polynomials, is widely used across computer science, including in several settings where we have no known deterministic algorithm matching their performance.

Chapter Recap

• Using concentration results, we can amplify in polynomial time the success probability of a probabilistic algorithm from a mere $1/p(n)$ to $1 - 2^{-q(n)}$ for all polynomials $p$ and $q$.

• There are several randomized algorithms that are better in various senses (e.g., simpler, faster, or other advantages) than the best known deterministic algorithm for the same problem.

19.2 EXERCISES

Exercise 19.1 — Amplification for max cut. Prove Lemma 19.3. ■

Exercise 19.2 — Deterministic max cut algorithm. [TODO: add exercise to give a deterministic max cut algorithm that cuts at least $m/2$ edges. Talk about greedy approach.] ■

Exercise 19.3 — Simulating distributions using coins. Our model for probability involves tossing $n$ coins, but sometimes algorithms require sampling from other distributions, such as selecting a uniform number in $\{0, \ldots, M-1\}$ for some $M$. Fortunately, we can simulate this with an exponentially small probability of error: prove that for every $M$, if $n > k\lceil \log M \rceil$, then there is a function $F : \{0,1\}^n \to \{0, \ldots, M-1\} \cup \{\perp\}$ such that (1) the probability that $F(x) = \perp$ is at most $2^{-k}$ and (2) the distribution of $F(x)$ conditioned on $F(x) \neq \perp$ is equal to the uniform distribution over $\{0, \ldots, M-1\}$. ■

[Footnote 6: Hint: Think of $x \in \{0,1\}^n$ as choosing $k$ numbers $y_1, \ldots, y_k \in \{0, \ldots, 2^{\lceil \log M \rceil} - 1\}$. Output the first such number that is in $\{0, \ldots, M-1\}$.]

Exercise 19.4 — Better WalkSAT analysis.

1. Prove that for every $\epsilon > 0$, if $n$ is large enough then for every $x^* \in \{0,1\}^n$, $\Pr_{x \sim \{0,1\}^n}[\Delta(x, x^*) \leq n/3] \leq 2^{-(1 - H(1/3) - \epsilon)n}$, where $H(p) = p \log(1/p) + (1-p)\log(1/(1-p))$ is the same function as in Exercise 18.10.

2. Prove that $2^{1 - H(1/4) + (1/4)\log 3} = (3/2)$.

3. Use the above to prove that for every $\delta > 0$ and large enough $n$, if we set $T = 1000 \cdot (3/2 + \delta)^n$ and $S = n/4$ in the WalkSAT algorithm then for every satisfiable 3CNF $\varphi$, the probability that we output unsatisfiable is at most $1/2$. ■

Exercise 19.5 — Faster bipartite matching (challenge). (to be completed: improve the matching algorithm by working modulo a prime) ■

19.3 BIBLIOGRAPHICAL NOTES

The books of Motwani and Raghavan [MR95] and Mitzenmacher and Upfal [MU17] are two excellent resources for randomized algorithms. Some of the history of the discovery of the Monte Carlo algorithm is covered here.

19.4 ACKNOWLEDGEMENTS

Learning Objectives:
• Formal definition of probabilistic polynomial time: the class BPP.
• Proof that every function in BPP can be computed by $poly(n)$-sized NAND-CIRC programs/circuits.
• Relations between BPP and NP.
• Pseudorandom generators

20 Modeling randomized computation

“Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.” John von Neumann, 1951.

So far we have described randomized algorithms in an informal way, assuming that an operation such as "pick a string $x \in \{0,1\}^n$" can be done efficiently. We have neglected to address two questions:

1. How do we actually efficiently obtain random strings in the physical world?

2. What is the mathematical model for randomized computations, and is it more powerful than deterministic computation?

The first question is of both practical and theoretical importance, but for now let’s just say that there are various physical sources of “random” or “unpredictable” data. A user’s mouse movements and typing pattern, (non solid state) hard drive and network latency, thermal noise, and radioactive decay have all been used as sources for randomness (see discussion in Section 20.8). For example, many Intel chips come with a random number generator built in. One can even build mechanical coin tossing machines (see Fig. 20.1). In this chapter we focus on the second question: formally modeling probabilistic computation and studying its power. We will show that:

1. We can define the class BPP that captures all Boolean functions that can be computed in polynomial time by a randomized algorithm. Crucially BPP is still very much a worst case class of computation: the probability is only over the choice of the random coins of the algorithm, as opposed to the choice of the input.

2. We can amplify the success probability of randomized algorithms, and as a result the class BPP would be identical if we changed the required success probability to any number $p$ that lies strictly between $1/2$ and $1$ (and in fact any number in the range $1/2 + 1/q(n)$ to $1 - 2^{-q(n)}$ for any polynomial $q(n)$).

[Figure 20.1: A mechanical coin tosser built for Percy Diaconis by Harvard technicians Steve Sansone and Rick Haggerty.]


3. Though, as is the case for P and NP, there is much we do not know about the class BPP, we can establish some relations between BPP and the other complexity classes we saw before. In particular we will show that P ⊆ BPP ⊆ EXP and BPP ⊆ P/poly.

4. While the relation between BPP and NP is not known, we can show that if P = NP then BPP = P.

5. We also show that the concept of NP completeness applies equally well if we use randomized algorithms as our model of "efficient computation". That is, if a single NP complete problem has a randomized polynomial-time algorithm, then all of NP can be computed in polynomial-time by randomized algorithms.

6. Finally we will discuss the question of whether BPP = P and show some of the intriguing evidence that the answer might actually be "Yes", using the concept of pseudorandom generators.

20.1 MODELING RANDOMIZED COMPUTATION

Modeling randomized computation is actually quite easy. We can add the following operation to any programming language such as NAND-TM, NAND-RAM, NAND-CIRC etc.:

foo = RAND()

where foo is a variable. The result of applying this operation is that foo is assigned a random bit in $\{0,1\}$. (Every time the RAND operation is invoked it returns a fresh independent random bit.) We call the programming languages that are augmented with this extra operation RNAND-TM, RNAND-RAM, and RNAND-CIRC respectively.

Similarly, we can easily define randomized Turing machines as Turing machines in which the transition function $\delta$ gets as an extra input (in addition to the current state and symbol read from the tape) a bit $b \in \{0,1\}$ that in each step is chosen at random. Of course the function can ignore this bit (and have the same output regardless of whether $b = 0$ or $b = 1$) and hence randomized Turing machines generalize deterministic Turing machines.

We can use the RAND() operation to define the notion of a function being computed by a randomized $T(n)$ time algorithm for every nice time bound $T : \mathbb{N} \to \mathbb{N}$, as well as the notion of a finite function being computed by a size $S$ randomized NAND-CIRC program (or, equivalently, a randomized circuit with $S$ gates that correspond to either the NAND or coin-tossing operations). However, for simplicity we will not define randomized computation in full generality, but simply focus on the class of functions that are computable by randomized algorithms running in polynomial time, which by historical convention is known as BPP:

Definition 20.1 — The class BPP. Let $F : \{0,1\}^* \to \{0,1\}$. We say that $F \in$ BPP if there exist constants $a, b \in \mathbb{N}$ and an RNAND-TM program $P$ such that for every $x \in \{0,1\}^*$, on input $x$, the program $P$ halts within at most $a|x|^b$ steps and

$$\Pr[P(x) = F(x)] \geq \frac{2}{3} \;\;\; (20.1)$$

where this probability is taken over the result of the RAND operations of $P$.

Note that the probability in (20.1) is taken only over the random choices in the execution of $P$ and not over the choice of the input $x$. In particular, as discussed in Big Idea 24, BPP is still a worst case complexity class, in the sense that if $F$ is in BPP then there is a polynomial-time randomized algorithm that computes $F$ with probability at least $2/3$ on every possible (and not just random) input.

The same polynomial-overhead simulation of NAND-RAM programs by NAND-TM programs we saw in Theorem 13.5 extends to randomized programs as well. Hence the class BPP is the same regardless of whether it is defined via RNAND-TM or RNAND-RAM programs. Similarly, we could have just as well defined BPP using randomized Turing machines. Because of these equivalences, below we will use the name "polynomial time randomized algorithm" to denote a computation that can be modeled by a polynomial-time RNAND-TM program, RNAND-RAM program, or a randomized Turing machine (or any programming language that includes a coin tossing operation). Since all these models are equivalent up to polynomial factors, you can use your favorite model to capture polynomial-time randomized algorithms without any loss in generality.

Solved Exercise 20.1 — Choosing from a set. Modern programming languages often involve not just the ability to toss a random coin in $\{0,1\}$ but also to choose an element at random from a set $S$. Show that you can emulate this primitive using coin tossing. Specifically, show that there is a randomized algorithm $A$ that on input a set $S$ of $m$ strings of length $n$, runs in time $poly(n, m)$ and outputs either an element $x \in S$ or "fail" such that

1. Letting $p$ denote the probability that $A$ outputs "fail", $p < 2^{-n}$ (a number small enough that it can be ignored).

2. For every $x \in S$, the probability that $A$ outputs $x$ is exactly $\frac{1-p}{m}$ (and so the output is uniform over $S$ if we ignore the tiny probability of failure). ■

Solution: If the size $m$ of $S$ is a power of two, that is $m = 2^\ell$ for some $\ell \in \mathbb{N}$, then we can choose a random element in $S$ by tossing $\ell$ coins to obtain a string $w \in \{0,1\}^\ell$ and then output the $i$-th element of $S$, where $i$ is the number whose binary representation is $w$. If $m$ is not a power of two, then our first attempt will be to let $\ell = \lceil \log m \rceil$ and do the same, but then output the $i$-th element of $S$ if $i \in [m]$ and output "fail" otherwise. Conditioned on not outputting "fail", this element is distributed uniformly in $S$. However, in the worst case, $2^\ell$ can be almost $2m$ and so the probability of fail might be close to half. To reduce the failure probability, we can repeat the experiment above $n$ times. Specifically, we will use the following algorithm:

Algorithm 20.2 — Sample from set.

Input: Set $S = \{x_0, \ldots, x_{m-1}\}$ with $x_i \in \{0,1\}^n$ for all $i \in [m]$.
Output: Either $x \in S$ or "fail"

1: Let $\ell \leftarrow \lceil \log m \rceil$
2: for $j = 0, 1, \ldots, n-1$ do
3:  Pick $w \sim \{0,1\}^\ell$
4:  Let $i \in [2^\ell]$ be the number whose binary representation is $w$.
5:  if $i < m$ then
6:   return $x_i$
7:  end if
8: end for
9: return "fail"

Conditioned on not failing, the output of Algorithm 20.2 is uniformly distributed in $S$. However, since $2^\ell < 2m$, the probability of failure in each iteration is less than $1/2$, and so the probability of failure in all of them is at most $(1/2)^n = 2^{-n}$. ■
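The rejection-sampling idea of Algorithm 20.2 translates almost line by line into Python (the function name and the list-based interface are our own):

    import math
    import random

    def sample_from_set(S, n):
        m = len(S)
        ell = max(1, math.ceil(math.log2(m)))    # ell = ceil(log m)
        for _ in range(n):                       # n independent attempts
            i = random.getrandbits(ell)          # ell fair coin tosses give i in [2**ell]
            if i < m:                            # accept: i indexes an element of S
                return S[i]
        return "fail"                            # probability at most 2**(-n)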

20.1.1 An alternative view: random coins as an "extra input"

While we presented randomized computation as adding an extra "coin tossing" operation to our programs, we can also model this as being given an additional extra input. That is, we can think of a randomized algorithm $A$ as a deterministic algorithm $A'$ that takes two inputs $x$ and $r$ where the second input $r$ is chosen at random from $\{0,1\}^m$ for some $m \in \mathbb{N}$ (see Fig. 20.2). The equivalence to Definition 20.1 is shown in the following theorem:

[Figure 20.2: The two equivalent views of randomized algorithms. We can think of such an algorithm as having access to an internal RAND() operation that outputs a random independent value in $\{0,1\}$ whenever it is invoked, or we can think of it as a deterministic algorithm that in addition to the standard input $x \in \{0,1\}^n$ obtains an additional auxiliary input $r \in \{0,1\}^m$ that is chosen uniformly at random.]

Theorem 20.3 — Alternative characterization of BPP. Let $F : \{0,1\}^* \to \{0,1\}$. Then $F \in$ BPP if and only if there exist $a, b \in \mathbb{N}$ and $G : \{0,1\}^* \to \{0,1\}$ such that $G$ is in P and for every $x \in \{0,1\}^*$,

$$\Pr_{r \sim \{0,1\}^{a|x|^b}}[G(xr) = F(x)] \geq \frac{2}{3}. \;\;\; (20.2)$$

Proof Idea: The idea behind the proof is that, as illustrated in Fig. 20.2, we can simply replace sampling a random coin with reading a bit from the extra "random input" $r$ and vice versa. To prove this rigorously we need to work through some slightly cumbersome formal notation. This might be one of those proofs that is easier to work out on your own than to read.

Proof of Theorem 20.3. We start by showing the "only if" direction. Let $F \in$ BPP and let $P$ be an RNAND-TM program that computes $F$ as per Definition 20.1, and let $a, b \in \mathbb{N}$ be such that on every input of length $n$, the program $P$ halts within at most $an^b$ steps. We will construct a polynomial-time algorithm $P'$ such that for every $x \in \{0,1\}^n$, if we set $m = an^b$, then

$$\Pr_{r \sim \{0,1\}^m}[P'(xr) = 1] = \Pr[P(x) = 1] \;\;\; (20.3)$$

where the probability in the righthand side is taken over the RAND() operations in $P$. In particular this means that if we define $G(xr) = P'(xr)$ then the function $G$ satisfies the conditions of (20.2). The algorithm $P'$ will be very simple: it simulates the program $P$, maintaining a counter $i$ initialized to $0$. Every time that $P$ makes a RAND() operation, the program $P'$ will supply the result from $r_i$ and increment $i$ by one. We will never "run out" of bits, since the running time of $P$ is at most $an^b$ and hence it can make at most this number of RAND() calls. The output of $P'(xr)$ for a random $r \sim \{0,1\}^m$ will be distributed identically to the output of $P(x)$.

For the other direction, given a function $G \in$ P satisfying the condition (20.2) and a NAND-TM program $P'$ that computes $G$ in polynomial time, we can construct an RNAND-TM program $P$ that computes $F$ in polynomial time. On input $x \in \{0,1\}^n$, the program $P$ will simply use the RAND() instruction $an^b$ times to fill an array R[$0$], ..., R[$an^b - 1$] and then execute the original program $P'$ on input $xr$ where $r_i$ is the $i$-th element of the array R. Once again, it is clear that if $P'$ runs in polynomial time then so will $P$, and for every input $x$ and $r \in \{0,1\}^{an^b}$, the output of $P$ on input $x$ when the coin tosses outcome is $r$ is equal to $P'(xr)$. ■
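The simulation in the "only if" direction can be sketched directly in Python: we answer each coin toss of the algorithm from a fixed tape, so the algorithm becomes a deterministic function of $(x, r)$ (the names and interface below are our own):

    def run_with_tape(P, x, r):
        # P is written against a rand() primitive; we answer its i-th
        # rand() call with bit r[i] of the fixed tape, as in the proof.
        counter = [0]                  # mutable so the closure can update it
        def rand():
            b = r[counter[0]]
            counter[0] += 1
            return b
        return P(x, rand)

    # Example: run_with_tape(lambda x, rand: x ^ rand(), 1, [0]) returns 1.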

Remark 20.4 — Definitions of BPP and NP. The characterization of BPP in Theorem 20.3 is reminiscent of the characterization of NP in Definition 15.1, with the randomness in the case of BPP playing the role of the solution in the case of NP. However, there are important differences between the two:

• The definition of NP is "one sided": $F(x) = 1$ if there exists a solution $w$ such that $G(xw) = 1$ and $F(x) = 0$ if for every string $w$ of the appropriate length, $G(xw) = 0$. In contrast, the characterization of BPP is symmetric with respect to the cases $F(x) = 0$ and $F(x) = 1$.

• The relation between NP and BPP is not immediately clear. It is not known whether BPP ⊆ NP, NP ⊆ BPP, or these two classes are incomparable. It is however known (with a non-trivial proof) that if P = NP then BPP = P (see Theorem 20.11).

• Most importantly, the definition of NP is "ineffective," since it does not yield a way of actually finding whether there exists a solution among the exponentially many possibilities. By contrast, the definition of BPP gives us a way to compute the function in practice by simply choosing the second input at random.

"Random tapes". Theorem 20.3 motivates sometimes considering the randomness of an RNAND-TM (or RNAND-RAM) program as an extra input. As such, if $A$ is a randomized algorithm that on inputs of length $n$ makes at most $m$ coin tosses, we will often use the notation $A(x; r)$ (where $x \in \{0,1\}^n$ and $r \in \{0,1\}^m$) to refer to the result of executing $A$ when the coin tosses of $A$ correspond to the coordinates of $r$. This second, or "auxiliary," input is sometimes referred to as a "random tape." This terminology originates from the model of randomized Turing machines.

20.1.2 Success amplification of two-sided error algorithms

The number $2/3$ might seem arbitrary, but as we've seen in Chapter 19 it can be amplified to our liking:

Theorem 20.5 — Amplification. Let $F : \{0,1\}^* \to \{0,1\}$ be a Boolean function such that there is a polynomial $p : \mathbb{N} \to \mathbb{N}$ and a polynomial-time randomized algorithm $A$ satisfying that for every $x \in \{0,1\}^n$,

$$\Pr[A(x) = F(x)] \geq \frac{1}{2} + \frac{1}{p(n)}. \;\;\; (20.4)$$

Then for every polynomial $q : \mathbb{N} \to \mathbb{N}$ there is a polynomial-time randomized algorithm $B$ satisfying for every $x \in \{0,1\}^n$,

$$\Pr[B(x) = F(x)] \geq 1 - 2^{-q(n)}. \;\;\; (20.5)$$

Big Idea 25: We can amplify the success of randomized algorithms to a value that is arbitrarily close to $1$.

Proof Idea: The proof is the same as we've seen before in the case of maximum cut and other examples. We use the Chernoff bound to argue that if $A$ computes $F$ with probability at least $\frac{1}{2} + \epsilon$ and we run it $O(k/\epsilon^2)$ times, each time using fresh and independent random coins, then the probability that the majority of the answers will not be correct will be less than $2^{-k}$. Amplification can be thought of as a "polling" of the choices for randomness for the algorithm (see Fig. 20.3).

Proof of Theorem 20.5. Let $A$ be an algorithm satisfying (20.4). Set $\epsilon = \frac{1}{p(n)}$ and $k = q(n)$, where $p, q$ are the polynomials in the theorem statement. We can run $A$ on input $x$ for $t = 10k/\epsilon^2$ times, using fresh randomness in each execution, and compute the outputs $y_0, \ldots, y_{t-1}$. We output the value $y$ that appeared the largest number of times. Let $X_i$ be the random variable that is equal to $1$ if $y_i = F(x)$ and equal to $0$ otherwise. The random variables $X_0, \ldots, X_{t-1}$ are i.i.d. and satisfy $\mathbb{E}[X_i] = \Pr[X_i = 1] \geq 1/2 + \epsilon$, and hence by linearity of expectation $\mathbb{E}[\sum_{i=0}^{t-1} X_i] \geq t(1/2 + \epsilon)$. For the plurality value to be incorrect, it must hold that $\sum_{i=0}^{t-1} X_i \leq t/2$, which means that $\sum_{i=0}^{t-1} X_i$ is at least $\epsilon t$ far from its expectation. Hence by the Chernoff bound (Theorem 18.12), the probability that the plurality value is not correct is at most $2e^{-\epsilon^2 t}$, which is smaller than $2^{-k}$ for our choice of $t$. ■

[Figure 20.3: If $F \in$ BPP then there is a randomized polynomial-time algorithm $P$ with the following property: in the case $F(x) = 0$ two thirds of the "population" of random choices satisfy $P(x; r) = 0$, and in the case $F(x) = 1$ two thirds of the population satisfy $P(x; r) = 1$. We can think of amplification as a form of "polling" of the choices of randomness. By the Chernoff bound, if we poll a sample of $O(\frac{\log(1/\delta)}{\epsilon^2})$ random choices $r$, then with probability at least $1 - \delta$, the fraction of $r$'s in the sample satisfying $P(x; r) = 1$ will give us an estimate of the fraction of the population within an $\epsilon$ margin of error. This is the same calculation used by pollsters to determine the needed sample size in their polls.]

20.2 BPP AND NP COMPLETENESS

Since "noisy processes" abound in nature, randomized algorithms can be realized physically, and so it is reasonable to propose BPP rather than P as our mathematical model for "feasible" or "tractable" computation. One might wonder if this makes all the previous chapters

irrelevant, and in particular if the theory of NP completeness still applies to probabilistic algorithms. Fortunately, the answer is Yes:

Theorem 20.6 — NP hardness and BPP. Suppose that $F$ is NP-hard and $F \in$ BPP. Then NP ⊆ BPP.

Before seeing the proof, note that Theorem 20.6 implies that if there was a randomized polynomial time algorithm for any NP-complete problem such as 3SAT, ISET etc., then there would be such an algorithm for every problem in NP. Thus, regardless of whether our model of computation is deterministic or randomized algorithms, NP complete problems retain their status as the "hardest problems in NP."

Proof Idea: The idea is to simply run the reduction as usual, and plug it into the randomized algorithm instead of a deterministic one. It would be an excellent exercise, and a way to reinforce the definitions of NP- hardness and randomized algorithms, for you to work out the proof for yourself. However for the sake of completeness, we include this proof below.

Proof of Theorem 20.6. Suppose that $F$ is NP-hard and $F \in$ BPP. We will now show that this implies that NP ⊆ BPP. Let $G \in$ NP. By the definition of NP-hardness, it follows that $G \leq_p F$, or in other words there exists a polynomial-time computable function $R : \{0,1\}^* \to \{0,1\}^*$ such that $G(x) = F(R(x))$ for every $x \in \{0,1\}^*$. Now if $F$ is in BPP then there is a polynomial-time RNAND-TM program $P$ such that

$$\Pr[P(y) = F(y)] \geq 2/3 \;\;\; (20.6)$$

for every $y \in \{0,1\}^*$ (where the probability is taken over the random coin tosses of $P$). Hence we can get a polynomial-time RNAND-TM program $P'$ to compute $G$ by setting $P'(x) = P(R(x))$. By (20.6), $\Pr[P'(x) = F(R(x))] \geq 2/3$, and since $F(R(x)) = G(x)$ this implies that $\Pr[P'(x) = G(x)] \geq 2/3$, which proves that $G \in$ BPP. ■

Most of the results we've seen about NP hardness, including the search to decision reduction of Theorem 16.1, the decision to optimization reduction of Theorem 16.3, and the quantifier elimination result of Theorem 16.6, all carry over in the same way if we replace P with BPP as our model of efficient computation. Thus if NP ⊆ BPP then we get essentially all of the strange and wonderful consequences of P = NP. Unsurprisingly, we cannot rule out this possibility. In fact, unlike P = EXP, which is ruled out by the time hierarchy theorem, we

don't even know how to rule out the possibility that BPP = EXP! Thus a priori it's possible (though it seems highly unlikely) that randomness is a magical tool that allows us to speed up arbitrary exponential time computation. Nevertheless, as we discuss below, it is believed that randomization's power is much weaker and BPP lies in much more "pedestrian" territory.

[Footnote 1: At the time of this writing, the largest "natural" complexity class which we can't rule out being contained in BPP is the class NEXP, which we did not define in this course, but corresponds to non deterministic exponential time. See this paper for a discussion of this question.]

20.3 THE POWER OF RANDOMIZATION

A major question is whether randomization can add power to computation. Mathematically, we can phrase this as the following question: does BPP = P? Given what we've seen so far about the relations of other complexity classes such as P and NP, or NP and EXP, one might guess that:

1. We do not know the answer to this question.

2. But we suspect that BPP is different than P.

One would be correct about the former, but wrong about the latter. As we will see, we do in fact have reasons to believe that BPP = P. This can be thought of as supporting the extended Church–Turing hypothesis that deterministic polynomial-time Turing machines capture what can be feasibly computed in the physical world. We now survey some of the relations that are known between BPP and other complexity classes we have encountered. (See also Fig. 20.4.)

20.3.1 Solving BPP in exponential time

It is not hard to see that if $F$ is in BPP then it can be computed in exponential time.

Theorem 20.7 — Simulating randomized algorithms in exponential time. BPP ⊆ EXP.

The proof of Theorem 20.7 readily follows by enumerating over all the (exponentially many) choices for the random coins. We omit the formal proof, as doing it by yourself is an excellent way to get comfortable with Definition 20.1.

[Figure 20.4: Some possibilities for the relations between BPP and other complexity classes. Most researchers believe that BPP = P and that these classes are not powerful enough to solve NP-complete problems, let alone all problems in EXP. However, we have not even been able yet to rule out the possibility that randomness is a "silver bullet" that allows exponential speedup on all problems, and hence BPP = EXP. As we've already seen, we also can't rule out that P = NP. Interestingly, in the latter case, P = BPP.]
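At toy scale, the brute-force simulation behind Theorem 20.7 is a few lines of Python; the interface is our own, with $A(x, r)$ taken to be a deterministic 0/1 function of the input and the random tape, as in Theorem 20.3:

    from itertools import product

    def exp_time_simulation(A, x, m):
        # Take the majority answer of A over all 2**m possible tapes.
        # Since A is correct with probability at least 2/3 for every x,
        # this majority equals F(x); the running time is exponential in m.
        ones = sum(A(x, r) for r in product((0, 1), repeat=m))
        return 1 if 2 * ones > 2 ** m else 0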

20.3.2 Simulating randomized algorithms by circuits

We have seen in Theorem 13.12 that if $F$ is in P, then there is a polynomial $p : \mathbb{N} \to \mathbb{N}$ such that for every $n$, the restriction $F_{\upharpoonright n}$ of $F$ to inputs $\{0,1\}^n$ is in SIZE$(p(n))$. (In other words, that P ⊆ P/poly.) A priori it is not at all clear that the same holds for a function in BPP, but this does turn out to be the case.

[Figure 20.5: The possible guarantees for a randomized algorithm $A$ computing some function $F$. In the tables above, the columns correspond to different inputs and the rows to different choices of the random tape. A cell at position $r, x$ is colored green if $A(x; r) = F(x)$ (i.e., the algorithm outputs the correct answer) and red otherwise. The standard BPP guarantee corresponds to the middle figure, where for every input $x$, at least two thirds of the choices $r$ for a random tape will result in $A$ computing the correct value; that is, every column is colored green in at least two thirds of its coordinates. In the left figure we have an "average case" guarantee, where the algorithm is only guaranteed to output the correct answer with probability two thirds over a random input (i.e., at most one third of the total entries of the table are colored red, but there could be an all-red column). The right figure corresponds to the "offline BPP" case, where with probability at least two thirds over the random choice $r$, $r$ will be good for every input; that is, at least two thirds of the rows are all green. Theorem 20.8 (BPP ⊆ P/poly) is proven by amplifying the success of a BPP algorithm until we have the "offline BPP" guarantee, and then hardwiring the choice of the randomness $r$ to obtain a nonuniform deterministic algorithm.]

Theorem 20.8 — Randomness does not help for non uniform computation. BPP ⊆ P/poly.

That is, for every $F \in$ BPP, there exist some $a, b \in \mathbb{N}$ such that for every $n > 0$, $F_{\upharpoonright n} \in$ SIZE$(an^b)$ where $F_{\upharpoonright n}$ is the restriction of $F$ to inputs in $\{0,1\}^n$.

Proof Idea: The idea behind the proof is that we can first amplify by repetition the probability of success from $2/3$ to $1 - 0.1 \cdot 2^{-n}$. This will allow us to show that for every $n \in \mathbb{N}$ there exists a single fixed choice of "favorable coins", which is a string $r$ of length polynomial in $n$, such that if $r$ is used for the randomness then we output the right answer on all of the $2^n$ possible inputs. We can then use the standard "unravelling the loop" technique to transform an RNAND-TM program to an RNAND-CIRC program, and "hardwire" the favorable choice of random coins to transform the RNAND-CIRC program into a plain old deterministic NAND-CIRC program.

Proof of Theorem 20.8. Suppose that $F \in$ BPP. Let $P$ be a polynomial-time RNAND-TM program that computes $F$ as per Definition 20.1. Using Theorem 20.5, we can amplify the success probability of $P$ to obtain an RNAND-TM program $P'$ that is at most a factor of $O(n)$ slower (and hence still polynomial time) such that for every $x \in \{0,1\}^n$,

$$\Pr_{r \sim \{0,1\}^m}[P'(x; r) = F(x)] \geq 1 - 0.1 \cdot 2^{-n}, \;\;\; (20.7)$$

where $m$ is the number of coin tosses that $P'$ uses on inputs of length $n$. We use the notation $P'(x; r)$ to denote the execution of $P'$ on input $x$ when the result of the coin tosses corresponds to the string $r$.

For every $x \in \{0,1\}^n$, define the "bad" event $B_x$ to hold if $P'(x) \neq F(x)$, where the sample space for this event consists of the coins of $P'$. Then by (20.7), $\Pr[B_x] \leq 0.1 \cdot 2^{-n}$ for every $x \in \{0,1\}^n$. Since there are $2^n$ many such $x$'s, by the union bound we see that the probability of the union of the events $\{B_x\}_{x \in \{0,1\}^n}$ is at most $0.1$. This means that if we choose $r \sim \{0,1\}^m$, then with probability at least $0.9$ it will be the case that for every $x \in \{0,1\}^n$, $F(x) = P'(x; r)$. (Indeed, otherwise the event $B_x$ would hold for some $x$.) In particular, because of the mere fact that the probability of $\cup_{x \in \{0,1\}^n} B_x$ is smaller than $1$, this means that there exists a particular $r^* \in \{0,1\}^m$ such that

$$P'(x; r^*) = F(x) \;\;\; (20.8)$$

for every $x \in \{0,1\}^n$.

Now let us use the standard "unravelling the loop" technique and transform $P'$ into a NAND-CIRC program $Q$ of size polynomial in $n$, such that $Q(xr) = P'(x; r)$ for every $x \in \{0,1\}^n$ and $r \in \{0,1\}^m$. Then by "hardwiring" the values $r^*_0, \ldots, r^*_{m-1}$ in place of the last $m$ inputs of $Q$, we obtain a new NAND-CIRC program $Q_{r^*}$ that satisfies by (20.8) that $Q_{r^*}(x) = F(x)$ for every $x \in \{0,1\}^n$. This demonstrates that $F_{\upharpoonright n}$ has a polynomial sized NAND-CIRC program, hence completing the proof of Theorem 20.8. ■
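The existence argument in this proof can be illustrated, at toy scale, by simply searching for the favorable string $r^*$. The following Python sketch is our own and runs in exponential time; it serves only to make the union-bound argument concrete:

    from itertools import product

    def find_favorable_coins(A, F, n, m):
        # Search for one tape r* with A(x, r*) = F(x) for *every* n-bit
        # input x. Once A's error is amplified below 0.1 * 2**(-n), the
        # union bound guarantees such an r* exists; "hardwiring" it
        # yields a deterministic circuit for F on {0,1}^n.
        for r in product((0, 1), repeat=m):
            if all(A(x, r) == F(x) for x in product((0, 1), repeat=n)):
                return r
        return None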

20.4 DERANDOMIZATION

The proof of Theorem 20.8 can be summarized as follows: we can replace a $poly(n)$-time algorithm that tosses coins as it runs with an algorithm that uses a single set of coin tosses $r^* \in \{0,1\}^{poly(n)}$ which will be good enough for all inputs of size $n$. Another way to say it is that for the purposes of computing functions, we do not need "online" access to random coins and can generate a set of coins "offline" ahead of time, before we see the actual input.

But this does not really help us with answering the question of whether BPP equals P, since we still need to find a way to generate these "offline" coins in the first place. To derandomize an RNAND-TM program we will need to come up with a single deterministic

algorithm that will work for all input lengths. That is, unlike in the case of RNAND-CIRC programs, we cannot choose for every input length $n$ some string $r^* \in \{0,1\}^{poly(n)}$ to use as our random coins.

Can we derandomize randomized algorithms, or does randomness add an inherent extra power for computation? This is a fundamentally interesting question but is also of practical significance. Ever since people started to use randomized algorithms during the Manhattan project, they have been trying to remove the need for randomness and replace it with numbers that are selected through some deterministic process. Throughout the years this approach has often been used successfully, though there have been a number of failures as well.

[Footnote 2: One amusing anecdote is a recent case where scammers managed to predict the imperfect "pseudorandom generator" used by slot machines to cheat casinos. Unfortunately we don't know the details of how they did it, since the case was sealed.]

A common approach people used over the years was to replace the random coins of the algorithm by a "randomish looking" string that they generated through some arithmetic progress. For example, one can use the digits of $\pi$ for the random tape. Using these types of methods corresponds to what von Neumann referred to as a "state of sin". (Though this is a sin that he himself frequently committed, as generating true randomness in sufficient quantity was and still is often too expensive.) The reason that this is considered a "sin" is that such a procedure will not work in general. For example, it is easy to modify any probabilistic algorithm $A$, such as the ones we have seen in Chapter 19, to an algorithm $A'$ that is guaranteed to fail if the random tape happens to equal the digits of $\pi$. This means that the procedure "replace the random tape by the digits of $\pi$" does not yield a general way to transform a probabilistic algorithm to a deterministic one that will solve the same problem. Of course, this procedure does not always fail, but we have no good way to determine when it fails and when it succeeds. This reasoning is not specific to $\pi$ and holds for every deterministically produced string, whether it is obtained by $\pi$, $e$, the Fibonacci series, or anything else.

An algorithm that checks if its random tape is equal to $\pi$ and then fails seems to be quite silly, but this is but the "tip of the iceberg" for a very serious issue. Time and again people have learned the hard way that one needs to be very careful about producing random bits using deterministic means. As we will see when we discuss cryptography, many spectacular security failures and break-ins were the result of using "insufficiently random" coins.

20.4.1 Pseudorandom generators

So, we can't use any single string to "derandomize" a probabilistic algorithm. It turns out however, that we can use a collection of strings to do so. Another way to think about it is that rather than trying to eliminate the need for randomness, we start by focusing on reducing the

amount of randomness needed. (Though we will see that if we reduce the randomness sufficiently, we can eventually get rid of it altogether.) We make the following definition:

Definition 20.9 — Pseudorandom generator. A function $G : \{0,1\}^\ell \to \{0,1\}^m$ is a $(T, \epsilon)$-pseudorandom generator if for every circuit $C$ with $m$ inputs, one output, and at most $T$ gates,

$$\left| \Pr_{s \sim \{0,1\}^\ell}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^m}[C(r) = 1] \right| < \epsilon \;\;\; (20.9)$$

P This is a definition that’s worth reading more than once, and spending some time to digest it. Note that it takes several parameters:

• $T$ is the limit on the number of gates of the circuit $C$ that the generator needs to "fool". The larger $T$ is, the stronger the generator.

• $\epsilon$ is how close the output of the pseudorandom generator is to the true uniform distribution over $\{0,1\}^m$. The smaller $\epsilon$ is, the stronger the generator.

• $\ell$ is the input length and $m$ is the output length. If $\ell \geq m$ then it is trivial to come up with such a generator: on input $s \in \{0,1\}^\ell$, we can output $s_0, \ldots, s_{m-1}$. In this case $\Pr_{s \sim \{0,1\}^\ell}[P(G(s)) = 1]$ will simply equal $\Pr_{r \sim \{0,1\}^m}[P(r) = 1]$, no matter how many lines $P$ has. So, the smaller $\ell$ is and the larger $m$ is, the stronger the generator, and to get anything non-trivial, we need $m > \ell$.

Furthermore note that although our eventual goal is to fool probabilistic randomized algorithms that take an unbounded number of inputs, Definition 20.9 refers to finite and deterministic NAND-CIRC programs.

[Figure 20.6: A pseudorandom generator $G$ maps a short string $s \in \{0,1\}^\ell$ into a long string $r \in \{0,1\}^m$ such that a small program/circuit $P$ cannot distinguish between the case that it is provided a random input $r \sim \{0,1\}^m$ and the case that it is provided a "pseudorandom" input of the form $r = G(s)$ where $s \sim \{0,1\}^\ell$. The short string $s$ is sometimes called the seed of the pseudorandom generator, as it is a small object that can be thought of as yielding a large "tree of randomness".]
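For toy parameters one can compute the distinguishing advantage in Definition 20.9 exactly by enumeration; in this Python sketch (our own), any 0/1-valued function $C$ on $m$-bit tuples stands in for a small circuit:

    from itertools import product

    def advantage(C, G, ell, m):
        # The two probabilities of Definition 20.9, computed exactly by
        # enumerating all seeds and all m-bit strings (toy scale only).
        p_pseudo = sum(C(G(s)) for s in product((0, 1), repeat=ell)) / 2 ** ell
        p_uniform = sum(C(r) for r in product((0, 1), repeat=m)) / 2 ** m
        return abs(p_pseudo - p_uniform)

$G$ is $(T, \epsilon)$-pseudorandom when this quantity is below $\epsilon$ for every circuit of at most $T$ gates, a condition this sketch can of course only spot-check against particular tests $C$.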

We can think of a pseudorandom generator as a "randomness amplifier." It takes an input $s$ of $\ell$ bits chosen at random and expands these $\ell$ bits into an output $r$ of $m > \ell$ pseudorandom bits. If $\epsilon$ is small enough then the pseudorandom bits will "look random" to any NAND-CIRC program that is not too big. Still, there are two questions we haven't answered:

• What reason do we have to believe that pseudorandom generators with non-trivial parameters exist?

• Even if they do exist, why would such generators be useful to derandomize randomized algorithms? After all, Definition 20.9 does not involve 566 introduction to theoretical computer science

RNAND-TM or RNAND-RAM programs, but rather deterministic NAND-CIRC programs with no randomness and no loops.

We will now (partially) answer both questions. For the first question, let us come clean and confess we do not know how to prove that interesting pseudorandom generators exist. By interesting we mean pseudorandom generators that satisfy that $\epsilon$ is some small constant (say $\epsilon < 1/3$), $m > \ell$, and the function $G$ itself can be computed in $poly(m)$ time. Nevertheless, Lemma 20.12 (whose statement and proof is deferred to the end of this chapter) shows that if we only drop the last condition (polynomial-time computability), then there do in fact exist pseudorandom generators where $m$ is exponentially larger than $\ell$.

P At this point you might want to skip ahead and look at the statement of Lemma 20.12. However, since its proof is somewhat subtle, I recommend you defer reading it until you've finished reading the rest of this chapter.

20.4.2 From existence to constructivity The fact that there exists a pseudorandom generator does not mean that there is one that can be efficiently computed. However, it turns out that we can turn complexity “on its head” and use the assumed non existence of fast algorithms for problems such as 3SAT to obtain pseudorandom generators that can then be used to transform random- ized algorithms into deterministic ones. This is known as the Hardness vs Randomness paradigm. A number of results along those lines, most of which are outside the scope of this course, have led researchers to believe the following conjecture:

Optimal PRG conjecture: There is a polynomial-time computable function PRG $: \{0,1\}^* \to \{0,1\}$ that yields an exponentially secure pseudorandom generator. Specifically, there exists a constant $\delta > 0$ such that for every $\ell$ and $m < 2^{\delta\ell}$, if we define $G : \{0,1\}^\ell \to \{0,1\}^m$ by $G(s)_i = $ PRG$(s, i)$ for every $s \in \{0,1\}^\ell$ and $i \in [m]$, then $G$ is a $(2^{\delta\ell}, 2^{-\delta\ell})$ pseudorandom generator.

P The "optimal PRG conjecture" is worth reading more than once. What it posits is that we can obtain a $(T, \epsilon)$ pseudorandom generator $G$ such that every output bit of $G$ can be computed in time polynomial in the length $\ell$ of the input, where $T$ is exponentially large in $\ell$ and $\epsilon$ is exponentially small in $\ell$. (Note that we could not hope for the entire output to be computable in time polynomial in $\ell$, as just writing the output down will take too long.)

To understand why we call such a pseudorandom generator "optimal," it is a great exercise to convince yourself that, for example, there does not exist a $(2^{1.1\ell}, 2^{-1.1\ell})$ pseudorandom generator (in fact, the number $\delta$ in the conjecture must be smaller than $1$). To see that we can't have $T \gg 2^\ell$, note that if we allow a NAND-CIRC program with much more than $2^\ell$ lines then this NAND-CIRC program could "hardwire" inside it all the outputs of $G$ on all its $2^\ell$ inputs, and use that to distinguish between a string of the form $G(s)$ and a uniformly chosen string in $\{0,1\}^m$. To see that we can't have $\epsilon \ll 2^{-2\ell}$, note that by guessing the input $s$ (which will be successful with probability $2^{-\ell}$), we can obtain a small (i.e., $O(\ell)$ line) NAND-CIRC program that achieves a $2^{-2\ell}$ advantage in distinguishing a pseudorandom and uniform input. Working out these details is a highly recommended exercise.

We emphasize again that the optimal PRG conjecture is, as its name implies, a conjecture, and we still do not know how to prove it. In particular, it is stronger than the conjecture that P ≠ NP. But we do have some evidence for its truth. There is a spectrum of different types of pseudorandom generators, and there are weaker assumptions than the optimal PRG conjecture that suffice to prove that BPP = P. In particular this is known to hold under the assumption that there exists a function $F \in$ TIME$(2^{O(n)})$ and $\epsilon > 0$ such that for every sufficiently large $n$, $F_{\upharpoonright n}$ is not in SIZE$(2^{\epsilon n})$. The name "optimal PRG conjecture" is non standard. This conjecture is sometimes known in the literature as the existence of exponentially strong pseudorandom functions.

[Footnote 3: A pseudorandom generator of the form we posit, where each output bit can be computed individually in time polynomial in the seed length, is commonly known as a pseudorandom function generator. For more on the many interesting results and connections in the study of pseudorandomness, see this monograph of Salil Vadhan.]

20.4.3 Usefulness of pseudorandom generators

We now show that optimal pseudorandom generators are indeed very useful, by proving the following theorem:

Theorem 20.10 — Derandomization of BPP. Suppose that the optimal PRG conjecture is true. Then BPP = P.

Proof Idea: The optimal PRG conjecture tells us that we can achieve exponential expansion of $\ell$ truly random coins into as many as $2^{\delta\ell}$ "pseudorandom coins." Looked at from the other direction, it allows us to reduce the need for randomness by taking an algorithm that uses $m$ coins and converting it into an algorithm that only uses $O(\log m)$ coins. Now an algorithm of the latter type can be made fully deterministic by enumerating over all the $2^{O(\log m)}$ (which is polynomial in $m$) possibilities for its random choices.
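The enumeration step of this idea can be sketched as follows; the interface is our own, with $Q_x$ standing for the circuit obtained by hardwiring the input $x$ (a 0/1 function on $m$-bit tuples) and $G$ mapping $\ell$-bit seed tuples to $m$-bit tuples:

    from itertools import product

    def derandomized_decision(Q_x, G, ell):
        # Average Q_x over the generator's outputs on all 2**ell seeds.
        # If G fools Q_x up to 0.1, this estimates Pr_r[Q_x(r) = 1] to
        # within 0.1, which suffices since the true value is >= 2/3 or <= 1/3.
        p_tilde = sum(Q_x(G(s)) for s in product((0, 1), repeat=ell)) / 2 ** ell
        return 1 if p_tilde >= 1 / 2 else 0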

We now proceed with the proof details.

Proof of Theorem 20.10. Let $F \in$ BPP and let $P$ be a NAND-TM program and $a, b, c, d$ constants such that for every $x \in \{0,1\}^n$, $P(x)$ runs in at most $c \cdot n^d$ steps and $\Pr_{r \sim \{0,1\}^m}[P(x; r) = F(x)] \geq 2/3$. By "unrolling the loop" and hardwiring the input $x$, we can obtain for every input $x \in \{0,1\}^n$ a NAND-CIRC program $Q_x$ of at most, say, $T = 10c \cdot n^d$ lines, that takes $m$ bits of input and such that $Q_x(r) = P(x; r)$.

Now suppose that $G : \{0,1\}^\ell \to \{0,1\}^m$ is a $(T, 0.1)$ pseudorandom generator. Then we could deterministically estimate the probability $p(x) = \Pr_{r \sim \{0,1\}^m}[Q_x(r) = 1]$ up to $0.1$ accuracy in time $O(T \cdot 2^\ell \cdot m \cdot cost(G))$, where $cost(G)$ is the time that it takes to compute a single output bit of $G$. The reason is that we know that $\tilde{p}(x) = \Pr_{s \sim \{0,1\}^\ell}[Q_x(G(s)) = 1]$ will give us such an estimate for $p(x)$, and we can compute the probability $\tilde{p}(x)$ by simply trying all $2^\ell$ possibilities for $s$. Now, under the optimal PRG conjecture we can set $T = 2^{\delta\ell}$, or equivalently $\ell = \frac{1}{\delta}\log T$, and our total computation time is polynomial in $2^\ell = T^{1/\delta}$. Since $T \leq 10c \cdot n^d$, this running time will be polynomial in $n$. This completes the proof, since we are guaranteed that $\Pr_{r \sim \{0,1\}^m}[Q_x(r) = F(x)] \geq 2/3$, and hence estimating the probability $p(x)$ to within $0.1$ accuracy is sufficient to compute $F(x)$. ■

20.5 P = NP AND BPP VS P

Two computational complexity questions that we cannot settle are:

• Is P = NP? Where we believe the answer is negative.

• Is BPP = P? Where we believe the answer is positive.

However we can say that the "conventional wisdom" is correct on at least one of these questions. Namely, if we're wrong on the first count, then we'll be right on the second one:

Theorem 20.11 — Sipser–Gács Theorem. If P = NP then BPP = P.

P Before reading the proof, it is instructive to think why this result is not "obvious." If P = NP then given any randomized algorithm $A$ and input $x$, we will be able to figure out in polynomial time if there is a string $r \in \{0,1\}^m$ of random coins for $A$ such that $A(xr) = 1$. The problem is that even if $\Pr_{r \in \{0,1\}^m}[A(xr) = F(x)] \geq 0.9999$, it can still be the case that even when $F(x) = 0$ there exists a string $r$ such that $A(xr) = 1$.

The proof is rather subtle. It is much more important that you understand the statement of the theorem than that you follow all the details of the proof.

Proof Idea: The construction follows the "quantifier elimination" idea which we have seen in Theorem 16.6. We will show that for every $F \in$ BPP, we can reduce the question of whether some input $x$ satisfies $F(x) = 1$ to the question of whether a formula of the form $\exists_{u \in \{0,1\}^m} \forall_{v \in \{0,1\}^k} P(u, v)$ is true, where $m, k$ are polynomial in the length of $x$ and $P$ is polynomial-time computable. By Theorem 16.6, if P = NP then we can decide in polynomial time whether such a formula is true or false.

The idea behind this construction is that using amplification we can obtain a randomized algorithm $A$ for computing $F$ using $m$ coins such that for every $x \in \{0,1\}^n$, if $F(x) = 0$ then the set $S \subseteq \{0,1\}^m$ of coins that make $A$ output $1$ is extremely tiny, and if $F(x) = 1$ then it is very large. Now in the case $F(x) = 1$, one can show that this means that there exists a small number $k$ of "shifts" $s_0, \ldots, s_{k-1}$ such that the union of the sets $S \oplus s_i$ (i.e., sets of the form $\{s \oplus s_i \ | \ s \in S\}$) covers $\{0,1\}^m$, while in the case $F(x) = 0$ this union will always be of size at most $k|S|$, which is much smaller than $2^m$. We can express the condition that there exist $s_0, \ldots, s_{k-1}$ such that $\cup_{i \in [k]}(S \oplus s_i) = \{0,1\}^m$ as a statement with a constant number of quantifiers.
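The covering condition at the heart of this construction is easy to state in code; the following Python check (our own, with $m$-bit strings encoded as integers) is of course only feasible at toy scale:

    def shifts_cover(S, shifts, m):
        # Does the union of the sets S xor s_i over all shifts equal {0,1}^m?
        # S and shifts are collections of m-bit strings encoded as integers.
        covered = set()
        for s in shifts:
            covered.update(r ^ s for r in S)
        return len(covered) == 2 ** m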

Figure 20.7: If $F \in \mathbf{BPP}$ then through amplification we can ensure that there is an algorithm $A$ to compute $F$ on $n$-length inputs using $m$ coins such that $\Pr_{r\sim\{0,1\}^m}[A(xr) \neq F(x)] \ll 1/poly(m)$. Hence if $F(x) = 1$ then almost all of the choices for $r$ will cause $A(xr)$ to output $1$, while if $F(x) = 0$ then $A(xr) = 0$ for almost all $r$'s. To prove the Sipser–Gács Theorem we consider several "shifts" of the set $S$ of the coins $r$ such that $A(xr) = 1$. If $F(x) = 1$ then we can find a set of shifts $s_0, \ldots, s_{k-1}$ for which $\cup_{i\in[k]}(S \oplus s_i) = \{0,1\}^m$. If $F(x) = 0$ then for every such set $|\cup_{i\in[k]} S \oplus s_i| \le k|S| \ll 2^m$. We can phrase the question of whether there is such a set of shifts using a constant number of quantifiers, and so can solve it in polynomial time if $\mathbf{P} = \mathbf{NP}$.

Proof of Theorem 20.11. Let $F \in \mathbf{BPP}$. Using Theorem 20.5, there exists a polynomial-time algorithm $A$ such that for every $x \in \{0,1\}^n$, $\Pr_{r\sim\{0,1\}^m}[A(xr) = F(x)] \ge 1 - 2^{-n}$ where $m$ is polynomial in $n$. In particular (since an exponential dominates a polynomial, and we can always assume $n$ is sufficiently large), it holds that

$$\Pr_{r\sim\{0,1\}^m}[A(xr) = F(x)] \ge 1 - \frac{1}{10m^2}. \;\;(20.10)$$

Let $x \in \{0,1\}^n$, and let $S_x \subseteq \{0,1\}^m$ be the set $\{r \in \{0,1\}^m : A(xr) = 1\}$. By our assumption, if $F(x) = 0$ then $|S_x| \le \frac{1}{10m^2}2^m$ and if $F(x) = 1$ then $|S_x| \ge (1 - \frac{1}{10m^2})2^m$.

For a set $S \subseteq \{0,1\}^m$ and a string $s \in \{0,1\}^m$, we define the set $S \oplus s$ to be $\{r \oplus s : r \in S\}$ where $\oplus$ denotes the XOR operation. That is, $S \oplus s$ is the set $S$ "shifted" by $s$. Note that $|S \oplus s| = |S|$. (Please make sure that you see why this is true.) The heart of the proof is the following two claims:


CLAIM I: For every subset $S \subseteq \{0,1\}^m$, if $|S| \le \frac{2^m}{1000m}$, then for every $s_0, \ldots, s_{100m-1} \in \{0,1\}^m$, $\cup_{i\in[100m]}(S \oplus s_i) \subsetneq \{0,1\}^m$.

CLAIM II: For every subset $S \subseteq \{0,1\}^m$, if $|S| \ge \frac{2^m}{2}$ then there exists a set of strings $s_0, \ldots, s_{100m-1} \in \{0,1\}^m$ such that $\cup_{i\in[100m]}(S \oplus s_i) = \{0,1\}^m$.

CLAIM I and CLAIM II together imply the theorem. Indeed, they mean that under our assumptions, for every $x \in \{0,1\}^n$, $F(x) = 1$ if and only if

$$\exists_{s_0,\ldots,s_{100m-1}\in\{0,1\}^m} \cup_{i\in[100m]} (S_x \oplus s_i) = \{0,1\}^m \;\;(20.11)$$

which we can re-write as

$$\exists_{s_0,\ldots,s_{100m-1}\in\{0,1\}^m} \forall_{w\in\{0,1\}^m} \left(w \in (S_x \oplus s_0) \vee w \in (S_x \oplus s_1) \vee \cdots \vee w \in (S_x \oplus s_{100m-1})\right) \;\;(20.12)$$

or equivalently

$$\exists_{s_0,\ldots,s_{100m-1}\in\{0,1\}^m} \forall_{w\in\{0,1\}^m} \left(A(x(w \oplus s_0)) = 1 \vee A(x(w \oplus s_1)) = 1 \vee \cdots \vee A(x(w \oplus s_{100m-1})) = 1\right) \;\;(20.13)$$

which (since $A$ is computable in polynomial time) is exactly the type of statement shown in Theorem 16.6 to be decidable in polynomial time if $\mathbf{P} = \mathbf{NP}$. We see that all that is left is to prove CLAIM I and CLAIM II. CLAIM I follows immediately from the fact that

$$\left|\cup_{i\in[100m]} S_x \oplus s_i\right| \le \sum_{i=0}^{100m-1} |S_x \oplus s_i| = \sum_{i=0}^{100m-1} |S_x| = 100m|S_x| \;\;(20.14)$$

and under CLAIM I's assumption the righthand side is at most $100m \cdot \frac{2^m}{1000m} = \frac{2^m}{10} < 2^m$.

To prove CLAIM II, we will use a technique known as the probabilistic method (see the proof of Lemma 20.12 for a more extensive discussion). Note that this is a completely different use of probability than in the theorem statement; we just use the methods of probability to prove an existential statement.

Proof of CLAIM II: Let $S \subseteq \{0,1\}^m$ with $|S| \ge 0.5 \cdot 2^m$ be as in the claim's statement. Consider the following probabilistic experiment: we choose $100m$ random shifts $s_0, \ldots, s_{100m-1}$ independently at random in $\{0,1\}^m$, and consider the event GOOD that $\cup_{i\in[100m]}(S \oplus s_i) = \{0,1\}^m$. To prove CLAIM II it is enough to show that $\Pr[\text{GOOD}] > 0$, since that means that in particular there must exist shifts $s_0, \ldots, s_{100m-1}$ that satisfy this condition. For every $z \in \{0,1\}^m$, define the event $\text{BAD}_z$ to hold if $z \notin \cup_{i\in[100m]}(S \oplus s_i)$. The event GOOD holds if $\text{BAD}_z$ fails for every $z \in \{0,1\}^m$, and so our goal is to prove that $\Pr[\cup_{z\in\{0,1\}^m} \text{BAD}_z] < 1$. By the union bound, to show this, it is enough to show that $\Pr[\text{BAD}_z] < 2^{-m}$

for every $z \in \{0,1\}^m$. Define the event $\text{BAD}_z^i$ to hold if $z \notin S \oplus s_i$. Since every shift $s_i$ is chosen independently, for every fixed $z$ the events $\text{BAD}_z^0, \ldots, \text{BAD}_z^{100m-1}$ are mutually independent,⁴ and hence

$$\Pr[\text{BAD}_z] = \Pr\left[\cap_{i\in[100m]} \text{BAD}_z^i\right] = \prod_{i=0}^{100m-1} \Pr[\text{BAD}_z^i]. \;\;(20.15)$$

So the result will follow by showing that $\Pr[\text{BAD}_z^i] \le \frac{1}{2}$ for every $z \in \{0,1\}^m$ and $i \in [100m]$ (as that would allow us to bound the righthand side of (20.15) by $2^{-100m}$). In other words, we need to show that for every $z \in \{0,1\}^m$ and set $S \subseteq \{0,1\}^m$ with $|S| \ge \frac{2^m}{2}$,

$$\Pr_{s\sim\{0,1\}^m}[z \in S \oplus s] \ge \frac{1}{2}. \;\;(20.16)$$

To show this, we observe that $z \in S \oplus s$ if and only if $s \in S \oplus z$ (can you see why?). Hence we can rewrite the probability on the lefthand side of (20.16) as $\Pr_{s\sim\{0,1\}^m}[s \in S \oplus z]$ which simply equals $|S \oplus z|/2^m = |S|/2^m \ge 1/2$! This concludes the proof of CLAIM II and hence of Theorem 20.11. ■

⁴ The condition of independence here is subtle. It is not the case that all of the $2^m \times 100m$ events $\{\text{BAD}_z^i\}_{z\in\{0,1\}^m, i\in[100m]}$ are mutually independent. Only for a fixed $z \in \{0,1\}^m$, the $100m$ events of the form $\text{BAD}_z^i$ are mutually independent.
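For intuition, here is a small Python experiment (our own illustration, not part of the proof) that tests CLAIM II empirically on a toy scale: representing $\{0,1\}^m$ as the integers $0, \ldots, 2^m - 1$ with integer XOR playing the role of $\oplus$, a random set of density $1/2$ is covered by $100m$ random shifts essentially every time you run it.

```python
import random

m = 10                     # toy dimension; universe is {0,1}^m as ints
N = 2**m
S = set(random.sample(range(N), N // 2))     # random set of density 1/2

shifts = [random.randrange(N) for _ in range(100 * m)]
covered = set()
for s in shifts:
    covered |= {r ^ s for r in S}            # the set S XOR-shifted by s

print(len(covered) == N)   # True, except with tiny probability
```

In fact the union-bound calculation shows that slightly more than $m$ shifts already suffice with positive probability; the proof uses $100m$ only for comfortable slack.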

20.6 NON-CONSTRUCTIVE EXISTENCE OF PSEUDORANDOM GENERATORS (ADVANCED, OPTIONAL)

We now show that, if we don't insist on constructivity of pseudorandom generators, then we can show that there exist pseudorandom generators whose output is exponentially larger than the input length.

Lemma 20.12 — Existence of inefficient pseudorandom generators. There is some absolute constant $C$ such that for every $\epsilon, T$, if $\ell > C(\log T + \log(1/\epsilon))$ and $m \le T$, then there is a $(T, \epsilon)$ pseudorandom generator $G : \{0,1\}^\ell \to \{0,1\}^m$.

Proof Idea:
The proof uses an extremely useful technique known as the "probabilistic method" which is not too hard mathematically but can be confusing at first.⁵ The idea is to give a "non-constructive" proof of existence of the $(T, \epsilon)$ pseudorandom generator $G$ by showing that if $G$ was chosen at random, then the probability that it would be a valid $(T, \epsilon)$ pseudorandom generator is positive. In particular this means that there exists a single $G$ that is a valid pseudorandom generator. The probabilistic method is just a proof technique to demonstrate the existence of such a function. Ultimately, our goal is to show the existence of a deterministic function $G$ that satisfies the condition. ⋆

⁵ There is a whole (highly recommended) book by Alon and Spencer devoted to this method.


The above discussion might be rather abstract at this point, but it should become clearer after seeing the proof.

Proof of Lemma 20.12. Let $\epsilon, T, \ell, m$ be as in the lemma's statement. We need to show that there exists a function $G : \{0,1\}^\ell \to \{0,1\}^m$ that "fools" every $T$-line program $P$ in the sense of (20.9). We will show that this follows from the following claim:

Claim I: For every fixed NAND-CIRC program $P$, if we pick $G : \{0,1\}^\ell \to \{0,1\}^m$ at random then the probability that (20.9) is violated is at most $2^{-T^2}$.

Before proving Claim I, let us see why it implies Lemma 20.12. We can identify a function $G : \{0,1\}^\ell \to \{0,1\}^m$ with its "truth table" or simply the list of evaluations on all its possible $2^\ell$ inputs. Since each output is an $m$-bit string, we can also think of $G$ as a string in $\{0,1\}^{m \cdot 2^\ell}$. We define $\mathcal{F}_\ell^m$ to be the set of all functions from $\{0,1\}^\ell$ to $\{0,1\}^m$. As discussed above we can identify $\mathcal{F}_\ell^m$ with $\{0,1\}^{m\cdot 2^\ell}$ and choosing a random function $G \sim \mathcal{F}_\ell^m$ corresponds to choosing a random $m \cdot 2^\ell$-long bit string.

For every NAND-CIRC program $P$ let $B_P$ be the event that, if we choose $G$ at random from $\mathcal{F}_\ell^m$, then (20.9) is violated with respect to the program $P$. It is important to understand what is the sample space that the event $B_P$ is defined over: this event depends on the choice of $G$ and so $B_P$ is a subset of $\mathcal{F}_\ell^m$. An equivalent way to define the event $B_P$ is that it is the subset of all functions mapping $\{0,1\}^\ell$ to $\{0,1\}^m$ that violate (20.9), or in other words:

$$B_P = \left\{ G \in \mathcal{F}_\ell^m \;\middle|\; \left| \frac{1}{2^\ell}\sum_{s\in\{0,1\}^\ell} P(G(s)) - \frac{1}{2^m}\sum_{r\in\{0,1\}^m} P(r) \right| > \epsilon \right\} \;\;(20.17)$$

(We've replaced here the probability statements in (20.9) with the equivalent sums so as to reduce confusion as to what is the sample space that $B_P$ is defined over.)

To understand this proof it is crucial that you pause here and see how the definition of $B_P$ above corresponds to (20.17). This may well take re-reading the above text once or twice, but it is a good exercise at parsing probabilistic statements and learning how to identify the sample space that these statements correspond to.

Now, we've shown in Theorem 5.2 that up to renaming variables (which makes no difference to a program's functionality) there are $2^{O(T\log T)}$ NAND-CIRC programs of at most $T$ lines. Since $T\log T < T^2$ for sufficiently large $T$, this means that if Claim I is true, then by the union bound it holds that the probability of the union of $B_P$ over all NAND-CIRC programs of at most $T$ lines is at most $2^{O(T\log T)}2^{-T^2} < 0.1$ for sufficiently large $T$. What is important for

us about the number $0.1$ is that it is smaller than $1$. In particular this means that there exists a single $G^* \in \mathcal{F}_\ell^m$ such that $G^*$ does not violate (20.9) with respect to any NAND-CIRC program of at most $T$ lines, but that precisely means that $G^*$ is a $(T, \epsilon)$ pseudorandom generator.

Hence to conclude the proof of Lemma 20.12, it suffices to prove Claim I. Choosing a random $G : \{0,1\}^\ell \to \{0,1\}^m$ amounts to choosing $L = 2^\ell$ random strings $y_0, \ldots, y_{L-1} \in \{0,1\}^m$ and letting $G(x) = y_x$ (identifying $\{0,1\}^\ell$ and $[L]$ via the binary representation). This means that proving the claim amounts to showing that for every fixed function $P : \{0,1\}^m \to \{0,1\}$, if $L > 2^{C(\log T + \log(1/\epsilon))}$ (which by setting $C > 4$, we can ensure is larger than $10T^2/\epsilon^2$) then the probability that

$$\left| \frac{1}{L}\sum_{i=0}^{L-1} P(y_i) - \Pr_{s\sim\{0,1\}^m}[P(s) = 1] \right| > \epsilon \;\;(20.18)$$

is at most $2^{-T^2}$. (20.18) follows directly from the Chernoff bound. Indeed, if we let for every $i \in [L]$ the random variable $X_i$ denote $P(y_i)$, then since $y_0, \ldots, y_{L-1}$ are chosen independently at random, these are independently and identically distributed random variables with mean $\mathbb{E}_{y\sim\{0,1\}^m}[P(y)] = \Pr_{y\sim\{0,1\}^m}[P(y) = 1]$, and hence the probability that they deviate from their expectation by $\epsilon$ is at most $2 \cdot 2^{-\epsilon^2 L/2}$. ■
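The concentration step at the heart of Claim I is easy to see experimentally. The toy Python run below (illustrative only, with parameters far smaller than the proof's) fixes a single test $P$, samples a random "truth table" $y_0, \ldots, y_{L-1}$, and checks that the empirical acceptance frequency lands within $\epsilon$ of the true probability, as the Chernoff bound predicts.

```python
import random

m, L, eps = 16, 4000, 0.05
P = lambda r: r & 1     # fixed test: accept iff last bit is 1 (true prob 1/2)

# Random "truth table": L independent samples y_i ~ {0,1}^m, as integers.
ys = [random.randrange(2**m) for _ in range(L)]
empirical = sum(P(y) for y in ys) / L

# Chernoff/Hoeffding: Pr[|empirical - 1/2| > eps] decays exponentially
# in eps^2 * L, so this check passes except with tiny probability.
print(abs(empirical - 0.5) <= eps)
```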

Figure 20.8: The relation between BPP and the other complexity classes that we have seen. We know that $\mathbf{P} \subseteq \mathbf{BPP} \subseteq \mathbf{EXP}$ and $\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$, but we don't know how BPP compares with NP and can't rule out even $\mathbf{BPP} = \mathbf{EXP}$. Most evidence points to the possibility that $\mathbf{BPP} = \mathbf{P}$.

Chapter Recap

• We can model randomized algorithms by either adding a special "coin toss" operation or assuming an extra randomly chosen input.
• The class BPP contains the set of Boolean functions that can be computed by polynomial time randomized algorithms.
• BPP is a worst case class of computation: a randomized algorithm to compute a function must compute it correctly with high probability on every input.

• We can amplify the success probability of a randomized algorithm from any value strictly larger than $1/2$ into a success probability that is exponentially close to $1$.
• We know that $\mathbf{P} \subseteq \mathbf{BPP} \subseteq \mathbf{EXP}$.
• We also know that $\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$.
• The relation between BPP and NP is not known, but we do know that if $\mathbf{P} = \mathbf{NP}$ then $\mathbf{BPP} = \mathbf{P}$.
• Pseudorandom generators are objects that take a short random "seed" and expand it to a much longer output that "appears random" for efficient algorithms. We conjecture that exponentially strong pseudorandom generators exist. Under this conjecture, $\mathbf{BPP} = \mathbf{P}$.


20.7 EXERCISES

20.8 BIBLIOGRAPHICAL NOTES

In this chapter we ignore the issue of how we actually get random bits in practice. The output of many physical processes, whether it is thermal heat, network and hard drive latency, user typing pattern and mouse movements, and more can be thought of as a binary string sampled from some distribution $\mu$ that might have significant unpredictability (or entropy) but is not necessarily the uniform distribution over $\{0,1\}^n$. Indeed, as this paper shows, even (real-world) coin tosses do not have exactly the distribution of a uniformly random string. Therefore, to use the resulting measurements for randomized algorithms, one typically needs to apply a "distillation" or randomness extraction process to the raw measurements to transform them to the uniform distribution. Vadhan's book [Vad+12] is an excellent source for more discussion on both randomness extractors and pseudorandom generators.

The name BPP stands for "bounded probability polynomial time". This is an historical accident: this class probably should have been called RP or PP but both names were taken by other classes.

The proof of Theorem 20.8 actually yields more than its statement. We can use the same "unrolling the loop" arguments we've used before to show that the restriction to $\{0,1\}^n$ of every function in BPP is also computable by a polynomial-size RNAND-CIRC program (i.e., a NAND-CIRC program with the RAND operation). Like in the P vs SIZE$(poly(n))$ case, there are also functions outside BPP whose restrictions can be computed by polynomial-size RNAND-CIRC programs. Nevertheless the proof of Theorem 20.8 shows that even such functions can be computed by polynomial sized NAND-CIRC pro-

grams without using the rand operations. This can be phrased as saying that $BPSIZE(T(n)) \subseteq SIZE(O(nT(n)))$ (where BPSIZE is defined in the natural way using RNAND programs). The stronger version of Theorem 20.8 we mentioned can be phrased as saying that

$$\mathbf{BPP}_{/poly} = \mathbf{P}_{/poly}.$$

V ADVANCED TOPICS

Learning Objectives:
• Definition of perfect secrecy
• The one-time pad encryption scheme
• Necessity of long keys for perfect secrecy
• Computational secrecy and the derandomized one-time pad
• Public key encryption
• A taste of advanced topics

21
Cryptography

"Human ingenuity cannot concoct a cipher which human ingenuity cannot resolve.", Edgar Allan Poe, 1841

"A good disguise should not reveal the person's height", Shafi Goldwasser and Silvio Micali, 1982

""Perfect Secrecy" is defined by requiring of a system that after a cryptogram is intercepted by the enemy the a posteriori probabilities of this cryptogram representing various messages be identically the same as the a priori probabilities of the same messages before the interception. It is shown that perfect secrecy is possible but requires, if the number of messages is finite, the same number of possible keys.", Claude Shannon, 1945

"We stand today on the brink of a revolution in cryptography.", Whitfield Diffie and Martin Hellman, 1976

Cryptography - the art or science of "secret writing" - has been around for several millennia, and for almost all of that time Edgar Allan Poe's quote above held true. Indeed, the history of cryptography is littered with the figurative corpses of cryptosystems believed secure and then broken, and sometimes with the actual corpses of those who have mistakenly placed their faith in these cryptosystems. Yet, something changed in the last few decades, which is the "revolution" alluded to (and to a large extent initiated by) Diffie and Hellman's 1976 paper quoted above. New cryptosystems have been found that have not been broken despite being subjected to immense efforts involving both human ingenuity and computational power on a scale that completely dwarfs the "code breakers" of Poe's time. Even more amazingly, these cryptosystems are not only seemingly unbreakable, but they also achieve this under much harsher conditions. Not only do today's attackers have more computational power but they also have more data to work with. In Poe's age, an attacker would be lucky if they got access to more than a few encryptions of known messages. These days attackers might have massive amounts of data - terabytes


or more - at their disposal. In fact, with public key encryption, an attacker can generate as many ciphertexts as they wish. The key to this success has been a clearer understanding of both how to define security for cryptographic tools and how to relate this security to concrete computational problems. Cryptography is a vast and continuously changing topic, but we will touch on some of these issues in this chapter.

21.1 CLASSICAL CRYPTOSYSTEMS

A great many cryptosystems have been devised and broken throughout the ages. Let us recount just some of these stories. In 1587, Mary the queen of Scots, and the heir to the throne of England, wanted to arrange the assassination of her cousin, queen Elisabeth I of England, so that she could ascend to the throne and finally escape the house arrest under which she had been for the last 18 years. As part of this complicated plot, she sent a coded letter to Sir Anthony Babington. Mary used what's known as a substitution cipher where each letter is transformed into a different obscure symbol (see Fig. 21.1). At a first look, such a letter might seem rather inscrutable - a meaningless sequence of strange symbols. However, after some thought, one might recognize that these symbols repeat several times and moreover that different symbols repeat with different frequencies. Now it doesn't take a large leap of faith to assume that perhaps each symbol corresponds to a different letter and the more frequent symbols correspond to letters that occur in the alphabet with higher frequency. From this observation, there is a short gap to completely breaking the cipher, which was in fact done by queen Elisabeth's spies who used the decoded letters to learn of all the co-conspirators and to convict queen Mary of treason, a crime for which she was executed. Trusting in superficial security measures (such as using "inscrutable" symbols) is a trap that users of cryptography have been falling into again and again over the years. (As in many things, this is the subject of a great XKCD cartoon, see Fig. 21.2.)

Figure 21.1: Snippet from encrypted communication between queen Mary and Sir Babington

Figure 21.2: XKCD's take on the added security of using uncommon symbols

The Vigenère cipher is named after Blaise de Vigenère who described it in a book in 1586 (though it was invented earlier by Bellaso). The idea is to use a collection of substitution ciphers - if there are $n$ different ciphers then the first letter of the plaintext is encoded with the first cipher, the second with the second cipher, the $n$-th with the $n$-th cipher, and then the $(n+1)$-st letter is again encoded with the first cipher. The key is usually a word or a phrase of $n$ letters, and the $i$-th substitution cipher is obtained by shifting each letter $k_i$ positions in the alphabet. This "flattens" the frequencies and makes it much harder to do frequency analysis, which is why this cipher was considered "unbreakable" for 300+ years and got the nickname "le chiffre indéchiffrable" ("the unbreakable cipher").
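To make the description above concrete, here is a short Python implementation of the Vigenère cipher for the 26-letter uppercase alphabet (the function name is ours, chosen for illustration):

```python
def vigenere(text, key, decrypt=False):
    """Encrypt (or decrypt) an uppercase A-Z string with the Vigenère
    cipher: the i-th letter is shifted by key[i mod n] positions."""
    sign = -1 if decrypt else 1
    shifts = [ord(c) - ord('A') for c in key]
    out = []
    for i, c in enumerate(text):
        shift = sign * shifts[i % len(shifts)]
        out.append(chr((ord(c) - ord('A') + shift) % 26 + ord('A')))
    return ''.join(out)

ctext = vigenere("ATTACKATDAWN", "LEMON")
assert vigenere(ctext, "LEMON", decrypt=True) == "ATTACKATDAWN"
print(ctext)   # LXFOPVEFRNHR
```

Note how the code makes the weakness visible: once the key length $n$ is guessed, every $n$-th ciphertext letter is produced by a single fixed shift, so each such subsequence is just a simple substitution cipher that frequency analysis breaks.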

Nevertheless, Charles Babbage cracked the Vigenère cipher in 1854 (though he did not publish it). In 1863 Friedrich Kasiski broke the cipher and published the result. The idea is that once you guess the length of the cipher, you can reduce the task to breaking a simple substitution cipher which can be done via frequency analysis (can you see why?). Confederate generals used Vigenère regularly during the civil war, and their messages were routinely cryptanalyzed by Union officers.

Figure 21.3: Confederate Cipher Disk for implementing the Vigenère cipher

Figure 21.4: Confederate encryption of the message "Gen'l Pemberton: You can expect no help from this side of the river. Let Gen'l Johnston know, if possible, when you can attack the same point on the enemy's lines. Inform me also and I will endeavor to make a diversion. I have sent some caps. I subjoin a despatch from General Johnston."

The Enigma cipher was a mechanical cipher (looking like a typewriter, see Fig. 21.5) where each letter typed would get mapped into a different letter depending on the (rather complicated) key and current state of the machine which had several rotors that rotated at different paces. An identically wired machine at the other end could be used to decrypt. Just as many ciphers in history, this has also been believed by the Germans to be "impossible to break" and even quite late in the war they refused to believe it was broken despite mounting evidence to that effect. (In fact, some German generals refused to believe it was broken even after the war.) Breaking Enigma was a heroic effort which was initiated by the Poles and then completed by the British at Bletchley Park, with Alan Turing (of the Turing machines) playing a key role. As part of this effort the Brits built arguably the world's first large scale mechanical computation devices (though they looked more similar to washing machines than to iPhones). They were also helped along the way by some quirks and errors of the German operators. For example, the fact that their messages ended with "Heil Hitler" turned out to be quite useful.

Figure 21.5: In the Enigma mechanical cipher the secret key would be the settings of the rotors and internal wires. As the operator types up their message, the encrypted version appeared in the display area above, and the internal state of the cipher was updated (and so typing the same letter twice would generally result in two different letters output). Decrypting follows the same process: if the sender and receiver are using the same key then typing the ciphertext would result in the plaintext appearing in the display.

Here is one entertaining anecdote: the Enigma machine would never map a letter to itself. In March 1941, Mavis Batey, a cryptanalyst at Bletchley Park, received a very long message that she tried to decrypt. She then noticed a curious property - the message did not contain the letter "L".¹ She realized that the probability that no "L"'s appeared in the message is too small for this to happen by chance. Hence she surmised that the original message must have been composed only of L's. That is, it must have been the case that the operator, perhaps to test the machine, had simply sent out a message where he repeatedly pressed the letter "L". This observation helped her decode the next message, which helped inform of a planned Italian attack and secure a resounding British victory in what became known as "the Battle of Cape Matapan". Mavis also helped break another Enigma machine.

¹ Here is a nice exercise: compute (up to an order of magnitude) the probability that a 50-letter long message composed of random letters will end up not containing the letter "L".
Using the information she provided, the Brits were able to feed the Germans the false information that the main allied invasion would take place in Pas de Calais rather than in Normandy. In the words of General Eisenhower, the intelligence from Bletchley Park was of "priceless value". It made a huge difference for the Allied

war effort, thereby shortening World War II and saving millions of lives. See also this interview with Sir Harry Hinsley.

21.2 DEFINING ENCRYPTION

Many of the troubles that cryptosystem designers faced over history (and still face!) can be attributed to not properly defining or understanding what are the goals they want to achieve in the first place. Let us focus on the setting of private key encryption. (This is also known as "symmetric encryption"; for thousands of years, "private key encryption" was synonymous with encryption and only in the 1970's was the concept of public key encryption invented, see Definition 21.11.) A sender (traditionally called "Alice") wants to send a message (known also as a plaintext) $x \in \{0,1\}^*$ to a receiver (traditionally called "Bob"). They would like their message to be kept secret from an adversary who listens in or "eavesdrops" on the communication channel (and is traditionally called "Eve").

Alice and Bob share a secret key $k \in \{0,1\}^*$. (While the letter $k$ is often used elsewhere in this book to denote a natural number, in this chapter we use it to denote the string corresponding to a secret key.) Alice uses the key $k$ to "scramble" or encrypt the plaintext $x$ into a ciphertext $y$, and Bob uses the key $k$ to "unscramble" or decrypt the ciphertext $y$ back into the plaintext $x$. This motivates the following definition which attempts to capture what it means for an encryption scheme to be valid or "make sense", regardless of whether or not it is secure:

Definition 21.1 — Valid encryption scheme. Let $L : \mathbb{N} \to \mathbb{N}$ and $C : \mathbb{N} \to \mathbb{N}$ be two functions mapping natural numbers to natural numbers. A pair of polynomial-time computable functions $(E, D)$ mapping strings to strings is a valid private key encryption scheme (or encryption scheme for short) with plaintext length function $L(\cdot)$ and ciphertext length function $C(\cdot)$ if for every $n \in \mathbb{N}$, $k \in \{0,1\}^n$ and $x \in \{0,1\}^{L(n)}$, $|E_k(x)| = C(n)$ and

$$D(k, E(k, x)) = x. \;\;(21.1)$$

We will often write the first input (i.e., the key) to the encryption and decryption as a subscript and so can write (21.1) also as

$D_k(E_k(x)) = x$.

Figure 21.6: A private-key encryption scheme is a pair of algorithms $E, D$ such that for every key $k \in \{0,1\}^n$ and plaintext $x \in \{0,1\}^{L(n)}$, $y = E_k(x)$ is a ciphertext of length $C(n)$. The encryption scheme is valid if for every such $y$, $D_k(y) = x$. That is, the decryption of an encryption of $x$ is $x$, as long as both encryption and decryption use the same key.

Solved Exercise 21.1 — Lengths of ciphertext and plaintext. Prove that for every valid encryption scheme $(E, D)$ with functions $L, C$, it holds that $C(n) \ge L(n)$ for every $n$. ■

Solution: For every fixed key $k \in \{0,1\}^n$, the equation (21.1) implies that the map $y \mapsto D_k(y)$ inverts the map $x \mapsto E_k(x)$, which in particular means that the map $x \mapsto E_k(x)$ must be one to one. Hence its codomain must be at least as large as its domain, and since its domain is $\{0,1\}^{L(n)}$ and its codomain is $\{0,1\}^{C(n)}$ it follows that $C(n) \ge L(n)$. ■

Since the ciphertext length is always at least the plaintext length (and in most applications it is not much longer than that), we typically focus on the plaintext length $L(n)$ as the quantity to optimize in an encryption scheme. The larger $L(n)$ is, the better the scheme, since it means we need a shorter secret key to protect messages of the same length.
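For small parameters, Definition 21.1 can be checked exhaustively. Below is a hedged Python sketch (our own helper, not from the book) that verifies validity of a candidate scheme by brute force; the toy scheme shown is the identity scheme $E_k(x) = x$, which is perfectly valid but of course offers no secrecy at all - exactly the point made in the next section.

```python
from itertools import product

def is_valid(E, D, n, L, C):
    """Check Definition 21.1 by brute force: for all keys k in {0,1}^n
    and plaintexts x in {0,1}^L, |E(k,x)| == C and D(k, E(k,x)) == x."""
    for k in product('01', repeat=n):
        for x in product('01', repeat=L):
            k_, x_ = ''.join(k), ''.join(x)
            y = E(k_, x_)
            if len(y) != C or D(k_, y) != x_:
                return False
    return True

# Toy scheme: ignore the key entirely. Valid, but totally insecure.
E = lambda k, x: x
D = lambda k, y: y
print(is_valid(E, D, n=3, L=4, C=4))   # True
```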

21.3 DEFINING SECURITY OF ENCRYPTION

Definition 21.1 says nothing about the security of $E$ and $D$, and even allows the trivial encryption scheme that ignores the key altogether and sets $E_k(x) = x$ for every $x$. Defining security is not a trivial matter.

P You would appreciate the subtleties of defining security of encryption more if at this point you take a five minute break from reading, and try (possibly with a partner) to brainstorm on how you would mathematically define the notion that an encryption scheme is secure, in the sense that it protects the secrecy of the plaintext $x$.

Throughout history, many attacks on cryptosystems were rooted in the cryptosystem designers' reliance on "security through obscurity" - trusting that the fact their methods are not known to their enemy will protect them from being broken. This is a faulty assumption - if you reuse a method again and again (even with a different key each time) then eventually your adversaries will figure out what you are doing. And if Alice and Bob meet frequently in a secure location to decide on a new method, they might as well take the opportunity to exchange their secrets. These considerations led

Auguste Kerckhoffs in 1883 to state the following principle:

A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.²

² The actual quote is "Il faut qu'il n'exige pas le secret, et qu'il puisse sans inconvénient tomber entre les mains de l'ennemi", loosely translated as "The system must not require secrecy and can be stolen by the enemy without causing trouble". According to Steve Bellovin the NSA version is "assume that the first copy of any device we make is shipped to the Kremlin".

Why is it OK to assume the key is secret and not the algorithm? Because we can always choose a fresh key. But of course that won't

help us much if our key is “1234” or “passw0rd!”. In fact, if you use any deterministic algorithm to choose the key then eventually your adversary will figure this out. Therefore for security we must choose the key at random and can restate Kerckhoffs’s principle as follows: There is no secrecy without randomness

This is such a crucial point that it is worth repeating:

 Big Idea 26 There is no secrecy without randomness.

At the heart of every cryptographic scheme there is a secret key, and the secret key is always chosen at random. A corollary of that is that to understand cryptography, you need to know probability theory.

R Remark 21.2 — Randomness in the real world. Choosing the secrets for cryptography requires generating randomness, which is often done by measuring some "unpredictable" or "high entropy" data, and then applying hash functions to the result to "extract" a uniformly random string. Great care must be taken in doing this, and randomness generators often turn out to be the Achilles heel of secure systems.

In 2006 a programmer removed a line of code from the procedure to generate entropy in the OpenSSL package distributed by Debian since it caused a warning in some automatic verification code. As a result, for two years (until this was discovered) all the randomness generated by this procedure used only the process ID as an "unpredictable" source. This means that all communication done by users in that period is fairly easily breakable (and in particular, if some entities recorded that communication they could break it also retroactively). See XKCD's take on that incident.

In 2012 two separate teams of researchers scanned a large number of RSA keys on the web and found out that about 4 percent of them are easy to break. The main issue was devices such as routers, internet-connected printers and such. These devices sometimes run variants of Linux - a desktop operating system - but without a hard drive, mouse or keyboard, they don't have access to many of the entropy sources that desktop computers have. Coupled with some good old fashioned ignorance of cryptography and software bugs, this led to many keys that are downright trivial to break, see this blog post and this web page for more details.

Since randomness is so crucial to security, breaking the procedure to generate randomness can lead to a complete break of the system that uses this randomness. Indeed, the Snowden documents, combined with

observations of Shumow and Ferguson, strongly suggest that the NSA has deliberately inserted a trapdoor in one of the pseudorandom generators published by the National Institute of Standards and Technology (NIST). Fortunately, this generator wasn't widely adopted, but apparently the NSA did pay 10 million dollars to RSA Security so the latter would make this generator their default option in their products.

21.4 PERFECT SECRECY

If you think about encryption scheme security for a while, you might come up with the following principle for defining security: "An encryption scheme is secure if it is not possible to recover the key $k$ from $E_k(x)$". However, a moment's thought shows that the key is not really what we're trying to protect. After all, the whole point of an encryption is to protect the confidentiality of the plaintext $x$. So, we can try to define that "an encryption scheme is secure if it is not possible to recover the plaintext $x$ from $E_k(x)$". Yet it is not clear what this means either. Suppose that an encryption scheme reveals the first 10 bits of the plaintext $x$. It might still not be possible to recover $x$ completely, but on an intuitive level, this seems like it would be extremely unwise to use such an encryption scheme in practice. Indeed, often even partial information about the plaintext is enough for the adversary to achieve its goals. The above thinking led Shannon in 1945 to formalize the notion of perfect secrecy, which is that an encryption reveals absolutely nothing about the message. There are several equivalent ways to define it, but perhaps the cleanest one is the following:

Definition 21.3 — Perfect secrecy. A valid encryption scheme $(E, D)$ with plaintext length $L(\cdot)$ is perfectly secret if for every $n \in \mathbb{N}$ and plaintexts $x, x' \in \{0,1\}^{L(n)}$, the following two distributions $Y$ and $Y'$ over $\{0,1\}^*$ are identical:

• $Y$ is obtained by sampling $k \sim \{0,1\}^n$ and outputting $E_k(x)$.

• $Y'$ is obtained by sampling $k \sim \{0,1\}^n$ and outputting $E_k(x')$.

P This definition might take more than one reading to parse. Try to think of how this condition would correspond to your intuitive notion of "learning no information" about $x$ from observing $E_k(x)$, and to Shannon's quote in the beginning of this chapter.

In particular, suppose that you knew ahead of time that Alice sent either an encryption of $x$ or an encryption of $x'$. Would you learn anything new from observing the encryption of the message that Alice actually sent? It may help you to look at Fig. 21.7.

21.4.1 Example: Perfect secrecy in the battlefield
To understand Definition 21.3, suppose that Alice sends only one of two possible messages: "attack" or "retreat", which we denote by $x_0$ and $x_1$ respectively, and that she sends each one of those messages with probability $1/2$. Let us put ourselves in the shoes of Eve, the eavesdropping adversary. A priori we would have guessed that Alice sent either $x_0$ or $x_1$ with probability $1/2$. Now we observe $y = E_k(x_i)$ where $k$ is a uniformly chosen key in $\{0,1\}^n$. How does this new information cause us to update our beliefs on whether Alice sent the plaintext $x_0$ or the plaintext $x_1$?

Figure 21.7: For any key length $n$, we can visualize an encryption scheme $(E, D)$ as a graph with a vertex for every one of the $2^{L(n)}$ possible plaintexts and for every one of the ciphertexts in $\{0,1\}^*$ of the form $E_k(x)$ for $k \in \{0,1\}^n$ and $x \in \{0,1\}^{L(n)}$. For every plaintext $x$ and key $k$, we add an edge labeled $k$ between $x$ and $E_k(x)$. By the validity condition, if we pick any fixed key $k$, the map $x \mapsto E_k(x)$ must be one-to-one. The condition of perfect secrecy simply corresponds to requiring that every two plaintexts $x$ and $x'$ have exactly the same set of neighbors (or multi-set, if there are parallel edges).

P Before reading the next paragraph, you might want to try the analysis yourself. You may find it useful to look at the Wikipedia entry on Bayesian Inference or these MIT lecture notes.

Let us define $p_0(y)$ to be the probability (taken over $k \sim \{0,1\}^n$) that $y = E_k(x_0)$ and similarly $p_1(y)$ to be $\Pr_{k\sim\{0,1\}^n}[y = E_k(x_1)]$. Note that, since Alice chooses the message to send at random, our a priori probability for observing $y$ is $\frac{1}{2}p_0(y) + \frac{1}{2}p_1(y)$. However, as per Definition 21.3, the perfect secrecy condition guarantees that $p_0(y) = p_1(y)$! Let us denote the number $p_0(y) = p_1(y)$ by $p$. By the formula for conditional probability, the probability that Alice sent the message $x_0$ conditioned on our observation $y$ is simply

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\Pr[i = 0 \wedge y = E_k(x_i)]}{\Pr[y = E_k(x_i)]}. \;\;(21.2)$$

(The equation (21.2) is a special case of Bayes' rule which, although a simple restatement of the formula for conditional probability, is an extremely important and widely used tool in statistics and data analysis.) Since the probability that $i = 0$ and $y$ is the ciphertext $E_k(x_0)$ is equal to $\frac{1}{2} \cdot p_0(y)$, and the a priori probability of observing $y$ is $\frac{1}{2}p_0(y) + \frac{1}{2}p_1(y)$, we can rewrite (21.2) as

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\frac{1}{2}p_0(y)}{\frac{1}{2}p_0(y) + \frac{1}{2}p_1(y)} = \frac{p}{p + p} = \frac{1}{2} \;\;(21.3)$$

using the fact that $p_0(y) = p_1(y) = p$. This means that observing the ciphertext $y$ did not help us at all! We still would not be able to guess whether Alice sent "attack" or "retreat" with better than 50/50 odds! This example can be vastly generalized to show that perfect secrecy is indeed "perfect" in the sense that observing a ciphertext gives Eve no additional information about the plaintext beyond her a priori knowledge.
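You can replay Eve's Bayesian update in a few lines of Python. The sketch below (our own illustration) uses bitwise XOR as the encryption - the scheme we will see in the next section is perfectly secret - and computes $p_0(y)$, $p_1(y)$ and the posterior (21.3) by enumerating all keys.

```python
from itertools import product

n = 3
keys = [''.join(t) for t in product('01', repeat=n)]
E = lambda k, x: format(int(x, 2) ^ int(k, 2), f'0{n}b')  # XOR encryption

x0, x1 = '000', '101'          # "attack" and "retreat"
y = E(keys[5], x0)             # Eve observes some ciphertext y

p0 = sum(E(k, x0) == y for k in keys) / len(keys)
p1 = sum(E(k, x1) == y for k in keys) / len(keys)
posterior = (0.5 * p0) / (0.5 * p0 + 0.5 * p1)   # equation (21.3)
print(p0, p1, posterior)       # p0 == p1, so the posterior is exactly 0.5
```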

21.4.2 Constructing perfectly secret encryption
Perfect secrecy is an extremely strong condition, and implies that an eavesdropper does not learn any information from observing the ciphertext. You might think that an encryption scheme satisfying such a strong condition will be impossible, or at least extremely complicated, to achieve. However it turns out we can in fact obtain a perfectly secret encryption scheme fairly easily. Such a scheme for two-bit messages is illustrated in Fig. 21.8. In fact, this can be generalized to any number of bits:

Theorem 21.4 — One Time Pad (Vernam 1917, Shannon 1949). There is a perfectly secret valid encryption scheme $(E, D)$ with $L(n) = C(n) = n$.

Proof Idea:
Our scheme is the one-time pad also known as the "Vernam Cipher", see Fig. 21.9. The encryption is exceedingly simple: to encrypt a message $x \in \{0,1\}^n$ with a key $k \in \{0,1\}^n$ we simply output $y = E_k(x) = x \oplus k$ where $\oplus$ is the bitwise XOR operation that outputs the string corresponding to XORing each coordinate of $x$ and $k$. ⋆

Figure 21.8: A perfectly secret encryption scheme for two-bit keys and messages. The blue vertices represent plaintexts and the red vertices represent ciphertexts; each edge mapping a plaintext $x$ to a ciphertext $y = E_k(x)$ is labeled with the corresponding key $k$. Since there are four possible keys, the degree of the graph is four and it is in fact a complete bipartite graph. The encryption scheme is valid in the sense that for every $k \in \{0,1\}^2$, the map $x \mapsto E_k(x)$ is one-to-one, which in other words means that the set of edges labeled with $k$ is a matching.

Proof of Theorem 21.4. For two binary strings $a$ and $b$ of the same length $n$, we define $a \oplus b$ to be the string $c \in \{0,1\}^n$ such that $c_i = a_i + b_i \mod 2$ for every $i \in [n]$. The encryption scheme $(E, D)$ is defined as follows: $E_k(x) = x \oplus k$ and $D_k(y) = y \oplus k$. By the associative law of addition (which works also modulo two), $D_k(E_k(x)) = (x \oplus k) \oplus k = x \oplus (k \oplus k) = x \oplus 0^n = x$, using the fact that for every bit $\sigma \in \{0,1\}$, $\sigma + \sigma \mod 2 = 0$ and $\sigma + 0 = \sigma \mod 2$. Hence $(E, D)$ form a valid encryption.

To analyze the perfect secrecy property, we claim that for every $x \in \{0,1\}^n$, the distribution $Y_x = E_k(x)$ where $k \sim \{0,1\}^n$ is simply the uniform distribution over $\{0,1\}^n$, and hence in particular the distributions $Y_x$ and $Y_{x'}$ are identical for every $x, x' \in \{0,1\}^n$. Indeed, for every particular $y \in \{0,1\}^n$, the value $y$ is output by $Y_x$ if and only if $y = x \oplus k$ which holds if and only if $k = x \oplus y$. Since $k$ is chosen uniformly at random in $\{0,1\}^n$, the probability that $k$ happens

to equal $x \oplus y$ is exactly $2^{-n}$, which means that every string $y$ is output by $Y_x$ with probability $2^{-n}$. ■
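Here is the one-time pad in Python, together with a brute-force check (for a small $n$) that for every plaintext $x$, the ciphertext distribution $\{E_k(x)\}_k$ is uniform over $\{0,1\}^n$ - which is exactly the perfect secrecy argument above.

```python
from itertools import product
from collections import Counter

def xor(a, b):
    return ''.join('1' if u != v else '0' for u, v in zip(a, b))

E = xor          # E_k(x) = x XOR k
D = xor          # D_k(y) = y XOR k, so D_k(E_k(x)) = x

n = 4
keys = [''.join(t) for t in product('01', repeat=n)]
for x in ['0000', '1011']:
    assert all(D(k, E(k, x)) == x for k in keys)      # validity
    counts = Counter(E(k, x) for k in keys)
    assert all(c == 1 for c in counts.values())       # uniform over {0,1}^n
print("one-time pad: valid and perfectly secret (checked for n=4)")
```

The name is a warning as much as a description: the pad must be used one time only. Encrypting two messages with the same key $k$ reveals $E_k(x) \oplus E_k(x') = x \oplus x'$, which can leak a great deal of information.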

P The argument above is quite simple but is worth reading again. To understand why the one-time pad is perfectly secret, it is useful to envision it as a bipartite graph as we've done in Fig. 21.8. (In fact the encryption scheme of Fig. 21.8 is precisely the one-time pad for $n = 2$.) For every $n$, the one-time pad encryption scheme corresponds to a bipartite graph with $2^n$ vertices on the "left side" corresponding to the plaintexts in $\{0,1\}^n$ and $2^n$ vertices on the "right side" corresponding to the ciphertexts $\{0,1\}^n$. For every $x \in \{0,1\}^n$ and $k \in \{0,1\}^n$, we connect $x$ to the vertex $y = E_k(x)$ with an edge that we label with $k$. One can see that this is the complete bipartite graph, where every vertex on the left is connected to all vertices on the right. In particular this means that for every left vertex $x$, the distribution on the ciphertexts obtained by taking a random $k \in \{0,1\}^n$ and going to the neighbor of $x$ on the edge labeled $k$ is the uniform distribution over $\{0,1\}^n$. This ensures the perfect secrecy condition.

Figure 21.9: In the one time pad encryption scheme we encrypt a plaintext $x \in \{0,1\}^n$ with a key $k \in \{0,1\}^n$ by the ciphertext $x \oplus k$ where $\oplus$ denotes the bitwise XOR operation.

21.5 NECESSITY OF LONG KEYS

So, does Theorem 21.4 give the final word on cryptography, and mean that we can all communicate with perfect secrecy and live happily ever after? No it doesn't. While the one-time pad is efficient, and gives perfect secrecy, it has one glaring disadvantage: to communicate $n$ bits you need to store a key of length $n$. In contrast, practically used cryptosystems such as AES-128 have a short key of $128$ bits (i.e., $16$ bytes) that can be used to protect terabytes or more of communication! Imagine that we all needed to use the one time pad. If that was the case, then if you had to communicate with $m$ people, you would have to maintain (securely!) $m$ huge files that are each as long as the length of the maximum total communication you expect with that person. Imagine that every time you opened an account with Amazon, Google, or any other service, they would need to send you in the mail (ideally with a secure courier) a DVD full of random numbers, and every time you suspected a virus, you'd need to ask all these services for a fresh DVD. This doesn't sound so appealing.

This is not just a theoretical issue. The Soviets have used the one-time pad for their confidential communication since before the 1940's. In fact, even before Shannon's work, the U.S. intelligence already

knew in 1941 that the one-time pad is in principle "unbreakable" (see page 32 in the Venona document). However, it turned out that the hassle of manufacturing so many keys for all the communication took its toll on the Soviets and they ended up reusing the same keys for more than one message. They did try to use them for completely different receivers in the (false) hope that this wouldn't be detected. The Venona Project of the U.S. Army was founded in February 1943 by Gene Grabeel (see Fig. 21.10), a former home economics teacher from Madison Heights, Virginia, and Lt. Leonard Zubko. In October 1943, they had their breakthrough when it was discovered that the Soviets were reusing their keys. In the 37 years of its existence, the project has resulted in a treasure chest of intelligence, exposing hundreds of KGB agents and Russian spies in the U.S. and other countries, including Julius Rosenberg, Harry Gold, Klaus Fuchs, Alger Hiss, Harry Dexter White and many others.

Unfortunately it turns out that such long keys are necessary for perfect secrecy:

Theorem 21.5 — Perfect secrecy requires long keys. For every perfectly secret encryption scheme $(E, D)$ the length function $L$ satisfies $L(n) \le n$.

Proof Idea:
The idea behind the proof is illustrated in Fig. 21.11. We define a graph between the plaintexts and ciphertexts, where we put an edge between plaintext $x$ and ciphertext $y$ if there is some key $k$ such that $y = E_k(x)$. The degree of this graph is at most the number of potential keys. The fact that the degree is smaller than the number of plaintexts (and hence of ciphertexts) implies that there would be two plaintexts $x$ and $x'$ with different sets of neighbors, and hence the distribution of a ciphertext corresponding to $x$ (with a random key) will not be identical to the distribution of a ciphertext corresponding to $x'$. ⋆

Figure 21.10: Gene Grabeel, who founded the U.S. Russian SigInt program on 1 Feb 1943. Photo taken in 1942, see Page 7 in the Venona historical study.

Figure 21.11: An encryption scheme where the number of keys is smaller than the number of plaintexts corresponds to a bipartite graph where the degree is smaller than the number of vertices on the left side. Together with the validity condition this implies that there will be two left vertices $x, x'$ with non-identical neighborhoods, and hence the scheme does not satisfy perfect secrecy.

Proof of Theorem 21.5. Let $(E, D)$ be a valid encryption scheme with messages of length $L$ and key of length $n < L$. We will show that $(E, D)$ is not perfectly secret by providing two plaintexts $x_0, x_1 \in \{0,1\}^L$ such that the distributions $Y_{x_0}$ and $Y_{x_1}$ are not identical, where $Y_x$ is the distribution obtained by picking $k \sim \{0,1\}^n$ and outputting $E_k(x)$.

We choose $x_0 = 0^L$. Let $S_0 \subseteq \{0,1\}^*$ be the set of all ciphertexts that have nonzero probability of being output in $Y_{x_0}$. That is, $S_0 = \{y \mid \exists_{k\in\{0,1\}^n} y = E_k(x_0)\}$. Since there are only $2^n$ keys, we know that $|S_0| \le 2^n$.

We will show the following claim:

Claim I: There exists some $x_1 \in \{0,1\}^L$ and $k \in \{0,1\}^n$ such that $E_k(x_1) \notin S_0$.

Claim I implies that the string $E_k(x_1)$ has positive probability of being output by $Y_{x_1}$ and zero probability of being output by $Y_{x_0}$ and hence in particular $Y_{x_0}$ and $Y_{x_1}$ are not identical. To prove Claim I, just choose a fixed $k \in \{0,1\}^n$. By the validity condition, the map $x \mapsto E_k(x)$ is a one to one map of $\{0,1\}^L$ to $\{0,1\}^*$ and hence in particular the image of this map, which is the set $I_k = \{y \mid \exists_{x\in\{0,1\}^L} y = E_k(x)\}$, has size at least (in fact exactly) $2^L$. Since $|S_0| \le 2^n < 2^L$, this means that $|I_k| > |S_0|$ and so in particular there exists some string $y$ in $I_k \setminus S_0$. But by the definition of $I_k$ this means that there is some $x_1 \in \{0,1\}^L$ such that $E_k(x_1) \notin S_0$ which concludes the proof of Claim I and hence of Theorem 21.5. ■
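The counting argument can be run directly for toy parameters. The sketch below (our own illustration, with an arbitrarily chosen toy scheme) uses $n$-bit keys and $L = n + 1$ bit messages and finds, by brute force, a plaintext $x_1$ and key $k$ with $E_k(x_1) \notin S_0$, witnessing the failure of perfect secrecy exactly as in Claim I.

```python
from itertools import product

n, L = 3, 4                       # key shorter than message: n < L
keys = [''.join(t) for t in product('01', repeat=n)]
msgs = [''.join(t) for t in product('01', repeat=L)]

# A valid toy scheme: pad the key with a zero bit and XOR. It is easy
# to check validity; the search below shows it is NOT perfectly secret.
xor = lambda a, b: ''.join('1' if u != v else '0' for u, v in zip(a, b))
E = lambda k, x: xor(x, k + '0')

x0 = '0' * L
S0 = {E(k, x0) for k in keys}     # all possible encryptions of x0

# Claim I: some encryption of some x1 falls outside S0.
witness = next((x, k) for x in msgs for k in keys if E(k, x) not in S0)
print(witness)                    # e.g. x1 = '0001' with any key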

21.6 COMPUTATIONAL SECRECY

To sum up the previous episodes, we now know that:

• It is possible to obtain a perfectly secret encryption scheme with key length the same as the plaintext.

and

• It is not possible to obtain such a scheme with a key that is even a single bit shorter than the plaintext.

How does this mesh with the fact that, as we've already seen, people routinely use cryptosystems with a 16 byte (i.e., 128 bit) key but many terabytes of plaintext? The proof of Theorem 21.5 does in fact give a way to break all these cryptosystems, but an examination of this proof shows that it only yields an algorithm with time exponential in the length of the key. This motivates the following relaxation of perfect secrecy to a condition known as "computational secrecy". Intuitively, an encryption scheme is computationally secret if no polynomial time algorithm can break it. The formal definition is below:

Definition 21.6 — Computational secrecy. Let $(E, D)$ be a valid encryption scheme where for keys of length $n$, the plaintexts are of length $L(n)$ and the ciphertexts are of length $m(n)$. We say that $(E, D)$ is computationally secret if for every polynomial $p : \mathbb{N} \to \mathbb{N}$, and large

enough $n$, if $P$ is an $m(n)$-input and single output NAND-CIRC program of at most $p(n)$ lines, and $x_0, x_1 \in \{0,1\}^{L(n)}$, then

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[P(E_k(x_0))] - \mathbb{E}_{k\sim\{0,1\}^n}[P(E_k(x_1))] \right| < \frac{1}{p(n)} \;\;(21.4)$$

P Definition 21.6 requires a second or third read and some practice to truly understand. One excellent exercise to make sure you follow it is to see that if we allow $P$ to be an arbitrary function mapping $\{0,1\}^{m(n)}$ to $\{0,1\}$, and we replace the condition in (21.4) that the lefthand side is smaller than $\frac{1}{p(n)}$ with the condition that it is equal to $0$, then we get the perfect secrecy condition of Definition 21.3. Indeed if the distributions $E_k(x_0)$ and $E_k(x_1)$ are identical then applying any function $P$ to them we get the same expectation. On the other hand, if the two distributions above give a different probability for some element $y^* \in \{0,1\}^{m(n)}$, then the function $P(y)$ that outputs $1$ iff $y = y^*$ will have a different expectation under the former distribution than under the latter.

Definition 21.6 raises two natural questions:

• Is it strong enough to ensure that a computationally secret encryp- tion scheme protects the secrecy of messages that are encrypted with it?

• Is it weak enough that, unlike perfect secrecy, it is possible to obtain a computationally secret encryption scheme where the key is much smaller than the message?

To the best of our knowledge, the answer to both questions is Yes. This is just one example of a much broader phenomenon. We can use computational hardness to achieve many cryptographic goals, including some goals that have been dreamed about for millenia, and other goals that people have not even dared to imagine.

 Big Idea 27 Computational hardness is necessary and sufficient for almost all cryptographic applications.

Regarding the first question, it is not hard to show that if, for ex- ample, Alice uses a computationally secret encryption algorithm to encrypt either “attack” or “retreat” (each chosen with probability 592 introduction to theoretical computer science

$1/2$), then as long as she's restricted to polynomial-time algorithms, an adversary Eve will not be able to guess the message with probability better than, say, $0.51$, even after observing its encrypted form. (We omit the proof, but it is an excellent exercise for you to work it out on your own.)

To answer the second question we will show that under the same assumption we used for derandomizing BPP, we can obtain a computationally secret cryptosystem where the key is almost exponentially smaller than the plaintext.

21.6.1 Stream ciphers or the "derandomized one-time pad"
It turns out that if pseudorandom generators exist as in the optimal PRG conjecture, then there exists a computationally secret encryption scheme with keys that are much shorter than the plaintext. The construction below is known as a stream cipher, though perhaps a better name is the "derandomized one-time pad". It is widely used in practice with keys on the order of a few tens or hundreds of bits protecting many terabytes or even petabytes of communication. We start by recalling the notion of a pseudorandom generator, as defined in Definition 20.9. For this chapter, we will fix a special case of the definition:

Figure 21.12: In a stream cipher or "derandomized one-time pad" we use a pseudorandom generator $G : \{0,1\}^n \to \{0,1\}^L$ to obtain an encryption scheme with a key length of $n$ and plaintexts of length $L$. We encrypt the plaintext $x \in \{0,1\}^L$ with key $k \in \{0,1\}^n$ by the ciphertext $x \oplus G(k)$.

Definition 21.7 — Cryptographic pseudorandom generator. Let $L : \mathbb{N} \to \mathbb{N}$ be some function. A cryptographic pseudorandom generator with stretch $L(\cdot)$ is a polynomial-time computable function $G : \{0,1\}^* \to \{0,1\}^*$ such that:

• For every $n \in \mathbb{N}$ and $s \in \{0,1\}^n$, $|G(s)| = L(n)$.

• For every polynomial $p : \mathbb{N} \to \mathbb{N}$ and large enough $n$, if $C$ is a circuit of $L(n)$ inputs, one output, and at most $p(n)$ gates then

ℓ 푚 1 ∣푠∼{0,1} [퐶(퐺(푠)) = 1] − 푟∼{0,1} [퐶(푟) = 1]∣ < . 푝(푛) In this chapter we will call a cryptographic pseudorandom gener- ator simply a pseudorandom generator or PRG for short. The optimal PRG conjecture of Section 20.4.2 implies that there is a pseudoran- dom generator that can “fool” circuits of exponential size and where the gap in probabilities is at most one over an exponential quantity. Since exponential grow faster than every polynomial, the optimal PRG conjecture implies the following:

The crypto PRG conjecture: For every $a \in \mathbb{N}$, there is a cryptographic pseudorandom generator with $L(n) = n^a$.

The crypto PRG conjecture is a weaker conjecture than the optimal PRG conjecture, but it too (as we will see) is still stronger than the conjecture that $\mathbf{P} \neq \mathbf{NP}$.

Theorem 21.8 — Derandomized one-time pad. Suppose that the crypto PRG conjecture is true. Then for every constant $a \in \mathbb{N}$ there is a computationally secret encryption scheme $(E, D)$ with plaintext length $L(n)$ at least $n^a$.

Proof Idea:
The proof is illustrated in Fig. 21.12. We simply take the one-time pad on $L$ bit plaintexts, but replace the key with $G(k)$ where $k$ is a string in $\{0,1\}^n$ and $G : \{0,1\}^n \to \{0,1\}^L$ is a pseudorandom generator. Since the one time pad cannot be broken, an adversary that breaks the derandomized one-time pad can be used to distinguish between the output of the pseudorandom generator and the uniform distribution. ⋆

Proof of Theorem 21.8. Let $G : \{0,1\}^n \to \{0,1\}^L$ for $L = n^a$ be the restriction to input length $n$ of the pseudorandom generator whose existence we are guaranteed from the crypto PRG conjecture. We now define our encryption scheme as follows: given key $k \in \{0,1\}^n$ and plaintext $x \in \{0,1\}^L$, the encryption $E_k(x)$ is simply $x \oplus G(k)$. To decrypt a string $y \in \{0,1\}^L$ we output $y \oplus G(k)$. This is a valid encryption since $G$ is computable in polynomial time and $(x \oplus G(k)) \oplus G(k) = x \oplus (G(k) \oplus G(k)) = x$ for every $x \in \{0,1\}^L$.

Computational secrecy follows from the condition of a pseudorandom generator. Suppose, towards a contradiction, that there is a polynomial $p$, a NAND-CIRC program $Q$ of at most $p(L)$ lines, and $x, x' \in \{0,1\}^{L(n)}$ such that

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[Q(E_k(x))] - \mathbb{E}_{k\sim\{0,1\}^n}[Q(E_k(x'))] \right| > \frac{1}{p(L)}. \;\;(21.6)$$

(We use here the simple fact that for a $\{0,1\}$-valued random variable $X$, $\Pr[X = 1] = \mathbb{E}[X]$.) By the definition of our encryption scheme, this means that

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)}. \;\;(21.7)$$

Now (as we saw in the security analysis of the one-time pad), for every two strings $x, x' \in \{0,1\}^L$, the distributions $r \oplus x$ and $r \oplus x'$ are identical, where $r \sim \{0,1\}^L$. Hence

$$\mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x)] = \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x')]. \;\;(21.8)$$

By plugging (21.8) into (21.7) we can derive that

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x)] + \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)}. \;\;(21.9)$$

(Please make sure that you can see why this is true.) Now we can use the triangle inequality $|A + B| \le |A| + |B|$ for every two numbers $A, B$, applying it for $A = \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x)]$ and $B = \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x')]$ to derive

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x)] \right| + \left| \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)}. \;\;(21.10)$$

In particular, either the first term or the second term of the lefthand side of (21.10) must be at least $\frac{1}{2p(L)}$. Let us assume the first case holds (the second case is analyzed in exactly the same way). Then we get that

$$\left| \mathbb{E}_{k\sim\{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r\sim\{0,1\}^L}[Q(r \oplus x)] \right| > \frac{1}{2p(L)}. \;\;(21.11)$$

But if we now define the NAND-CIRC program $P_x$ that on input $r \in \{0,1\}^L$ outputs $Q(r \oplus x)$ then (since XOR of $L$ bits can be computed in $O(L)$ lines), we get that $P_x$ has $p(L) + O(L)$ lines and by (21.11) it can distinguish between an input of the form $G(k)$ and an input of the form $r \sim \{0,1\}^L$ with advantage better than $\frac{1}{2p(L)}$. Since a polynomial is dominated by an exponential, if we make $L$ large enough, this will contradict the $(2^{\delta n}, 2^{-\delta n})$ security of the pseudorandom generator $G$. ■
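Below is a Python sketch of the derandomized one-time pad. Since no function is proven to be a cryptographic PRG, we use SHA-256 in "counter mode" as a heuristic stand-in for $G$ - a common practical construction, but its pseudorandomness is an assumption, not a theorem, and the key and message below are arbitrary placeholders.

```python
import hashlib

def G(seed: bytes, nbytes: int) -> bytes:
    """Heuristic stand-in for a PRG: expand the seed to nbytes by
    hashing seed || counter. Its pseudorandomness is an assumption."""
    out, counter = b'', 0
    while len(out) < nbytes:
        out += hashlib.sha256(seed + counter.to_bytes(8, 'big')).digest()
        counter += 1
    return out[:nbytes]

def E(key: bytes, plaintext: bytes) -> bytes:
    pad = G(key, len(plaintext))            # "derandomized pad" G(k)
    return bytes(a ^ b for a, b in zip(plaintext, pad))

D = E   # decryption is the same XOR with G(key)

key = b'short 16B secret'                   # 16-byte key, long message
msg = b'attack at dawn' * 1000
assert D(key, E(key, msg)) == msg
```

As with the one-time pad itself, the pad $G(k)$ must never be reused for two messages; practical stream ciphers mix a fresh nonce into the seed for exactly this reason.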

R Remark 21.9 — Stream ciphers in practice. The two most widely used forms of (private key) encryption schemes in practice are stream ciphers and block ciphers. (To make things more confusing, a block cipher is always used in some mode of operation and some of these modes effectively turn a block cipher into a stream cipher.) A block cipher can be thought of as a sort of a "random invertible map" from $\{0,1\}^n$ to $\{0,1\}^n$, and can be used to construct a pseudorandom generator and from it a stream cipher, or to encrypt data directly using other modes of operations. There are a great many other security notions and considerations for encryption schemes beyond computational secrecy. Many of those involve handling scenarios such as chosen plaintext, man in the middle, and chosen ciphertext attacks, where the adversary is not just merely a passive eavesdropper but can influence the communication in some way. While this chapter is

meant to give you some taste of the ideas behind cryptography, there is much more to know before applying it correctly to obtain secure applications, and a great many people have managed to get it wrong.

21.7 COMPUTATIONAL SECRECY AND NP

We've also mentioned before that an efficient algorithm for NP could be used to break all cryptography. We now give an example of how this can be done:

Theorem 21.10 — Breaking encryption using NP algorithm. If $\mathbf{P} = \mathbf{NP}$ then there is no computationally secret encryption scheme with $L(n) > n$.

Furthermore, for every valid encryption scheme $(E, D)$ with $L(n) > n + 100$ there is a polynomial $p$ such that for every large enough $n$ there exist $x_0, x_1 \in \{0,1\}^{L(n)}$ and a $p(n)$-line NAND-CIRC program EVE s.t.

$$\Pr_{i\sim\{0,1\}, k\sim\{0,1\}^n}[\text{EVE}(E_k(x_i)) = i] \ge 0.99. \;\;(21.12)$$

Note that the "furthermore" part is extremely strong. It means that if the plaintext is even a little bit larger than the key, then we can already break the scheme in a very strong way. That is, there will be a pair of messages $x_0, x_1$ (think of $x_0$ as "sell" and $x_1$ as "buy") and an efficient strategy for Eve such that if Eve gets a ciphertext $y$ then she will be able to tell whether $y$ is an encryption of $x_0$ or $x_1$ with probability very close to $1$. (We model breaking the scheme as Eve outputting $0$ or $1$ corresponding to whether the message sent was $x_0$ or $x_1$. Note that we could have just as well modified Eve to output $x_0$ instead of $0$ and $x_1$ instead of $1$. The key point is that a priori Eve only had a 50/50 chance of guessing whether Alice sent $x_0$ or $x_1$ but after seeing the ciphertext this chance increases to better than 99/1.) The condition $\mathbf{P} = \mathbf{NP}$ can be relaxed to $\mathbf{NP} \subseteq \mathbf{BPP}$ and even the weaker condition $\mathbf{NP} \subseteq \mathbf{P}_{/poly}$ with essentially the same proof.

Proof Idea:
The proof follows along the lines of Theorem 21.5 but this time paying attention to the computational aspects. If $\mathbf{P} = \mathbf{NP}$ then for every plaintext $x$ and ciphertext $y$, we can efficiently tell whether there exists $k \in \{0,1\}^n$ such that $E_k(x) = y$. So, to prove this result we need to show that if the plaintexts are long enough, there would exist a pair $x_0, x_1$ such that the probability that a random encryption of $x_1$ also is a valid encryption of $x_0$ will be very small. The details of how to show this are below. ⋆

Proof of Theorem 21.10. We focus on showing only the "furthermore" part since it is the more interesting and the other part follows by essentially the same proof. Suppose that $(E, D)$ is such an encryption, let $n$ be large enough, and let $x_0 = 0^{L(n)}$. For every $x \in \{0,1\}^{L(n)}$ we define $S_x$ to be the set of all valid encryptions of $x$. That is $S_x = \{ y \mid \exists_{k \in \{0,1\}^n}\; y = E_k(x) \}$. As in the proof of Theorem 21.5, since there are $2^n$ keys $k$, $|S_x| \leq 2^n$ for every $x \in \{0,1\}^{L(n)}$.

We denote by $S_0$ the set $S_{x_0}$. We define our algorithm EVE to output $0$ on input $y \in \{0,1\}^*$ if $y \in S_0$ and to output $1$ otherwise. This can be implemented in polynomial time if P = NP, since the key $k$ can serve the role of an efficiently verifiable solution. (Can you see why?) Clearly $\Pr[\text{EVE}(E_k(x_0)) = 0] = 1$ and so in the case that EVE gets an encryption of $x_0$ then she guesses correctly with probability $1$. The remainder of the proof is devoted to showing that there exists $x_1 \in \{0,1\}^{L(n)}$ such that $\Pr[\text{EVE}(E_k(x_1)) = 0] \leq 0.01$, which will conclude the proof by showing that EVE guesses wrongly with probability at most $\frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 0.01 < 0.01$.

Consider now the following probabilistic experiment (which we define solely for the sake of analysis). We consider the sample space of choosing $x$ uniformly in $\{0,1\}^{L(n)}$ and define the random variable $Z_k(x)$ to equal $1$ if and only if $E_k(x) \in S_0$. For every $k$, the map $x \mapsto E_k(x)$ is one-to-one, which means that the probability that $Z_k = 1$ is equal to the probability that $x \in E_k^{-1}(S_0)$ which is $\frac{|S_0|}{2^{L(n)}}$. So by the linearity of expectation $\mathbb{E}[\sum_{k \in \{0,1\}^n} Z_k] \leq \frac{2^n |S_0|}{2^{L(n)}} \leq \frac{2^{2n}}{2^{L(n)}}$.

We will now use the following extremely simple but useful fact known as the averaging principle (see also Lemma 18.10): for every random variable $Z$, if $\mathbb{E}[Z] = \mu$, then with positive probability $Z \leq \mu$. (Indeed, if $Z > \mu$ with probability one, then the expected value of $Z$ will have to be larger than $\mu$, just like you can't have a class in which all students got A or A- and yet the overall average is B+.) In our case it means that with positive probability $\sum_{k \in \{0,1\}^n} Z_k \leq \frac{2^{2n}}{2^{L(n)}}$. In other words, there exists some $x_1 \in \{0,1\}^{L(n)}$ such that $\sum_{k \in \{0,1\}^n} Z_k(x_1) \leq \frac{2^{2n}}{2^{L(n)}}$. Yet this means that if we choose a random $k \sim \{0,1\}^n$, then the probability that $E_k(x_1) \in S_0$ is at most $\frac{1}{2^n} \cdot \frac{2^{2n}}{2^{L(n)}} = 2^{n - L(n)}$. So, in particular if we have an algorithm EVE that outputs $0$ if $x \in S_0$ and outputs $1$ otherwise, then $\Pr[\text{EVE}(E_k(x_0)) = 0] = 1$ and $\Pr[\text{EVE}(E_k(x_1)) = 0] \leq 2^{n - L(n)}$ which will be smaller than $2^{-10} < 0.01$ if $L(n) \geq n + 10$. ■
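To make the brute-force idea concrete, here is a small Python sketch in the spirit of the proof's EVE. This is only an illustration under toy assumptions: the cipher E below (a key stretched by repetition and XORed with the plaintext) is a hypothetical stand-in, and the exhaustive key search stands in for the polynomial-time search that P = NP would provide, with the key playing the role of the efficiently verifiable solution.

    import itertools

    def xor_bytes(a, b):
        return bytes(u ^ v for u, v in zip(a, b))

    def E(key, x):
        # Toy "encryption": stretch the key by repetition and XOR.
        # Deliberately insecure; it only makes EVE's search concrete.
        stretched = (key * (len(x) // len(key) + 1))[:len(x)]
        return xor_bytes(stretched, x)

    def eve(y, x0, key_len):
        # Output 0 if y is a valid encryption of x0 (i.e., y is in S_0),
        # and 1 otherwise. Brute force here; polynomial time if P = NP.
        for k in itertools.product(range(256), repeat=key_len):
            if E(bytes(k), x0) == y:
                return 0
        return 1

    y = E(b'K', b'sell')                 # toy demo with a 1-byte key
    assert eve(y, b'sell', 1) == 0 and eve(y, b'buy!', 1) == 1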

In retrospect Theorem 21.10 is perhaps not surprising. After all, as we've mentioned before it is known that the Optimal PRG conjecture (which is the basis for the derandomized one-time pad encryption) is false if P = NP (and in fact even if NP ⊆ BPP or even NP ⊆ P/poly).

21.8 PUBLIC KEY CRYPTOGRAPHY

People have been dreaming about heavier-than-air flight since at least the days of Leonardo Da Vinci (not to mention Icarus from Greek mythology). Jules Verne wrote with rather insightful details about going to the moon in 1865. But, as far as I know, in all the thousands of years people have been using secret writing, until about 50 years ago no one has considered the possibility of communicating securely without first exchanging a shared secret key.

Yet in the late 1960's and early 1970's, several people started to question this "common wisdom". Perhaps the most surprising of these visionaries was an undergraduate student at Berkeley named Ralph Merkle. In the fall of 1974 Merkle wrote in a project proposal for his course that while "it might seem intuitively obvious that if two people have never had the opportunity to prearrange an encryption method, then they will be unable to communicate securely over an insecure channel… I believe it is false". The project proposal was rejected by his professor as "not good enough". Merkle later submitted a paper to the Communications of the ACM where he apologized for the lack of references since he was unable to find any mention of the problem in the scientific literature, and the only source where he saw the problem even raised was in a story. The paper was rejected with the comment that "Experience shows that it is extremely dangerous to transmit key information in the clear." Merkle showed that one can design a protocol where Alice and Bob can use $T$ invocations of a hash function to exchange a key, but an adversary (in the random oracle model, though he of course didn't use this name) would need roughly $T^2$ invocations to break it. He conjectured that it may be possible to obtain such protocols where breaking is exponentially harder than using them, but could not think of any concrete way to do so.

We only found out much later that in the late 1960's, a few years before Merkle, James Ellis of the British Intelligence agency GCHQ was having similar thoughts. His curiosity was spurred by an old World-War II manuscript that suggested the following way that two people could communicate securely over a phone line. Alice would inject noise to the line, Bob would relay his messages, and then Alice would subtract the noise to get the signal. The idea is that an adversary over the line sees only the sum of Alice's and Bob's signals, and doesn't know what came from what. This got James Ellis thinking whether it would be possible to achieve something like that digitally. As Ellis later recollected, in 1970 he realized that in principle this should be possible, since he could think of a hypothetical black box $B$ that on input a "handle" $\alpha$ and plaintext $x$ would give a "ciphertext" $y$, and that there would be a secret key $\beta$ corresponding to $\alpha$, such that feeding $\beta$ and $y$ to the box would recover $x$. However, Ellis had no idea how to actually instantiate this box. He and others kept giving this question as a puzzle to bright new recruits until one of them, Clifford Cocks, came up in 1973 with a candidate solution loosely based on the factoring problem; in 1974 another GCHQ recruit, Malcolm Williamson, came up with a solution using modular exponentiation.

But among all those thinking of public key cryptography, probably the people who saw the furthest were two researchers at Stanford, Whit Diffie and Martin Hellman. They realized that with the advent of electronic communication, cryptography would find new applications beyond the military domain of spies and submarines, and they understood that in this new world of many users and point to point communication, cryptography would need to scale up. Diffie and Hellman envisioned an object which we now call a "trapdoor permutation" though they called it a "one way trapdoor function" or sometimes simply "public key encryption". Though they didn't have full formal definitions, their idea was that this is an injective function that is easy (e.g., polynomial-time) to compute but hard (e.g., exponential-time) to invert. However, there is a certain trapdoor, knowledge of which would allow polynomial time inversion. Diffie and Hellman argued that using such a trapdoor function, it would be possible for Alice and Bob to communicate securely without ever having exchanged a secret key. But they didn't stop there. They realized that protecting the integrity of communication is no less important than protecting its secrecy. Thus they imagined that Alice could "run encryption in reverse" in order to certify or sign messages.

At this point, Diffie and Hellman were in a position not unlike physicists who predicted that a certain particle should exist but without any experimental verification. Luckily they met Ralph Merkle, and his ideas about a probabilistic key exchange protocol, together with a suggestion from their Stanford colleague John Gill, inspired them to come up with what today is known as the Diffie-Hellman Key Exchange (which unbeknownst to them was found two years earlier at GCHQ by Malcolm Williamson). They published their paper "New Directions in Cryptography" in 1976, and it is considered to have brought about the birth of modern cryptography. The Diffie-Hellman Key Exchange is still widely used today for secure communication. However, it still fell short of providing Diffie and Hellman's elusive trapdoor function. This was done the next year by Rivest, Shamir and Adleman who came up with the RSA trapdoor function, which through the framework of Diffie and Hellman yielded not just encryption but also signatures. (A close variant of the RSA function was discovered earlier by Clifford Cocks at GCHQ, though as far as I can tell Cocks, Ellis and Williamson did not realize the application to digital signatures.) From this point on began a flurry of advances in cryptography which hasn't died down to this day.

21.8.1 Defining public key encryption
A public key encryption scheme consists of a triple of algorithms:

• The key generation algorithm, which we denote by $KeyGen$ or $KG$ for short, is a randomized algorithm that outputs a pair of strings $(e, d)$ where $e$ is known as the public (or encryption) key, and $d$ is known as the private (or decryption) key. The key generation algorithm gets as input $1^n$ (i.e., a string of ones of length $n$). We refer to $n$ as the security parameter of the scheme. The bigger we make $n$, the more secure the encryption will be, but also the less efficient it will be.

• The encryption algorithm, which we denote by $E$, takes the encryption key $e$ and a plaintext $x$, and outputs the ciphertext $y = E_e(x)$.

• The decryption algorithm, which we denote by $D$, takes the decryption key $d$ and a ciphertext $y$, and outputs the plaintext $x = D_d(y)$.

Figure 21.13: Top left: Ralph Merkle, Martin Hellman and Whit Diffie, who together came up in 1976 with the concept of public key encryption and a key exchange protocol. Bottom left: Adi Shamir, Ron Rivest, and Leonard Adleman who, following Diffie and Hellman's paper, discovered the RSA function that can be used for public key encryption and digital signatures. Interestingly, one can see the equation P = NP on the blackboard behind them. Right: John Gill, who was the first person to suggest to Diffie and Hellman that they use modular exponentiation as an easy-to-compute but hard-to-invert function.

We now make this a formal definition:

Definition 21.11 — Public Key Encryption. A computationally secret public key encryption with plaintext length $L : \mathbb{N} \to \mathbb{N}$ is a triple of randomized polynomial-time algorithms $(KG, E, D)$ that satisfy the following conditions:

• For every $n$, if $(e, d)$ is output by $KG(1^n)$ with positive probability, and $x \in \{0,1\}^{L(n)}$, then $D_d(E_e(x)) = x$ with probability one.

• For every polynomial $p$, and sufficiently large $n$, if $P$ is a NAND-CIRC program of at most $p(n)$ lines then for every $x, x' \in \{0,1\}^{L(n)}$, $|\mathbb{E}[P(e, E_e(x))] - \mathbb{E}[P(e, E_e(x'))]| < 1/p(n)$, where this probability is taken over the coins of $KG$ and $E$.

Figure 21.14: In a public key encryption, Alice generates a private/public keypair $(e, d)$, publishes $e$ and keeps $d$ secret. To encrypt a message for Alice, one only needs to know $e$. To decrypt it we need to know $d$.

Definition 21.11 allows $E$ and $D$ to be randomized algorithms. In fact, it turns out that it is necessary for $E$ to be randomized to obtain computational secrecy. It also turns out that, unlike the private key case, we can transform a public-key encryption that works for messages that are only one bit long into a public-key encryption scheme that can encrypt arbitrarily long messages, and in particular messages that are longer than the key. In particular this means that we cannot obtain a perfectly secret public-key encryption scheme even for one-bit long messages (since it would imply a perfectly secret public-key, and hence in particular private-key, encryption with messages longer than the key).

We will not give full constructions for public key encryption schemes in this chapter, but will mention some of the ideas that underlie the most widely used schemes today. These generally belong to one of two families:

• Group theoretic constructions based on problems such as integer factor- ing and the discrete logarithm over finite fields or elliptic curves.

• Lattice/coding based constructions based on problems such as the closest vector in a lattice or bounded distance decoding.

Group-theory based encryptions such as the RSA cryptosystem, the Diffie-Hellman protocol, and Elliptic-Curve Cryptography are currently more widely implemented. But the lattice/coding schemes have recently been on the rise, particularly because the known group theoretic encryption schemes can be broken by quantum computers, which we discuss in Chapter 23.

21.8.2 Diffie-Hellman key exchange
As just one example of how public key encryption schemes are constructed, let us now describe the Diffie-Hellman key exchange. We describe the Diffie-Hellman protocol at a somewhat informal level, without presenting a full security analysis.

The computational problem underlying the Diffie-Hellman protocol is the discrete logarithm problem. Let's suppose that $g$ is some integer. We can compute the map $x \mapsto g^x$ and also its inverse $y \mapsto \log_g y$. (For example, we can compute a logarithm by binary search: start with some interval $[x_{min}, x_{max}]$ that is guaranteed to contain $\log_g y$. We can then test whether the interval's midpoint $x_{mid}$ satisfies $g^{x_{mid}} > y$, and based on that halve the size of the interval.)

However, suppose now that we use modular arithmetic and work modulo some prime number $p$. If $p$ has $n$ binary digits and $g$ is in $[p]$ then we can compute the map $x \mapsto g^x \bmod p$ in time polynomial in $n$. (This is not trivial, and is a great exercise for you to work out; as a hint, start by showing that one can compute the map $k \mapsto g^{2^k} \bmod p$ using $k$ modular multiplications modulo $p$; if you're stumped, you can look up this Wikipedia entry.) On the other hand, because of the "wraparound" property of modular arithmetic, we cannot run binary search to find the inverse of this map (known as the discrete logarithm).
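Here is a sketch of the repeated-squaring idea behind that hint, in Python (the function name mod_exp is ours, for illustration; Python's built-in three-argument pow computes the same thing):

    def mod_exp(g, x, p):
        # Compute g**x mod p using O(number of bits of x) multiplications.
        result, base = 1, g % p
        while x > 0:
            if x & 1:                         # current bit of x is set:
                result = (result * base) % p  # multiply in g^(2^k)
            base = (base * base) % p          # square: g^(2^k) -> g^(2^(k+1))
            x >>= 1
        return result

    assert mod_exp(3, 5, 7) == pow(3, 5, 7) == 5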

In fact, there is no known polynomial-time algorithm for computing the discrete logarithm map $(g, x, p) \mapsto \log_g x \bmod p$, where we define $\log_g x \bmod p$ as the number $a \in [p]$ such that $g^a = x \bmod p$. The Diffie-Hellman protocol for Bob to send a message to Alice is as follows:

• Alice: Chooses $p$ to be a random $n$ bit long prime (which can be done by choosing random numbers and running a primality testing algorithm on them), and $g$ and $a$ at random in $[p]$. She sends to Bob the triple $(p, g, g^a \bmod p)$.

• Bob: Given the triple $(p, g, h)$, Bob sends a message $x \in \{0,1\}^L$ to Alice by choosing $b$ at random in $[p]$, and sending to Alice the pair $(g^b \bmod p, rep(h^b \bmod p) \oplus x)$ where $rep : [p] \to \{0,1\}^L$ is some "representation function" that maps $[p]$ to $\{0,1\}^L$. (The function $rep$ does not need to be one-to-one and you can think of $rep(z)$ as simply outputting $L$ of the bits of $z$ in the natural binary representation; it does need to satisfy certain technical conditions which we omit in this description.)

• Alice: Given $(g', z)$, Alice recovers $x$ by outputting $rep(g'^a \bmod p) \oplus z$.

The correctness of the protocol follows from the simple fact that $(g^a)^b = (g^b)^a$ for every $g, a, b$, and this still holds if we work modulo $p$. Its security relies on the computational assumption that computing this map is hard, even in a certain "average case" sense (this computational assumption is known as the Decisional Diffie-Hellman assumption). The Diffie-Hellman key exchange protocol can be thought of as a public key encryption where Alice's first message is the public key, and Bob's message is the encryption.

One can think of the Diffie-Hellman protocol as being based on a "trapdoor pseudorandom generator" whereby the triple $g^a, g^b, g^{ab}$ looks "random" to someone that doesn't know $a$, but someone that does know $a$ can see that raising the second element to the $a$-th power yields the third element. The Diffie-Hellman protocol can be described abstractly in the context of any finite Abelian group for which we can efficiently compute the group operation. It has been implemented on other groups than numbers modulo $p$, and in particular Elliptic Curve Cryptography (ECC) is obtained by basing the Diffie-Hellman protocol on elliptic curve groups, which gives some practical advantages. Another common group-theoretic basis for key-exchange/public key encryption protocols is the RSA function. A big disadvantage of Diffie-Hellman (both the modular arithmetic and elliptic curve variants) and RSA is that both schemes can be broken in polynomial time by a

quantum computer. We will discuss quantum computing later in this course.
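To make the three steps above concrete, here is a toy Python sketch of the protocol. The 64-bit modulus, the generator choice, and the simplistic rep map are illustrative assumptions only; real deployments use primes of thousands of bits (or elliptic curve groups) and representation functions satisfying the technical conditions mentioned above.

    import secrets

    P = 0xFFFFFFFFFFFFFFC5   # 2**64 - 59, a prime; far too small for real security
    G = 5                    # toy choice of g

    def rep(z, bits=32):
        # Map a group element to an L-bit string (here: its low-order bits).
        return z & ((1 << bits) - 1)

    # Alice: publish (P, G, h) where h = G^a mod P, and keep a secret.
    a = secrets.randbelow(P - 2) + 1
    h = pow(G, a, P)

    # Bob: encrypt a 32-bit message x as (G^b mod P, rep(h^b mod P) XOR x).
    x = 0xDEADBEEF
    b = secrets.randbelow(P - 2) + 1
    ciphertext = (pow(G, b, P), rep(pow(h, b, P)) ^ x)

    # Alice: recover x, using the fact that (G^b)^a = (G^a)^b mod P.
    u, z = ciphertext
    assert rep(pow(u, a, P)) ^ z == x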

21.9 OTHER SECURITY NOTIONS

There is a great deal to cryptography beyond just encryption schemes, and beyond the notion of a passive adversary. A central objective is integrity or authentication: protecting communications from being modified by an adversary. Integrity is often more fundamental than secrecy: whether it is a software update or viewing the news, you might often not care about the communication being secret as much as that it indeed came from its claimed source. Digital signature schemes are the analog of public key encryption for authentication, and are widely used (in particular as the basis for public key certificates) to provide a foundation of trust in the digital world.

Similarly, even for encryption, we often need to ensure security against active attacks, and so notions such as non-malleability and adaptive chosen ciphertext security have been proposed. An encryption scheme is only as secure as the secret key, and mechanisms to make sure the key is generated properly, and is protected against refresh or even compromise (i.e., forward secrecy) have been studied as well. Hopefully this chapter provides you with some appreciation for cryptography as an intellectual field, but does not imbue you with a false self confidence in implementing it.

Cryptographic hash functions are another widely used tool with a variety of uses, including extracting randomness from high entropy sources, achieving hard-to-forge short "digests" of files, protecting passwords, and much more.

21.10 MAGIC

Beyond encryption and signature schemes, cryptographers have managed to obtain objects that truly seem paradoxical and "magical". We briefly discuss some of these objects. We do not give any details, but hopefully this will spark your curiosity to find out more.

21.10.1 Zero knowledge proofs
On October 31, 1903, the mathematician Frank Nelson Cole gave an hourlong lecture to a meeting of the American Mathematical Society where he did not speak a single word. Rather, he calculated on the board the value $2^{67} - 1$, which is equal to $147{,}573{,}952{,}589{,}676{,}412{,}927$, and then showed that this number is equal to $193{,}707{,}721 \times 761{,}838{,}257{,}287$. Cole's proof showed that $2^{67} - 1$ is not a prime, but it also revealed additional information, namely its actual factors. This is often the case with proofs: they teach us more than just the validity of the statements.
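Today, of course, anyone can check Cole's arithmetic in an instant:

    # Verifying Cole's 1903 factorization of 2**67 - 1:
    assert 2**67 - 1 == 147573952589676412927
    assert 193707721 * 761838257287 == 2**67 - 1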

In Zero Knowledge Proofs we try to achieve the opposite effect. We want a proof for a statement $X$ where we can rigorously show that the proof reveals absolutely no additional information about $X$ beyond the fact that it is true. This turns out to be an extremely useful object for a variety of tasks including authentication, secure protocols, voting, anonymity in cryptocurrencies, and more. Constructing these objects relies on the theory of NP completeness. Thus this theory that originally was designed to give a negative result (show that some problems are hard) ended up yielding positive applications, enabling us to achieve tasks that were not possible otherwise.

21.10.2 Fully homomorphic encryption
Suppose that we are given a bit-by-bit encryption $E_k(x_0), \ldots, E_k(x_{n-1})$ of a string. By design, these ciphertexts are supposed to be "completely unscrutable" and we should not be able to extract any information about the $x_i$'s from them. However, already in 1978, Rivest, Adleman and Dertouzos observed that this does not imply that we could not manipulate these encryptions. For example, it turns out the security of an encryption scheme does not immediately rule out the ability to take a pair of encryptions $E_k(a)$ and $E_k(b)$ and compute from them $E_k(a \text{ NAND } b)$ without knowing the secret key $k$. But do there exist encryption schemes that allow such manipulations? And if so, is this a bug or a feature?

Rivest et al already showed that such encryption schemes could be immensely useful, and their utility has only grown in the age of cloud computing. After all, if we can compute NAND then we can use this to run any algorithm $P$ on the encrypted data, and map $E_k(x_0), \ldots, E_k(x_{n-1})$ to $E_k(P(x_0, \ldots, x_{n-1}))$. For example, a client could store their secret data $x$ in encrypted form on the cloud, and have the cloud provider perform all sorts of computation on these data without ever revealing to the provider the private key, and so without the provider ever learning any information about the secret data.

The question of the existence of such a scheme took a much longer time to resolve. Only in 2009 did Craig Gentry give the first construction of an encryption scheme that allows the computation of a universal basis of gates on the data (known as a Fully Homomorphic Encryption scheme in crypto parlance). Gentry's scheme left much to be desired in terms of efficiency, and improving upon it has been the focus of an intensive research program that has already seen significant improvements.
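As a trivial warm-up to the idea of computing on ciphertexts, note that even the one-time pad lets anyone flip a plaintext bit by flipping the corresponding ciphertext bit, without knowing the key (a toy observation of ours, not a fully homomorphic scheme):

    k, x = 0b1011, 0b0110
    c = k ^ x                  # one-time pad ciphertext E_k(x)
    c_flipped = c ^ 0b0001     # manipulate the ciphertext only
    assert (c_flipped ^ k) == (x ^ 0b0001)  # decrypts to x with its last bit flipped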

21.10.3 Multiparty secure computation
Cryptography is about enabling mutually distrusting parties to achieve a common goal. Perhaps the most general primitive achieving this objective is secure multiparty computation. The idea in secure multiparty computation is that $n$ parties interact together to compute some function $F(x_0, \ldots, x_{n-1})$ where $x_i$ is the private input of the $i$-th party. The crucial point is that there is no commonly trusted party or authority and that nothing is revealed about the secret data beyond the function's output. One example is an election protocol where only the total vote count is revealed, with the privacy of the individual voters protected, but without having to trust any authority to either count the votes correctly or to keep information confidential. Another example is implementing a second price (aka Vickrey) auction where $n - 1$ parties submit bids to an item owned by the $n$-th party, and the item goes to the highest bidder but at the price of the second highest bid. Using secure multiparty computation we can implement a second price auction in a way that will ensure the secrecy of the numerical values of all bids (including even the top one) except the second highest one, and the secrecy of the identity of all bidders (including even the second highest bidder) except the top one. We emphasize that such a protocol requires no trust even in the auctioneer itself, who will also not learn any additional information. Secure multiparty computation can be used even for computing randomized processes, with one example being playing Poker over the net without having to trust any party for correct shuffling of cards or not revealing the information.
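To give a small taste of the tricks involved, here is a sketch of additive secret sharing, one simplified ingredient of such protocols (this is not a full secure computation protocol; the modulus and the three-party setting are illustrative assumptions):

    import secrets

    P = 2**61 - 1                # a prime modulus (toy choice)
    inputs = [12, 7, 30]         # each party's private input

    # Each party splits its input into 3 random shares that sum to it mod P,
    # and hands one share to each party (including itself).
    shares = []
    for v in inputs:
        r = [secrets.randbelow(P) for _ in range(2)]
        shares.append(r + [(v - sum(r)) % P])

    # Party j locally adds the j-th shares it received; combining the three
    # partial sums reveals the total vote count, but no individual input.
    partial = [sum(shares[i][j] for i in range(3)) % P for j in range(3)]
    assert sum(partial) % P == sum(inputs) % P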

Chapter Recap

• We can formally define the notion of security of an encryption scheme.
• Perfect secrecy ensures that an adversary does not learn anything about the plaintext from the ciphertext, regardless of their computational powers.
• The one-time pad is a perfectly secret encryption with the length of the key equaling the length of the message. No perfectly secret encryption can have key shorter than the message.
• Computational secrecy can be as good as perfect secrecy since it ensures that the advantage that computationally bounded adversaries gain from observing the ciphertext is exponentially small. If the optimal PRG conjecture is true then there exists a computationally secret encryption scheme with messages that can be (almost) exponentially bigger than the key.
• There are many cryptographic tools that go well beyond private key encryption. These include public key encryption, digital signatures and hash functions, as well as more "magical" tools such as multiparty secure computation, fully homomorphic encryption, zero knowledge proofs, and many others.

21.11 EXERCISES

21.12 BIBLIOGRAPHICAL NOTES

Much of this text is taken from my lecture notes on cryptography.

Shannon's manuscript was written in 1945 but was classified, and a partial version was only published in 1949. Still it has revolutionized cryptography, and is the forerunner to much of what followed.

The Venona project's history is described in this document. Aside from Grabeel and Zubko, credit for the discovery that the Soviets were reusing keys is shared by Lt. Richard Hallock, Carrie Berry, Frank Lewis, and Lt. Karl Elmquist, and there are others that have made important contributions to this project. See pages 27 and 28 in the document.

In a 1955 letter to the NSA that only recently came forward, John Nash proposed an "unbreakable" encryption scheme. He wrote "I hope my handwriting, etc. do not give the impression I am just a crank or circle-squarer…. The significance of this conjecture [that certain encryption schemes are exponentially secure against key recovery attacks] .. is that it is quite feasible to design ciphers that are effectively unbreakable.". John Nash made seminal contributions in mathematics and game theory, and was awarded both the Abel Prize in mathematics and the Nobel Memorial Prize in Economic Sciences. However, he struggled with mental illness throughout his life. His biography, A Beautiful Mind, was made into a popular movie. It is natural to compare Nash's 1955 letter to the NSA to Gödel's letter to von Neumann we mentioned before. From the theoretical computer science point of view, the crucial difference is that while Nash informally talks about exponential vs polynomial computation time, he does not mention the word "Turing Machine" or other models of computation, and it is not clear if he is aware or not that his conjecture can be made mathematically precise (assuming a formalization of "sufficiently complex types of enciphering").

The definition of computational secrecy we use is the notion of computational indistinguishability (known to be equivalent to semantic security) that was given by Goldwasser and Micali in 1982.

Although they used a different terminology, Diffie and Hellman already made clear in their paper that their protocol can be used as a public key encryption, with the first message being put in a "public file". In 1985, ElGamal showed how to obtain a signature scheme based on the Diffie-Hellman ideas, and since he described the Diffie-Hellman encryption scheme in the same paper, the public key encryption scheme originally proposed by Diffie and Hellman is sometimes also known as ElGamal encryption.

My survey contains a discussion of the different types of public key assumptions. While the standard elliptic curve cryptographic schemes are as susceptible to quantum computers as Diffie-Hellman and RSA, their main advantage is that the best known classical algorithms for computing discrete logarithms over elliptic curve groups take time $2^{\epsilon n}$ for some $\epsilon > 0$, where $n$ is the number of bits to describe a group element. In contrast, for the multiplicative group modulo a prime $p$ the best algorithm takes time $2^{O(n^{1/3} \cdot polylog(n))}$, which means that (assuming the known algorithms are optimal) we need to set the prime to be bigger (and so have larger key sizes with corresponding overhead in communication and computation) to get the same level of security.

Zero-knowledge proofs were constructed by Goldwasser, Micali, and Rackoff in 1982, and their wide applicability was shown (using the theory of NP completeness) by Goldreich, Micali, and Wigderson in 1986.

Two party and multiparty secure computation protocols were constructed (respectively) by Yao in 1982 and Goldreich, Micali, and Wigderson in 1987. The latter work gave a general transformation from security against passive adversaries to security against active adversaries using zero knowledge proofs.

22 Proofs and algorithms

“Let’s not try to define knowledge, but try to define zero-knowledge.”, Shafi Goldwasser.

Proofs have captured human imagination for thousands of years, ever since the publication of Euclid's Elements, a book second only to the Bible in the number of editions printed. Plan:

• Proofs and algorithms

• Interactive proofs

• Zero knowledge proofs

• Propositions as types, Coq and other proof assistants.

22.1 EXERCISES

22.2 BIBLIOGRAPHICAL NOTES


Learning Objectives:
• See main aspects in which quantum mechanics differs from local deterministic theories.
• Model of quantum circuits, or equivalently QNAND-CIRC programs
• The complexity class BQP and what we know about its relation to other classes
• Ideas behind Shor's Algorithm and the Quantum Fourier Transform

23 Quantum computing

“We always have had (secret, secret, close the doors!) … a great deal of diffi- culty in understanding the world view that quantum mechanics represents … It has not yet become obvious to me that there’s no real problem. … Can I learn anything from asking this question about computers–about this may or may not be mystery as to what the world view of quantum mechanics is?” , Richard Feynman, 1981

“The only difference between a probabilistic classical world and the equations of the quantum world is that somehow or other it appears as if the probabilities would have to go negative”, Richard Feynman, 1981

There were two schools of natural philosophy in ancient Greece. Aristotle believed that objects have an essence that explains their behavior, and a theory of the natural world has to refer to the reasons (or "final cause" to use Aristotle's language) as to why they exhibit certain phenomena. Democritus believed in a purely mechanistic explanation of the world. In his view, the universe was ultimately composed of elementary particles (or Atoms) and our observed phenomena arise from the interactions between these particles according to some local rules. Modern science (arguably starting with Newton) has embraced Democritus' point of view, of a mechanistic or "clockwork" universe of particles and forces acting upon them. While the classification of particles and forces evolved with time, to a large extent the "big picture" has not changed from Newton till Einstein. In particular it was held as an axiom that if we knew fully the current state of the universe (i.e., the particles and their properties such as location and velocity) then we could predict its future state at any point in time. In computational language, in all these theories the state of a system with $n$ particles could be stored in an array of $O(n)$ numbers, and predicting the evolution of the system can be done by running some efficient (e.g., $poly(n)$ time) deterministic computation on this array.


23.1 THE DOUBLE SLIT EXPERIMENT

Alas, in the beginning of the 20th century, several experimental results were calling into question this "clockwork" or "billiard ball" theory of the world. One such experiment is the famous double slit experiment. Here is one way to describe it. Suppose that we buy one of those baseball pitching machines, and aim it at a soft plastic wall, but put a metal barrier with a single slit between the machine and the plastic wall (see Fig. 23.1). If we shoot baseballs at the plastic wall, then some of the baseballs would bounce off the metal barrier, while some would make it through the slit and dent the wall. If we now carve out an additional slit in the metal barrier then more balls would get through, and so the plastic wall would be even more dented.

Figure 23.1: In the "double baseball experiment" we shoot baseballs from a gun at a soft wall through a hard barrier that has one or two slits open in it. There is only "constructive interference" in the sense that the dent in each position in the wall when both slits are open is the sum of the dents when each slit is open on its own.

So far this is pure common sense, and it is indeed (to my knowledge) an accurate description of what happens when we shoot baseballs at a plastic wall. However, this is not the same when we shoot photons. Amazingly, if we shoot with a "photon gun" (i.e., a laser) at a wall equipped with photon detectors through some barrier, then (as shown in Fig. 23.2) in some positions of the wall we will see fewer hits when the two slits are open than when only one of them is! In particular there are positions in the wall that are hit when the first slit is open, hit when the second slit is open, but are not hit at all when both slits are open!

It seems as if each photon coming out of the gun is aware of the global setup of the experiment, and behaves differently if two slits are open than if only one is. If we try to "catch the photon in the act" and place a detector right next to each slit so we can see exactly the path each photon takes then something even more bizarre happens. The mere fact that we measure the path changes the photon's behavior, and now this "destructive interference" pattern is gone and the number of times a position is hit when two slits are open is the sum of the number of times it is hit when each slit is open.

P You should read the paragraphs above more than once and make sure you appreciate how truly mind-boggling these results are.

Figure 23.2: The setup of the double slit experiment in the case of photon or electron guns. We see also destructive interference in the sense that there are some positions on the wall that get fewer hits when both slits are open than they get when only one of the slits is open. See also this video.

23.2 QUANTUM AMPLITUDES

The double slit and other experiments ultimately forced scientists to accept a very counterintuitive picture of the world. It is not merely about nature being randomized, but rather it is about the probabilities in some sense "going negative" and cancelling each other!

To see what we mean by this, let us go back to the baseball experiment. Suppose that the probability a ball passes through the left slit is $p_L$ and the probability that it passes through the right slit is $p_R$. Then, if we shoot $N$ balls out of each gun, we expect the wall will be hit $(p_L + p_R)N$ times. In contrast, in the quantum world of photons instead of baseballs, it can sometimes be the case that in both the first and second case the wall is hit with positive probabilities $p_L$ and $p_R$ respectively but somehow when both slits are open the wall (or a particular position in it) is not hit at all. It's almost as if the probabilities can "cancel each other out".

To understand the way we model this in quantum mechanics, it is helpful to think of a "lazy evaluation" approach to probability. We can think of a probabilistic experiment such as shooting a baseball through two slits in two different ways:

• When a ball is shot, "nature" tosses a coin and decides if it will go through the left slit (which happens with probability $p_L$), right slit (which happens with probability $p_R$), or bounce back. If it passes through one of the slits then it will hit the wall. Later we can look at the wall and find out whether or not this event happened, but the fact that the event happened or not is determined independently of whether or not we look at the wall.

• The other viewpoint is that when a ball is shot, "nature" computes the probabilities $p_L$ and $p_R$ as before, but does not yet "toss the coin" and determine what happened. Only when we actually look at the wall, nature tosses a coin and with probability $p_L + p_R$ ensures we see a dent. That is, nature uses "lazy evaluation", and only determines the result of a probabilistic experiment when we decide to measure it.

While the first scenario seems much more natural, the end result in both is the same (the wall is hit with probability $p_L + p_R$) and so the question of whether we should model nature as following the first scenario or second one seems like asking about the proverbial tree that falls in the forest with no one hearing about it.

However, when we want to describe the double slit experiment with photons rather than baseballs, it is the second scenario that lends itself better to a quantum generalization. Quantum mechanics associates a number $\alpha$ known as an amplitude with each probabilistic experiment. This number $\alpha$ can be negative, and in fact even complex. We never observe the amplitudes directly, since whenever we measure an event with amplitude $\alpha$, nature tosses a coin and determines that the event happens with probability $|\alpha|^2$. However, the sign (or in the complex case, phase) of the amplitudes can affect whether two different events have constructive or destructive interference.

Specifically, consider an event that can either occur or not (e.g. "detector number 17 was hit by a photon"). In classical probability, we model this by a probability distribution over the two outcomes: a pair of non-negative numbers $p$ and $q$ such that $p + q = 1$, where $p$ corresponds to the probability that the event occurs and $q$ corresponds to the probability that the event does not occur. In quantum mechanics, we model this also by a pair of numbers, which we call amplitudes. This is a pair of (potentially negative or even complex) numbers $\alpha$ and $\beta$ such that $|\alpha|^2 + |\beta|^2 = 1$. The probability that the event occurs is $|\alpha|^2$ and the probability that it does not occur is $|\beta|^2$. In isolation, these negative or complex numbers don't matter much, since we anyway square them to obtain probabilities. But the interaction of positive and negative amplitudes can result in surprising cancellations where somehow combining two scenarios where an event happens with positive probability results in a scenario where it never does.

P If you don’t find the above description confusing and unintuitive, you probably didn’t get it. Please make sure to re-read the above paragraphs until you are thoroughly confused.

Quantum mechanics is a mathematical theory that allows us to calculate and predict the results of the double-slit and many other experiments. If you think of quantum mechanics as an explanation as to what "really" goes on in the world, it can be rather confusing. However, if you simply "shut up and calculate" then it works amazingly well at predicting experimental results. In particular, in the double slit experiment, for any position in the wall, we can compute numbers $\alpha$ and $\beta$ such that photons from the first and second slit hit that position with probabilities $|\alpha|^2$ and $|\beta|^2$ respectively. When we open both slits, the probability that the position will be hit is proportional to $|\alpha + \beta|^2$, and so in particular, if $\alpha = -\beta$ then it will be the case that, despite being hit when either slit one or slit two are open, the position is not hit at all when they both are. If you are confused by quantum mechanics, you are not alone: for decades people have been trying to come up with explanations for "the underlying reality" behind quantum mechanics, including Bohmian Mechanics, Many Worlds and others. However, none of these interpretations have gained universal acceptance and all of those (by design) yield the same experimental predictions. Thus at this point many scientists prefer to just ignore the question of what is the "true reality" and go back to simply "shutting up and calculating".
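A two-line calculation shows how amplitudes, unlike probabilities, can cancel (a sketch of ours; the normalization $1/\sqrt{2}$ is chosen so that each slit alone gives probability $1/2$):

    import math

    alpha, beta = 1 / math.sqrt(2), -1 / math.sqrt(2)
    print(abs(alpha) ** 2)         # 0.5: probability with only one slit open
    print(abs(alpha + beta) ** 2)  # 0.0: both open, the amplitudes cancel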

R Remark 23.1 — Complex vs real, other simplifications. If (like the author) you are a bit intimidated by complex numbers, don't worry: you can think of all amplitudes as real (though potentially negative) numbers without loss of understanding. All the "magic" of quantum computing already arises in this case, and so we will often restrict attention to real amplitudes in this chapter. We will also only discuss so-called pure quantum states, and not the more general notion of mixed states. Pure states turn out to be sufficient for understanding the algorithmic aspects of quantum computing. More generally, this chapter is not meant to be a complete description of quantum mechanics, quantum information theory, or quantum computing, but rather to illustrate the main points where these differ from classical computing.

23.2.1 Linear algebra quick review
Linear algebra underlies much of quantum mechanics, and so you would do well to review some of the basic notions such as vectors, matrices, and linear subspaces. The operations in quantum mechanics can be represented as linear functions over the complex numbers, but we stick to the real numbers in this chapter. This does not cause much loss in understanding but does allow us to simplify our notation and eliminate the use of the complex conjugate. The main notions we use are:

• A function $F : \mathbb{R}^N \to \mathbb{R}^N$ is linear if $F(\alpha u + \beta v) = \alpha F(u) + \beta F(v)$ for every $\alpha, \beta \in \mathbb{R}$ and $u, v \in \mathbb{R}^N$.

• The inner product of two vectors $u, v \in \mathbb{R}^N$ can be defined as $\langle u, v \rangle = \sum_{i \in [N]} u_i v_i$. (There can be different inner products but we stick to this one.) The norm of a vector $u \in \mathbb{R}^N$ is defined as $\|u\| = \sqrt{\langle u, u \rangle} = \sqrt{\sum_{i \in [N]} u_i^2}$. We say that $u$ is a unit vector if $\|u\| = 1$.

• Two vectors $u, v \in \mathbb{R}^N$ are orthogonal if $\langle u, v \rangle = 0$. An orthonormal basis for $\mathbb{R}^N$ is a set of $N$ vectors $v_0, v_1, \ldots, v_{N-1}$ such that $\|v_i\| = 1$ for every $i \in [N]$ and $\langle v_i, v_j \rangle = 0$ for every $i \neq j$. A canonical example is the standard basis $e_0, \ldots, e_{N-1}$, where $e_i$ is the vector that has zeroes in all coordinates except the $i$-th coordinate in which its value is $1$. A quirk of the quantum mechanics literature is that $e_i$ is often denoted by $|i\rangle$. We often consider the case $N = 2^n$, in which case we identify $[N]$ with $\{0,1\}^n$ and for every $x \in \{0,1\}^n$, we denote the standard basis element corresponding to the $x$-th coordinate by $|x\rangle$.

• If $u$ is a vector in $\mathbb{R}^N$ and $v_0, \ldots, v_{N-1}$ is an orthonormal basis for $\mathbb{R}^N$, then there are coefficients $\alpha_0, \ldots, \alpha_{N-1}$ such that $u = \alpha_0 v_0 + \cdots + \alpha_{N-1} v_{N-1}$. Consequently, the value $F(u)$ is determined by the values $F(v_0), \ldots, F(v_{N-1})$. Moreover, $\|u\| = \sqrt{\sum_{i \in [N]} \alpha_i^2}$.

• We can represent a linear function $F : \mathbb{R}^N \to \mathbb{R}^N$ as an $N \times N$ matrix $M(F)$ where the coordinate in the $i$-th row and $j$-th column of $M(F)$ (that is $M(F)_{i,j}$) is equal to $\langle e_i, F(e_j) \rangle$ or equivalently the $i$-th coordinate of $F(e_j)$.

• A linear function $F : \mathbb{R}^N \to \mathbb{R}^N$ such that $\|F(u)\| = \|u\|$ for every $u$ is called unitary. It can be shown that a function is unitary if and only if $M(F)^\top M(F) = I$ where $\top$ is the transpose operator (in the complex case the conjugate transpose) and $I$ is the $N \times N$ identity matrix that has $1$'s on the diagonal and zeroes everywhere else. (For every two matrices $A, B$, we use $AB$ to denote the matrix product of $A$ and $B$.) Another equivalent characterization of this condition is that $M(F)^\top = M(F)^{-1}$ and yet another is that both the rows and columns of $M(F)$ form an orthonormal basis.
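As a quick numerical sanity check of the last item, here is a rotation matrix, a simple real example of a unitary map (a sketch of ours using numpy):

    import numpy as np

    theta = 0.7
    M = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    assert np.allclose(M.T @ M, np.eye(2))   # M^T M = I
    u = np.random.randn(2)
    assert np.isclose(np.linalg.norm(M @ u), np.linalg.norm(u))  # norm preserved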

23.3 BELL'S INEQUALITY

There is something weird about quantum mechanics. In 1935 Einstein, Podolsky and Rosen (EPR) tried to pinpoint this issue by highlighting a previously unrealized corollary of this theory. They showed that the idea that nature does not determine the results of an experiment until it is measured results in so called "spooky action at a distance". Namely, making a measurement of one object may instantaneously effect the state (i.e., the vector of amplitudes) of another object in the other end of the universe.

Since the vector of amplitudes is just a mathematical abstraction, the EPR paper was considered to be merely a thought experiment for philosophers to be concerned about, without bearing on experiments. This changed when in 1965 John Bell showed an actual experiment to test the predictions of EPR and hence pit intuitive common sense against quantum mechanics. Quantum mechanics won: it turns out that it is in fact possible to use measurements to create correlations between the states of objects far removed from one another that cannot be explained by any prior theory. Nonetheless, since the results of these experiments are so obviously wrong to anyone that has ever sat in an armchair, there are still a number of Bell denialists arguing that this can't be true and quantum mechanics is wrong.

So, what is this Bell's Inequality? Suppose that Alice and Bob try to convince you they have telepathic ability, and they aim to prove it via the following experiment. Alice and Bob will be in separate closed rooms.1 You will interrogate Alice and your associate will interrogate Bob. You choose a random bit $x \in \{0,1\}$ and your associate chooses a random $y \in \{0,1\}$. We let $a$ be Alice's response and $b$ be Bob's response. We say that Alice and Bob win this experiment if $a \oplus b = x \wedge y$. In other words, Alice and Bob need to output two bits that disagree if $x = y = 1$ and agree otherwise.

1 If you are extremely paranoid about Alice and Bob communicating with one another, you can coordinate with your assistant to perform the experiment exactly at the same time, and make sure that the rooms are sufficiently far apart (e.g., are on two different continents, or maybe even one is on the moon and another is on earth) so that Alice and Bob couldn't communicate to each other in time the results of their respective coins even if they do so at the speed of light.

Now if Alice and Bob are not telepathic, then they need to agree in advance on some strategy. It's not hard for Alice and Bob to succeed with probability $3/4$: just always output the same bit. Moreover, by doing some case analysis, we can show that no matter what strategy they use, Alice and Bob cannot succeed with higher probability than that:

Theorem 23.2 — Bell’s Inequality. For every two functions , Pr . 푓, 푔 ∶ {0, 1} → 푥,푦∈{0,1} Proof.{0, 1}Since the probability[푓(푥) ⊕ 푔(푦) is taken = 푥 ∧ over 푦] ≤ all 3/4 four choices of , the only way the theorem can be violated if if there exist two functions that satisfy 푥, 푦 ∈ {0, 1} 푓, 푔 (23.1) for all the four choices of . Let’s plug in all these four 푓(푥) ⊕ 푔(푦) = 푥 ∧ 푦 choices and see what we get (below we use2 the equalities , and ): 푥, 푦 ∈ {0, 1} 푧 ⊕ 0 = 푧 푧 ∧ 0 = 0 푧 ∧ 1 = 푧 plugging in plugging in 푓(0) ⊕ 푔(0) = 0 ( 푥 = 0, 푦 = 0) (23.2) plugging in 푓(0) ⊕ 푔(1) = 0 ( 푥 = 0, 푦 = 1) plugging in 푓(1) ⊕ 푔(0) = 0 ( 푥 = 1, 푦 = 0) If we XOR together the first and second equalities we get while푓(1) if we⊕ 푔(1) XOR together= 1 ( the third and푥 = fourth 1, 푦 = equalities 1) we get , thus obtaining a contradiction. 푔(0) ⊕ 푔(1) = 0 ■ 푔(0) ⊕ 푔(1) = 1

R Remark 23.3 — Randomized strategies. Theorem 23.2 above assumes that Alice and Bob use deterministic strategies $f$ and $g$ respectively. More generally, Alice and Bob could use a randomized strategy, or equivalently, each could choose $f$ and $g$ from some distributions $\mathcal{F}$ and $\mathcal{G}$ respectively. However the averaging principle (Lemma 18.10) implies that if all possible deterministic strategies succeed with probability at most $3/4$, then the same is true for all randomized strategies.

An amazing experimentally verified fact is that quantum mechanics allows for "telepathy".2 Specifically, it has been shown that using the weirdness of quantum mechanics, there is in fact a strategy for Alice and Bob to succeed in this game with probability larger than $3/4$ (in fact, they can succeed with probability about $0.85$, see Lemma 23.5).

2 More accurately, one either has to give up on a "billiard ball type" theory of the universe or believe in telepathy (believe it or not, some scientists went for the latter option).

23.4 QUANTUM WEIRDNESS

Some of the counterintuitive properties that arise from quantum mechanics include:

• Interference - As we’ve seen, quantum amplitudes can “cancel each other out”.

• Measurement - The idea that amplitudes are negative as long as "no one is looking" and "collapse" (by squaring them) to positive probabilities when they are measured is deeply disturbing. Indeed, as shown by EPR and Bell, this leads to various strange outcomes such as "spooky actions at a distance", where we can create correlations between the results of measurements in places far removed. Unfortunately (or fortunately?) these strange outcomes have been confirmed experimentally.

• Entanglement - The notion that two parts of the system could be connected in this weird way where measuring one will affect the other is known as quantum entanglement.

As counter-intuitive as these concepts are, they have been experimentally confirmed, so we just have to live with them. The discussion in this chapter of quantum mechanics in general and quantum computing in particular is quite brief and superficial; the "bibliographical notes" section (Section 23.13) contains references and links to many other resources that cover this material in more depth.

23.5 QUANTUM COMPUTING AND COMPUTATION - AN EXECUTIVE SUMMARY

One of the strange aspects of the quantum-mechanical picture of the world is that unlike in the billiard ball example, there is no obvious algorithm to simulate the evolution of $n$ particles over $t$ time periods in $poly(n, t)$ steps. In fact, the natural way to simulate $n$ quantum particles will require a number of steps that is exponential in $n$. This is a huge headache for scientists that actually need to do these calculations in practice.

In 1981, physicist Richard Feynman proposed to "turn this lemon to lemonade" by making the following almost tautological observation:

If a physical system cannot be simulated by a computer in $T$ steps, the system can be considered as performing a computation that would take more than $T$ steps.

So, he asked whether one could design a quantum system such that its outcome $y$ based on the initial condition $x$ would be some function $y = f(x)$ such that (a) we don't know how to efficiently compute $f$ in any other way, and (b) $f$ is actually useful for something.3 In 1985, David Deutsch formally suggested the notion of a quantum Turing machine, and the model has been since refined in works of Deutsch, Josza, Bernstein and Vazirani. Such a system is now known as a quantum computer.

3 As its name suggests, Feynman's lecture was actually focused on the other side of simulating physics with a computer. However, he mentioned that as a "side remark" one could wonder if it's possible to simulate physics with a new kind of computer - a "quantum computer" which would "not [be] a Turing machine, but a machine of a different kind". As far as I know, Feynman did not suggest that such a computer could be useful for computations completely outside the domain of quantum simulation. Indeed, he was more interested in the question of whether quantum mechanics could be simulated by a classical computer.

For a while these hypothetical quantum computers seemed useful for one of two things. First, to provide a general-purpose mechanism to simulate a variety of the real quantum systems that people care about, such as various interactions inside molecules in quantum chemistry. Second, as a challenge to the Physical Extended Church-Turing Thesis which says that every physically realizable computation device can be modeled (up to polynomial overhead) by Turing machines (or equivalently, NAND-TM / NAND-RAM programs).

Quantum chemistry is important (and in particular understanding it can be a bottleneck for designing new materials, drugs, and more), but it is still a rather niche area within the broader context of computing (and even scientific computing) applications. Hence for a while most researchers (to the extent they were aware of it) thought of quantum computers as a theoretical curiosity that has little bearing on practice, given that this theoretical "extra power" of quantum computers seemed to offer little advantage in the majority of the problems people want to solve in areas such as combinatorial optimization, machine learning, data structures, etc.

To some extent this is still true today. As far as we know, quantum computers, if built, will not provide exponential speed ups for 95% of the applications of computing.4 In particular, as far as we know, quantum computers will not help us solve NP complete problems in polynomial or even sub-exponential time, though Grover's algorithm (Remark 23.4) does yield a quadratic advantage in many cases. However, there is one cryptography-sized exception: In 1994 Peter Shor showed that quantum computers can solve the integer factoring and discrete logarithm problems in polynomial time. This result has captured the imagination of a great many people, and completely energized research into quantum computing. This is both because the hardness of these particular problems provides the foundations for securing such a huge part of our communications (and these days, our economy), and because it was a powerful demonstration that quantum computers could turn out to be useful for problems that a-priori seemed to have nothing to do with quantum physics.

4 This "95 percent" is a figure of speech, but not completely so. At the time of this writing, cryptocurrency mining electricity consumption is estimated to use up at least 70Twh or 0.3 percent of the world's production, which is about 2 to 5 percent of the total energy usage for the computing industry. All the current cryptocurrencies will be broken by quantum computers. Also, for many web servers the TLS protocol (which is based on the current non-lattice based systems that would be completely broken by quantum computing) is responsible for about 1 percent of the CPU usage.

As we'll discuss later, at the moment there are several intensive efforts to construct large scale quantum computers. It seems safe to say that, as far as we know, in the next five years or so there will not be a quantum computer large enough to factor, say, a 1024 bit number. On the other hand, it does seem quite likely that in the very near future there will be quantum computers which achieve some task exponentially faster than the best-known way to achieve the same task with a classical computer. When and if a quantum computer is built that is strong enough to break reasonable parameters of Diffie-Hellman, RSA and elliptic curve cryptography is anybody's guess. It could also be a "self destroying prophecy" whereby the existence of a small-scale quantum computer would cause everyone to shift away to lattice-based crypto which in turn will diminish the motivation to invest the huge resources needed to build a large scale quantum computer.5

5 Of course, given that we're still hearing of attacks exploiting "export grade" cryptography that was supposed to disappear in the 1990's, I imagine that we'll still have products running 1024 bit RSA when everyone has a quantum laptop.

R Remark 23.4 — Quantum computing and NP. Despite popular accounts of quantum computers as having variables that can take "zero and one at the same time" and therefore can "explore an exponential number of possibilities simultaneously", their true power is much more subtle and nuanced. In particular, as far as we know, quantum computers do not enable us to solve NP complete problems such as 3SAT in polynomial or even sub-exponential time. However, Grover's search algorithm does give a more modest advantage (namely, quadratic) for quantum computers over classical ones for problems in NP. In particular, due to Grover's search algorithm, we know that the $k$-SAT problem for $n$ variables can be solved in time $O(2^{n/2} poly(n))$ on a quantum computer for every $k$. In contrast, the best known algorithms for $k$-SAT on a classical computer take roughly $2^{(1 - \frac{1}{k})n}$ steps.

Big Idea 28 Quantum computers are not a panacea and are unlikely to solve NP complete problems, but they can provide exponential speedups to certain structured problems.

23.6 QUANTUM SYSTEMS

Before we talk about quantum computing, let us recall how we phys- ically realize “vanilla” or classical computing. We model a logical bit quantum computing 619

that can equal or a by some physical system that can be in one of two states. For example, it might be a wire with high or low voltage, charged or uncharged0 1 capacitor, or even (as we saw) a pipe with or without a flow of water, or the presence or absence of a soldier crab.A classical system of bits is composed of such “basic systems”, each of which can be in either a “zero” or “one” state. We can model the state of such a system푛 by a string 푛 . If we perform an op- eration such as writing to the 17-th bit the NAND푛 of the 3rd and 5th bits, this corresponds to applying푠 a ∈local {0,function 1} to such as setting . In the probabilistic setting, we would model the state푠 of the system 17 3 5 푠by a=distribution 1 − 푠 ⋅ 푠 . For an individual bit, we could model it by a pair of non-negative numbers such that , where is the prob- ability that the bit is zero and is the probability that the bit is one. For example, applying the훼, 훽negation (i.e.,훼 + NOT) 훽 = 1 operation훼 to this bit corresponds to mapping the pair훽 to since the probability that NOT is equal to is the same as the probability that is equal to . This means that we can think(훼, of 훽) the NOT(훽, 훼) function as the linear (휎) 1 휎 map such that or equivalently as the 0 2 2 훼 훽 푁 ∶ ℝ → ℝ 푁 ( ) = ( ) matrix . 훽 훼 If we think0 1 of the -bit system as a whole, then since the bits can ( ) take one of1 0 possible values, we model the state of the system as a vector of 푛probabilities.푛 For every , we denote푛 by the dimensional2푛 vector that has in the coordinate푛 correspond- 푠 ing to푛 푝(identifying2 it with a number푠 in ∈ {0,), 1} and so can write 푒as 2 where is the probability1 that푛 the system is in the state

Applying the operation above of setting the 17-th bit to the NAND of the 3rd and 5th bits corresponds to transforming the vector $p$ to the vector $Fp$ where $F : \mathbb{R}^{2^n} \to \mathbb{R}^{2^n}$ is the linear map that maps $e_s$ to $e_{s_0 \cdots s_{16} (1 - s_3 \cdot s_5) s_{18} \cdots s_{n-1}}$. (Since $\{e_s\}_{s\in\{0,1\}^n}$ is a basis for $\mathbb{R}^{2^n}$, it suffices to define the map $F$ on vectors of this form.)

P
Please make sure you understand why performing the operation will take a system in state $p$ to a system in the state $Fp$. Understanding the evolution of probabilistic systems is a prerequisite to understanding the evolution of quantum systems. If your linear algebra is a bit rusty, now would be a good time to review it, and in particular make sure you are comfortable with the notions of matrices, vectors, (orthogonal and orthonormal) bases, and norms.
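To make this concrete, here is a minimal Python sketch (using numpy; the 3-bit system and the choice of which bits to NAND are illustrative stand-ins for the 17th/3rd/5th-bit example above) of a probabilistic state as a $2^n$ dimensional vector, and of applying such an operation to it:

```python
import numpy as np

n = 3                        # a toy 3-bit system, so the state is a 2^3 vector

def nand_update(s):
    """Map the string s (as an integer, bit 0 most significant) to the string
    with bit 2 set to 1 - s0*s1: a small-scale stand-in for 'write into the
    17th bit the NAND of the 3rd and 5th bits'."""
    bits = [(s >> (n - 1 - i)) & 1 for i in range(n)]
    bits[2] = 1 - bits[0] * bits[1]
    return sum(b << (n - 1 - i) for i, b in enumerate(bits))

# Build the linear map F by specifying where each basis vector e_s is sent.
F = np.zeros((2 ** n, 2 ** n))
for s in range(2 ** n):
    F[nand_update(s), s] = 1

p = np.full(2 ** n, 1 / 2 ** n)   # the uniform distribution over {0,1}^3
q = F @ p                          # the distribution after the operation
print(q.sum())                     # 1.0: F maps distributions to distributions
```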

23.6.1 Quantum amplitudes
In the quantum setting, the state of an individual bit (or "qubit", to use quantum parlance) is modeled by a pair of numbers $(\alpha, \beta)$ such that $|\alpha|^2 + |\beta|^2 = 1$. While in general these numbers can be complex, for the rest of this chapter, we will often assume they are real (though potentially negative), and hence often drop the absolute value operator. (This turns out not to make much of a difference in explanatory power.) As before, we think of $\alpha^2$ as the probability that the bit equals $0$ and $\beta^2$ as the probability that the bit equals $1$. As we did before, we can model the NOT operation by the map $N : \mathbb{R}^2 \to \mathbb{R}^2$ where $N(\alpha, \beta) = (\beta, \alpha)$.

Following quantum tradition, instead of using $e_0$ and $e_1$ as we did above, from now on we will denote the vector $(1,0)$ by $|0\rangle$ and the vector $(0,1)$ by $|1\rangle$ (and moreover, think of these as column vectors). This is known as the Dirac "ket" notation. This means that NOT is the unique linear map $N : \mathbb{R}^2 \to \mathbb{R}^2$ that satisfies $N|0\rangle = |1\rangle$ and $N|1\rangle = |0\rangle$. In other words, in the quantum case, as in the probabilistic case, NOT corresponds to the matrix

$$N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (23.3)$$

In classical computation, we typically think that there are only two operations that we can do on a single bit: keep it the same or negate it. In the quantum setting, a single bit operation corresponds to any linear map $OP : \mathbb{R}^2 \to \mathbb{R}^2$ that is norm preserving in the sense that for every $\alpha, \beta$, if we apply $OP$ to the vector $\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$ then we obtain a vector $\begin{pmatrix} \alpha' \\ \beta' \end{pmatrix}$ such that $\alpha'^2 + \beta'^2 = \alpha^2 + \beta^2$. Such a linear map $OP$ corresponds to a unitary two by two matrix. (As we mentioned, quantum mechanics actually models states as vectors with complex coordinates; however, this does not make any qualitative difference to our discussion.) Keeping the bit the same corresponds to the matrix

$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and (as we've seen) the NOT operation corresponds to the matrix $N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. But there are other operations we can use as well. One such useful operation is the Hadamard operation, which corresponds to the matrix

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}. \qquad (23.4)$$

In fact it turns out that Hadamard is all that we need to add to a classical universal basis to achieve the full power of quantum computing.

23.6.2 Quantum systems: an executive summary
If you ignore the physics and philosophy, for the purposes of understanding the model of quantum computers, all you need to know about quantum systems is the following. The state of a quantum system of $n$ qubits is modeled by a $2^n$ dimensional vector $\psi$ of unit norm (i.e., the squares of all coordinates sum up to $1$), which we write as $\psi = \sum_{x\in\{0,1\}^n} \psi_x |x\rangle$ where $|x\rangle$ is the column vector that has $0$ in all coordinates except the one corresponding to $x$ (identifying $\{0,1\}^n$ with the numbers $\{0,\ldots,2^n-1\}$). We use the convention that if $a, b$ are strings of lengths $k$ and $\ell$ respectively then we can write the $2^{k+\ell}$ dimensional vector with $1$ in the $ab$-th coordinate and zero elsewhere not just as $|ab\rangle$ but also as $|a\rangle|b\rangle$. In particular, for every $x \in \{0,1\}^n$, we can write the vector $|x\rangle$ also as $|x_0\rangle|x_1\rangle\cdots|x_{n-1}\rangle$. This notation satisfies certain nice distributive laws such as $|a\rangle(|b\rangle + |b'\rangle)|c\rangle = |abc\rangle + |ab'c\rangle$.

A quantum operation on such a system is modeled by a $2^n \times 2^n$ unitary matrix $U$ (one that satisfies $UU^\top = I$ where $\top$ is the transpose operation, or conjugate transpose for complex matrices). If the system is in state $\psi$ and we apply to it the operation $U$, then the new state of the system is $U\psi$.

When we measure an $n$-qubit system in a state $\psi = \sum_{x\in\{0,1\}^n} \psi_x |x\rangle$, then we observe the value $x \in \{0,1\}^n$ with probability $|\psi_x|^2$. In this case, the system collapses to the state $|x\rangle$.
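As a quick sanity check of these rules, here is a minimal Python/numpy sketch for a single qubit (the Hadamard matrix $H$ from above is used as the example unitary, and the random seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)               # seed chosen arbitrarily

ket0 = np.array([1.0, 0.0])                  # |0>: a unit-norm state vector
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)         # Hadamard; check: H @ H.T == I

psi = H @ ket0                               # new state after applying H

probs = np.abs(psi) ** 2                     # Pr[observe x] = |psi_x|^2
x = rng.choice(2, p=probs)                   # observe 0 or 1, each w.p. 1/2
psi_after = np.zeros(2); psi_after[x] = 1.0  # the state collapses to |x>
```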

23.7 ANALYSIS OF BELL'S INEQUALITY (OPTIONAL)

Now that we have the notation in place, we can show a strategy for Alice and Bob to display "quantum telepathy" in Bell's Game. Recall that in the classical case, Alice and Bob can succeed in the "Bell Game" with probability at most $3/4 = 0.75$. We now show that quantum mechanics allows them to succeed with probability at least $0.8$. (The strategy we show is not the best one. Alice and Bob can in fact succeed with probability $\cos^2(\pi/8) \sim 0.854$.)

Lemma 23.5 There is a 2-qubit quantum state $\psi \in \mathbb{C}^4$ so that if Alice has access to the first qubit of $\psi$, can manipulate and measure it and output $a \in \{0,1\}$ and Bob has access to the second qubit of $\psi$ and can manipulate and measure it and output $b \in \{0,1\}$ then $\Pr[a \oplus b = x \wedge y] \geq 0.8$.

Proof. Alice and Bob will start by preparing a 2-qubit quantum system in the state

$$\psi = \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle \qquad (23.5)$$

(this state is known as an EPR pair). Alice takes the first qubit of the system to her room, and Bob takes the second qubit to his room. Now, when Alice receives $x$, if $x = 0$ she does nothing and if $x = 1$ she applies the unitary map $R_{-\pi/8}$ to her qubit, where $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ is the unitary operation corresponding to rotation in the plane by angle $\theta$. When Bob receives $y$, if $y = 0$ he does nothing and if $y = 1$ he applies the unitary map $R_{\pi/8}$ to his qubit. Then each one of them measures their qubit and sends this as their response.

Recall that to win the game Bob and Alice want their outputs to be more likely to differ if $x = y = 1$ and to be more likely to agree otherwise. We will split the analysis in one case for each of the four possible values of $x$ and $y$.

Case 1: $x = 0$ and $y = 0$. If $x = y = 0$ then the state does not change. Because the state $\psi$ is proportional to $|00\rangle + |11\rangle$, the measurements of Bob and Alice will always agree (if Alice measures $0$ then the state collapses to $|00\rangle$ and so Bob measures $0$ as well, and similarly for $1$). Hence in the case $x = y = 0$, Alice and Bob always win.

Case 2: $x = 0$ and $y = 1$. If $x = 0$ and $y = 1$ then after Alice measures her bit, if she gets $0$ then the system collapses to the state $|00\rangle$, in which case after Bob performs his rotation, his qubit is in the state $\cos(\pi/8)|0\rangle + \sin(\pi/8)|1\rangle$. Thus, when Bob measures his qubit, he will get $0$ (and hence agree with Alice) with probability $\cos^2(\pi/8) \geq 0.85$. Similarly, if Alice gets $1$ then the system collapses to $|11\rangle$, in which case after rotation Bob's qubit will be in the state $-\sin(\pi/8)|0\rangle + \cos(\pi/8)|1\rangle$ and so once again he will agree with Alice with probability $\cos^2(\pi/8)$.

Case 3: The analysis for Case 3, where $x = 1$ and $y = 0$, is completely analogous to Case 2. Hence Alice and Bob will agree with probability $\cos^2(\pi/8)$ in this case as well. (To show this we use the observation that the result of this experiment is the same regardless of the order in which Alice and Bob apply their rotations and measurements; this requires a proof but is not very hard to show.)

Case 4: $x = 1$ and $y = 1$. For the case that $x = 1$ and $y = 1$, after both Alice and Bob perform their rotations, the state will be proportional to

$$R_{-\pi/8}|0\rangle R_{\pi/8}|0\rangle + R_{-\pi/8}|1\rangle R_{\pi/8}|1\rangle. \qquad (23.6)$$

Intuitively, since we rotate one state by 45 degrees and the other state by -45 degrees, they will become orthogonal to each other, and the measurements will behave like independent coin tosses that agree with probability 1/2. However, for the sake of completeness, we now show the full calculation.

Opening up the coefficients and using $\cos(-x) = \cos(x)$ and $\sin(-x) = -\sin(x)$, we can see that (23.6) is proportional to

$$\cos^2(\pi/8)|00\rangle + \cos(\pi/8)\sin(\pi/8)|01\rangle - \sin(\pi/8)\cos(\pi/8)|10\rangle - \sin^2(\pi/8)|11\rangle$$
$$- \sin^2(\pi/8)|00\rangle + \sin(\pi/8)\cos(\pi/8)|01\rangle - \cos(\pi/8)\sin(\pi/8)|10\rangle + \cos^2(\pi/8)|11\rangle. \qquad (23.7)$$

Using the trigonometric identities $2\sin(\alpha)\cos(\alpha) = \sin(2\alpha)$ and $\cos^2(\alpha) - \sin^2(\alpha) = \cos(2\alpha)$, we see that the probability of getting any one of $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ is proportional to $\cos(\pi/4) = \sin(\pi/4) = \frac{1}{\sqrt{2}}$. Hence all four options for $(a, b)$ are equally likely, which means that in this case $a = b$ with probability $0.5$. Taking all the four cases together, the overall probability of winning the game is $\frac{1}{4} \cdot 1 + \frac{1}{2} \cdot 0.85 + \frac{1}{4} \cdot 0.5 = 0.8$. ■
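The case analysis above can also be verified numerically. The following Python/numpy sketch (assuming the convention that the first tensor factor is Alice's qubit) computes the exact winning probability of this strategy:

```python
import numpy as np

def R(theta):                                  # rotation by angle theta
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

I2 = np.eye(2)
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
epr = (np.kron(e0, e0) + np.kron(e1, e1)) / np.sqrt(2)   # the EPR pair (23.5)

win = 0.0
for x in (0, 1):
    for y in (0, 1):
        A = R(-np.pi / 8) if x == 1 else I2    # Alice's operation on qubit 1
        B = R(np.pi / 8) if y == 1 else I2     # Bob's operation on qubit 2
        psi = np.kron(A, B) @ epr
        for ab in range(4):                    # outcome |ab>: a Alice, b Bob
            a, b = ab >> 1, ab & 1
            if a ^ b == (x & y):               # the winning condition
                win += (psi[ab] ** 2) / 4      # each (x, y) chosen w.p. 1/4
print(win)                                     # ~0.802 >= 0.8, as in Lemma 23.5
```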

R Remark 23.6 — Quantum vs probabilistic strategies. It is instructive to understand what it is about quantum mechanics that enabled this gain in Bell's Inequality. For this, consider the following analogous probabilistic strategy for Alice and Bob. They agree that each one of them outputs $0$ if he or she gets $0$ as input and outputs $1$ with probability $p$ if they get $1$ as input. In this case one can see that their success probability would be $\frac{1}{4} \cdot 1 + \frac{1}{2}(1-p) + \frac{1}{4}[2p(1-p)] = 0.75 - 0.5p^2 \leq 0.75$. The quantum strategy we described above can be thought of as a variant of the probabilistic strategy for the parameter $p$ set to $\sin^2(\pi/8) = 0.15$. But in the case $x = y = 1$, instead of disagreeing only with probability $2p(1-p) = 1/4$, the existence of the so called "negative probabilities" in the quantum world allowed us to rotate the state in opposing directions to achieve destructive interference and hence a higher probability of disagreement, namely $\sin^2(\pi/4) = 0.5$.

23.8 QUANTUM COMPUTATION

Recall that in the classical setting, we modeled computation as obtained by a sequence of basic operations. We had two types of computational models:

• Non uniform models of computation such as Boolean circuits and NAND-CIRC programs, where a finite function $f : \{0,1\}^n \to \{0,1\}$ is computable in size $T$ if it can be expressed as a combination of $T$ basic operations (gates in a circuit or lines in a NAND-CIRC program).

• Uniform models of computation such as Turing machines and NAND-TM programs, where an infinite function $F : \{0,1\}^* \to \{0,1\}$ is computable in time $T(n)$ if there is a single algorithm that on input $x \in \{0,1\}^n$ evaluates $F(x)$ using at most $T(n)$ basic steps.

When considering efficient computation, we defined the class P to consist of all infinite functions $F : \{0,1\}^* \to \{0,1\}$ that can be computed by a Turing machine or NAND-TM program in time $p(n)$ for some polynomial $p(\cdot)$. We defined the class P$_{/poly}$ to consist of all infinite functions $F : \{0,1\}^* \to \{0,1\}$ such that for every $n$, the restriction $F_{\upharpoonright n}$ of $F$ to $\{0,1\}^n$ can be computed by a Boolean circuit or NAND-CIRC program of size at most $p(n)$ for some polynomial $p(\cdot)$.

We will do the same for quantum computation, focusing mostly on the non uniform setting of quantum circuits, since that is simpler, and already illustrates the important differences with classical computing.

23.8.1 Quantum circuits
A quantum circuit is analogous to a Boolean circuit, and can be described as a directed acyclic graph. One crucial difference is that the out degree of every vertex in a quantum circuit is at most one. This is because we cannot "reuse" quantum states without measuring them (which collapses their "probabilities"). Therefore, we cannot use the same qubit as input for two different gates. (This is known as the No Cloning Theorem.) Another more technical difference is that to express our operations as unitary matrices, we will need to make sure all our gates are reversible. This is not hard to ensure. For example, in the quantum context, instead of thinking of NAND as a (non reversible) map from $\{0,1\}^2$ to $\{0,1\}$, we will think of it as the reversible map on three qubits that maps $a, b, c$ to $a, b, c \oplus NAND(a,b)$ (i.e., flip the last bit if the NAND of the first two bits is $1$). Equivalently, the NAND operation corresponds to the $8 \times 8$ unitary matrix $U_{NAND}$ such that (identifying $\{0,1\}^3$ with $[8]$) for every $a, b, c \in \{0,1\}$, if $|abc\rangle$ is the basis element with $1$ in the $abc$-th coordinate and zero elsewhere, then $U_{NAND}|abc\rangle = |ab(c \oplus NAND(a,b))\rangle$. (Readers familiar with quantum computing should note that $U_{NAND}$ is a close variant of the so called Toffoli gate, and so QNAND-CIRC programs correspond to quantum circuits with the Hadamard and Toffoli gates.) If we order the rows and

columns as $000, 001, 010, \ldots, 111$, then $U_{NAND}$ can be written as the following matrix:

$$U_{NAND} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (23.8)$$

If we have an $n$ qubit system, then for $i, j, k \in [n]$, we will denote by $U^{i,j,k}_{NAND}$ the $2^n \times 2^n$ unitary matrix that corresponds to applying $U_{NAND}$ to the $i$-th, $j$-th, and $k$-th bits, leaving the others intact. That is, for every $v = \sum_{x\in\{0,1\}^n} v_x |x\rangle$, $U^{i,j,k}_{NAND} v = \sum_{x\in\{0,1\}^n} v_x |x_0 \cdots x_{k-1}(x_k \oplus NAND(x_i, x_j))x_{k+1} \cdots x_{n-1}\rangle$.

As mentioned above, we will also use the Hadamard or HAD operation. A quantum circuit is obtained by applying a sequence of $U_{NAND}$ and HAD gates, where a HAD gate corresponds to applying the matrix

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}. \qquad (23.9)$$

Another way to define $H$ is that for $b \in \{0,1\}$, $H|b\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}(-1)^b|1\rangle$. We define $HAD^i$ to be the $2^n \times 2^n$ unitary matrix that applies HAD to the $i$-th qubit and leaves the others intact. Using the ket notation, we can write this as

$$HAD^i \sum_{x\in\{0,1\}^n} v_x |x\rangle = \frac{1}{\sqrt{2}} \sum_{x\in\{0,1\}^n} v_x |x_0 \cdots x_{i-1}\rangle (|0\rangle + (-1)^{x_i}|1\rangle) |x_{i+1} \cdots x_{n-1}\rangle. \qquad (23.10)$$

A quantum circuit is obtained by composing these basic operations on some $m$ qubits. If $m \geq n$, we use a circuit to compute a function $f : \{0,1\}^n \to \{0,1\}$:

• On input $x$, we initialize the system to hold $x_0, \ldots, x_{n-1}$ in the first $n$ qubits, and initialize all remaining $m - n$ qubits to zero.

• We execute each elementary operation one by one: at every step we apply to the current state either an operation of the form $U^{i,j,k}_{NAND}$ or an operation of the form $HAD^i$ for $i, j, k \in [m]$.

• At the end of the computation, we measure the system, and output the result of the last qubit (i.e. the qubit in location $m - 1$). (For simplicity we restrict attention to functions with a single bit of output, though the definition of quantum circuits naturally extends to circuits with multiple outputs.)

• We say that the circuit computes the function $f$ if the probability that this output equals $f(x)$ is at least $2/3$. Note that this probability is obtained by summing up the squares of the amplitudes of all coordinates in the final state of the system corresponding to vectors $|y\rangle$ where $y_{m-1} = f(x)$.

Formally we define quantum circuits as follows:

Definition 23.7 — Quantum circuit. Let $s \geq m \geq n$. A quantum circuit of $n$ inputs, $m - n$ auxiliary bits, and $s$ gates over the $\{U_{NAND}, HAD\}$ basis is a sequence of $s$ $2^m \times 2^m$ unitary matrices $U_0, \ldots, U_{s-1}$ such that each matrix $U_\ell$ is either of the form $U^{i,j,k}_{NAND}$ for $i, j, k \in [m]$ or $HAD^i$ for $i \in [m]$.

A quantum circuit computes a function $f : \{0,1\}^n \to \{0,1\}$ if the following is true for every $x \in \{0,1\}^n$: Let $v$ be the vector

$$v = U_{s-1} U_{s-2} \cdots U_1 U_0 |x 0^{m-n}\rangle \qquad (23.11)$$

and write $v$ as $\sum_{y\in\{0,1\}^m} v_y |y\rangle$. Then

$$\sum_{y\in\{0,1\}^m \text{ s.t. } y_{m-1} = f(x)} |v_y|^2 \geq \frac{2}{3}. \qquad (23.12)$$

P Please stop here and see that this definition makes sense to you.
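To make Definition 23.7 concrete, here is a minimal Python/numpy sketch of a state-vector simulator for the two gate types (the convention that qubit 0 is the most significant bit of the state index is an arbitrary choice of this sketch):

```python
import numpy as np

def had(psi, i, m):
    """Apply HAD^i to an m-qubit state vector psi (length 2^m)."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    psi = psi.reshape([2] * m)
    psi = np.moveaxis(np.tensordot(H, psi, axes=([1], [i])), 0, i)
    return psi.reshape(-1)

def u_nand(psi, i, j, k, m):
    """Apply U_NAND^{i,j,k}: XOR NAND(x_i, x_j) into x_k (a permutation)."""
    out = np.zeros_like(psi)
    for x in range(2 ** m):
        xi, xj = (x >> (m - 1 - i)) & 1, (x >> (m - 1 - j)) & 1
        out[x ^ ((1 - xi * xj) << (m - 1 - k))] = psi[x]
    return out

# Example: a 3-qubit circuit writing NAND of the two input bits into qubit 2.
m = 3
psi = np.zeros(2 ** m); psi[0b110] = 1.0   # input x = 11, auxiliary qubit = 0
psi = u_nand(psi, 0, 1, 2, m)              # qubit 2 <- NAND(1, 1) = 0
print(np.abs(psi[0b110]) ** 2)             # 1.0: measuring yields 110 for sure
```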

Big Idea 29 Just as we did with classical computation, we can define mathematical models for quantum computation, and represent quantum algorithms as binary strings.

Once we have the notion of quantum circuits, we can define the quantum analog of P$_{/poly}$ (i.e., the class of functions computable by polynomial size quantum circuits) as follows:

Definition 23.8 — BQP$_{/poly}$. Let $F : \{0,1\}^* \to \{0,1\}$. We say that $F \in$ BQP$_{/poly}$ if there exists some polynomial $p : \mathbb{N} \to \mathbb{N}$ such that for every $n \in \mathbb{N}$, if $F_{\upharpoonright n}$ is the restriction of $F$ to inputs of length $n$, then there is a quantum circuit of size at most $p(n)$ that computes $F_{\upharpoonright n}$.

R Remark 23.9 — The obviously exponential fallacy. A priori it might seem "obvious" that quantum computing is exponentially powerful, since to perform a quantum computation on $n$ bits we need to maintain the $2^n$ dimensional state vector and apply $2^n \times 2^n$ matrices to it. Indeed popular descriptions of quantum computing (too) often say something along the lines that the difference between quantum and classical computers is that a classical bit can either be zero or one while a qubit can be in both states at once, and so with many qubits a quantum computer can perform exponentially many computations at once.

Depending on how you interpret it, this description is either false or would apply equally well to probabilistic computation, even though we've already seen that every randomized algorithm can be simulated by a similar-sized circuit, and in fact we conjecture that BPP = P. Moreover, this "obvious" approach for simulating a quantum computation will take not just exponential time but exponential space as well, while it can be shown that using a simple recursive formula one can calculate the final quantum state using polynomial space (in physics this is known as "Feynman path integrals"). So, the exponentially long vector description by itself does not imply that quantum computers are exponentially powerful. Indeed, we cannot prove that they are (i.e., we have not been able to rule out the possibility that every QNAND-CIRC program could be simulated by a NAND-CIRC program/Boolean circuit with polynomial overhead), but we do have some problems (integer factoring most prominently) for which they do provide exponential speedup over the currently best known classical (deterministic or probabilistic) algorithms.

23.8.2 QNAND-CIRC programs (optional)
Just like in the classical case, there is an equivalence between circuits and straight-line programs, and so we can define the programming language QNAND-CIRC that is the quantum analog of our NAND-CIRC programming language. To do so, we only add a single operation: HAD(foo), which applies the single-bit operation $H$ to the variable foo. We also use the following interpretation to make NAND reversible: foo = NAND(bar,blah) means that we modify foo to be the XOR of its original value and the NAND of bar and blah. (In other words, apply the $8$ by $8$ unitary transformation $U_{NAND}$ defined above to the three qubits corresponding to foo, bar and blah.) If foo is initialized to zero then this makes no difference.

If $P$ is a QNAND-CIRC program with $n$ input variables, $\ell$ workspace variables, and $m$ output variables, then running it on the input $x \in \{0,1\}^n$ corresponds to setting up a system with $n + m + \ell$ qubits and performing the following process:

1. We initialize the input variables X[0], ..., X[$n-1$] to $x_0, \ldots, x_{n-1}$ and all other variables to $0$.

2. We execute the program line by line, applying the corresponding physical operation $H$ or $U_{NAND}$ to the qubits that are referred to by the line.

3. We measure the output variables Y[0], ..., Y[$m-1$] and output the result (if there is more than one output then we measure more variables).
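Assuming the `had` and `u_nand` helpers from the simulator sketch earlier in this chapter, these three steps can themselves be sketched in Python; the program below and its variable-to-qubit assignment are hypothetical examples:

```python
import numpy as np

# A hypothetical 3-qubit QNAND-CIRC program: HAD on a workspace variable w,
# then the line Y[0] = NAND(X[0], w), i.e. XOR NAND(X[0], w) into Y[0].
prog = [("HAD", "w"), ("NAND", "Y[0]", "X[0]", "w")]
qubit = {"X[0]": 0, "w": 1, "Y[0]": 2}     # variable -> qubit index

m = 3
psi = np.zeros(2 ** m); psi[0b100] = 1.0   # step 1: X[0] = 1, all others 0
for line in prog:                           # step 2: one unitary per line
    if line[0] == "HAD":
        psi = had(psi, qubit[line[1]], m)
    else:                                   # foo = NAND(bar, blah)
        foo, bar, blah = line[1:]
        psi = u_nand(psi, qubit[bar], qubit[blah], qubit[foo], m)
probs = np.abs(psi) ** 2                    # step 3: measure; Y[0] is qubit 2
```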

23.8.3 Uniform computation
Just as in the classical case, we can define uniform computational models for quantum computing as well. We will let BQP be the quantum analog to P and BPP: the class of all Boolean functions $F : \{0,1\}^* \to \{0,1\}$ that can be computed by quantum algorithms in polynomial time. There are several equivalent ways to define BQP. For example, there is a computational model of Quantum Turing Machines that can be used to define BQP just as standard Turing machines are used to define P. Another alternative is to define the QNAND-TM programming language to be QNAND-CIRC augmented with loops and arrays just like NAND-TM is obtained from NAND-CIRC. Once again, we can define BQP using QNAND-TM programs analogously to the way P can be defined using NAND-TM programs. However, we use the following equivalent definition (which is also the one most popular in the literature):

Definition 23.10 — The class BQP. Let $F : \{0,1\}^* \to \{0,1\}$. We say that $F \in$ BQP if there exists a polynomial time NAND-TM program $P$ such that for every $n$, $P(1^n)$ is the description of a quantum circuit $C_n$ that computes the restriction of $F$ to $\{0,1\}^n$.

P
Definition 23.10 is the quantum analog of the alternative characterization of P that appears in Solved Exercise 13.4. One way to verify that you've understood Definition 23.10 is to see that you can prove (1) P ⊆ BQP and in fact the stronger statement BPP ⊆ BQP, (2) BQP ⊆ EXP, and (3) for every NP-complete function $F$, if $F \in$ BQP then NP ⊆ BQP. Exercise 23.1 asks you to work these out.

The relation between NP and BQP is not known (see also Remark 23.4). It is widely believed that NP ⊈ BQP, but there is no consensus whether or not BQP ⊆ NP. It is quite possible that these two classes are incomparable, in the sense that NP ⊈ BQP (and in particular no NP-complete function belongs to BQP) but also BQP ⊈ NP (and there are some interesting candidates for such problems).

It can be shown that QNANDEVAL (evaluating a quantum circuit on an input) is computable by a polynomial size QNAND-CIRC program, and moreover this program can even be generated uniformly and hence QNANDEVAL is in BQP. This allows us to "port" many of the results of classical computational complexity into the quantum realm, including the notions of a universal quantum Turing machine, as well as all of the uncomputability results. There is even a quantum analog of the Cook-Levin Theorem.

R Remark 23.11 — Restricting attention to circuits. Because the non uniform model is a little cleaner to work with, in the rest of this chapter we mostly restrict attention to this model, though all the algorithms we discuss can be implemented using uniform algorithms as well.

23.9 PHYSICALLY REALIZING QUANTUM COMPUTATION

To realize quantum computation one needs to create a system with $n$ independent binary states (i.e., "qubits"), and be able to manipulate small subsets of two or three of these qubits to change their state. While by the way we defined operations above it might seem that one needs to be able to perform arbitrary unitary operations on these two or three qubits, it turns out that there are several choices for universal sets - a small constant number of gates that generate all others. The biggest challenge is how to keep the system from being measured and collapsing to a single classical combination of states. This is sometimes known as the coherence time of the system. The threshold theorem says that there is some absolute constant level of errors $\tau$ so that if errors are created at every gate at rate smaller than $\tau$ then we can recover from those and perform arbitrarily long computations. (Of course there are different ways to model the errors and so there are actually several threshold theorems corresponding to various noise models.) There have been several proposals to build quantum computers:

• Superconducting quantum computers use superconducting electric circuits to do quantum computation. This is the direction where there has been most recent progress towards "beating" classical computers.

[Figure 23.3: Superconducting quantum computer prototype at Google. Image credit: Google / MIT Technology Review.]

• Trapped ion quantum computers use the states of an ion to simulate a qubit. People have made some recent advances on these computers too. While it's not at all clear that's the right measuring yard, the current best implementation of Shor's algorithm (for factoring 15) is done using an ion-trap computer.

• Topological quantum computers use a different technology. Topological qubits are more stable by design and hence error correction is less of an issue, but constructing them is extremely challenging.

These approaches are not mutually exclusive and it could be that ultimately quantum computers are built by combining all of them together. In the near future, it seems that we will not be able to achieve full fledged large scale universal quantum computers, but rather more restricted machines, sometimes called "Noisy Intermediate-Scale Quantum Computers" or "NISQ". See this article by John Preskill for some of the progress and applications of such more restricted devices.

23.10 SHOR'S ALGORITHM: HEARING THE SHAPE OF PRIME FACTORS

Bell's Inequality is a powerful demonstration that there is something very strange going on with quantum mechanics. But could this "strangeness" be of any use to solve computational problems not directly related to quantum systems? A priori, one could guess the answer is no. In 1994 Peter Shor showed that one would be wrong:

Theorem 23.12 — Shor's Algorithm. There is a polynomial-time quantum algorithm that on input an integer $M$ (represented in base two), outputs the prime factorization of $M$.

Another way to state Theorem 23.12 is that if we define FACTORING$: \{0,1\}^* \to \{0,1\}$ to be the function that on input a pair of numbers $(M, X)$ outputs $1$ if and only if $M$ has a factor $P$ such that $2 \leq P \leq X$, then FACTORING is in BQP. This is an exponential improvement over the best known classical algorithms, which take roughly $2^{\tilde{O}(n^{1/3})}$ time, where the $\tilde{O}$ notation hides factors that are polylogarithmic in $n$. While we will not prove Theorem 23.12 in this chapter, we will sketch some of the ideas behind the proof.

23.10.1 Period finding
At the heart of Shor's Theorem is an efficient quantum algorithm for finding periods of a given function. For example, a function $f : \mathbb{R} \to \mathbb{R}$ is periodic if there is some $h > 0$ such that $f(x + h) = f(x)$ for every $x$ (e.g., see Fig. 23.4).

[Figure 23.4: Top: A periodic function. Bottom: An a-periodic function.]

Musical notes yield one type of periodic function. When you pull on a string on a musical instrument, it vibrates in a repeating pattern. Hence, if we plot the speed of the string (and so also the speed of the air around it) as a function of time, it will correspond to some periodic function. The length of the period is known as the wave length of the note. The frequency is the number of times the function repeats itself within a unit of time. For example, the "Middle C" note has a frequency of $261.63$ Hertz, which means its period is $1/261.63$ seconds.

If we play a chord by playing several notes at once, we get a more complex periodic function obtained by combining the functions of the individual notes (see Fig. 23.5). The human ear contains many small hairs, each of which is sensitive to a narrow band of frequencies. Hence when we hear the sound corresponding to a chord, the hairs in our ears actually separate it out to the components corresponding to each frequency.

It turns out that (essentially) every periodic function $f : \mathbb{R} \to \mathbb{R}$ can be decomposed into a sum of simple wave functions (namely functions of the form $x \mapsto \sin(\theta x)$ or $x \mapsto \cos(\theta x)$). This is known as the Fourier Transform (see Fig. 23.6). The Fourier transform makes it easy to compute the period of a given function: it will simply be the least common multiple of the periods of the constituent waves.
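As a small classical illustration of this last point, the following Python/numpy sketch synthesizes one second of the C major chord from Fig. 23.5 and reads the three constituent frequencies off the largest Fourier coefficients (the sampling rate is an arbitrary choice):

```python
import numpy as np

rate = 8192                                    # samples per second (arbitrary)
t = np.arange(rate) / rate                     # one second of time stamps
chord = (np.sin(2 * np.pi * 261.63 * t)        # C
         + np.sin(2 * np.pi * 329.63 * t)      # E
         + np.sin(2 * np.pi * 392.00 * t))     # G

spectrum = np.abs(np.fft.rfft(chord))          # magnitudes of Fourier coeffs
freqs = np.fft.rfftfreq(len(chord), d=1 / rate)
top3 = np.sort(freqs[np.argsort(spectrum)[-3:]])
print(top3)                                    # ~[262, 330, 392] Hz
```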

[Figure 23.5: Left: The air-pressure when playing a "C Major" chord as a function of time. Right: The coefficients of the Fourier transform of the same function; we can see that it is the sum of three frequencies corresponding to the C, E and G notes (261.63, 329.63 and 392 Hertz respectively). Credit: Bjarke Mønsted's Quora answer.]

[Figure 23.6: If $f$ is a periodic function, then when we represent it in the Fourier transform, we expect the coefficients corresponding to wavelengths that do not evenly divide the period to be very small, as they would tend to "cancel out".]

23.10.2 Shor's Algorithm: A bird's eye view
On input an integer $M$, Shor's algorithm outputs the prime factorization of $M$ in time that is polynomial in $\log M$. The main steps in the algorithm are the following:

Step 1: Reduce to period finding. The first step in the algorithm is to pick a random $A \in \{0, 1, \ldots, M-1\}$ and define the function $F_A : \{0,1\}^m \to \{0,1\}^m$ as $F_A(x) = A^x \pmod{M}$, where we identify the string $x \in \{0,1\}^m$ with an integer using the binary representation, and similarly represent the integer $A^x \pmod{M}$ as a string. (We will choose $m$ to be some polynomial in $\log M$ and so in particular $\{0,1\}^m$ is a large enough set to represent all the numbers in $\{0, 1, \ldots, M-1\}$.) Some not-too-hard (though somewhat technical) calculations show that: (1) The function $F_A$ is periodic (i.e., there is some integer $p_A$ such that $F_A(x + p_A) = F_A(x)$ for "almost" every $x$) and more importantly (2) If we can recover the period $p_A$ of $F_A$ for several randomly chosen $A$'s, then we can recover the factorization of $M$. (We'll ignore the "almost" qualifier in the discussion below; it causes some annoying, yet ultimately manageable, technical issues in the full-fledged algorithm.) Hence, factoring $M$ reduces to finding out the period of the function $F_A$. Exercise 23.2 asks you to work this out for the related

task of computing the discrete logarithm (which underlies the security of the Diffie-Hellman key exchange and elliptic curve cryptography).
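Here is a purely classical Python sketch of this reduction, with a brute-force period finder standing in for the quantum step (the values $M = 15$ and $A = 7$ are toy examples, and the even-period case shown is the favorable one; the full algorithm handles the other cases by re-sampling $A$):

```python
from math import gcd

M, A = 15, 7                        # toy example values (assumed coprime)

def period(A, M):
    """Brute-force search for the period p of x -> A^x (mod M), i.e. the
    smallest p >= 1 with A^p = 1 (mod M). Finding p is exactly the step
    that Shor's algorithm performs with the Quantum Fourier Transform."""
    y, p = A % M, 1
    while y != 1:
        y, p = (y * A) % M, p + 1
    return p

p = period(A, M)                    # 7^4 = 2401 = 1 (mod 15), so p = 4
if p % 2 == 0:                      # the favorable case; otherwise re-sample A
    factor = gcd(pow(A, p // 2, M) - 1, M)
    print(p, factor)                # prints "4 3", and 3 indeed divides 15
```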

Step 2: Period finding via the Quantum Fourier Transform. Using a simple trick known as "repeated squaring", it is possible to compute the map $x \mapsto F_A(x)$ in time polynomial in $m$, which means we can also compute this map using a polynomial number of NAND gates, and so in particular we can generate in polynomial quantum time a quantum state $\rho$ that is (up to normalization) equal to

$$\rho = \sum_{x\in\{0,1\}^m} |x\rangle |F_A(x)\rangle. \qquad (23.13)$$

In particular, if we were to measure the state $\rho$, we would get a random pair of the form $(x, y)$ where $y = F_A(x)$. So far, this is not at all impressive. After all, we did not need the power of quantum computing to generate such pairs: we could simply generate a random $x$ and then compute $F_A(x)$.

Another way to describe the state $\rho$ is that the coefficient of $|x\rangle|y\rangle$ in $\rho$ is proportional to $f_{A,y}(x)$ where $f_{A,y} : \{0,1\}^m \to \mathbb{R}$ is the function such that

$$f_{A,y}(x) = \begin{cases} 1 & y = A^x \pmod{M} \\ 0 & \text{otherwise} \end{cases}. \qquad (23.14)$$

The magic of Shor's algorithm comes from a procedure known as the Quantum Fourier Transform. It allows to change the state $\rho$ into the state $\hat{\rho}$ where the coefficient of $|x\rangle|y\rangle$ is now proportional to the $x$-th Fourier coefficient of $f_{A,y}$. In other words, if we measure the state $\hat{\rho}$, we will obtain a pair $(x, y)$ such that the probability of choosing $x$ is proportional to the square of the weight of the frequency $x$ in the representation of the function $f_{A,y}$. Since for every $y$, the function $f_{A,y}$ has the period $p_A$, it can be shown that the frequency $x$ will be (almost) a multiple of $\frac{2^m}{p_A}$. If we make several such samples $y_1, \ldots, y_k$ and obtain the frequencies $x_1, \ldots, x_k$, then $\frac{2^m}{p_A}$ divides all of them, and it can be shown that it is going to be in fact their greatest common divisor (g.c.d.): a value which can be computed in polynomial time, and from which we can recover the period $p_A$. As mentioned above, we can recover the factorization of $M$ from the periods of $F_{A_0}, \ldots, F_{A_t}$ for some randomly chosen $A_0, \ldots, A_t$ in $\{0, \ldots, M-1\}$ where $t$ is polynomial in $\log M$.

The resulting algorithm can be described at a high (and somewhat inaccurate) level as follows:

Shor's Algorithm: (sketch)
Input: Number $M \in \mathbb{N}$.
Output: Prime factorization of $M$.
Operations:

1. Repeat the following $k = poly(\log M)$ number of times:

a. Choose $A \in \{0, \ldots, M-1\}$ at random, and let $f_A : \mathbb{Z}_M \to \mathbb{Z}_M$ be the map $x \mapsto A^x \bmod M$.

b. For $t = poly(\log M)$, repeat $t$ times the following step: use the Quantum Fourier Transform to create a quantum state $|\psi\rangle$ over $poly(\log(M))$ qubits, such that if we measure $|\psi\rangle$ we obtain a pair of strings $(j, y)$ with probability proportional to the square of the coefficient corresponding to the wave function $x \mapsto \cos(x\pi j/M)$ or $x \mapsto \sin(x\pi j/M)$ in the Fourier transform of the function $f_{A,y} : \mathbb{Z}_M \to \{0,1\}$ defined as $f_{A,y}(x) = 1$ iff $f_A(x) = y$.

c. If $j_1, \ldots, j_t$ are the coefficients we obtained in the previous step, then the least common multiple of $M/j_1, \ldots, M/j_t$ is likely to be the period $p_A$ of the function $f_A$.

2. If $A_0, \ldots, A_{k-1}$ are the numbers we chose in the previous step and $p_{A_0}, \ldots, p_{A_{k-1}}$ are the corresponding periods of the functions $f_{A_0}, \ldots, f_{A_{k-1}}$, then we can use classical results in number theory to obtain from these a non-trivial prime factor $Q$ of $M$ (if such exists). We can now run the algorithm again with the (smaller) input $M/Q$ to obtain all other factors.

Reducing factoring to order finding is cumbersome, but can be done in polynomial time using a classical computer. The key quantum ingredient in Shor's algorithm is the Quantum Fourier Transform.

R Remark 23.13 — Quantum Fourier Transform. Despite its name, the Quantum Fourier Transform does not actually give a way to compute the Fourier Transform of a function $f : \{0,1\}^m \to \mathbb{R}$. This would be impossible to do in time polynomial in $m$, as simply writing down the Fourier Transform would require $2^m$ coefficients. Rather the Quantum Fourier Transform gives a quantum state where the amplitude corresponding to an element (think: frequency) $h$ is equal to the corresponding Fourier coefficient. This allows to sample from a distribution where $h$ is drawn with probability proportional to the square of its Fourier coefficient. This is not the same as computing the Fourier transform, but is good enough for recovering the period.

23.11 QUANTUM FOURIER TRANSFORM (ADVANCED, OPTIONAL)

The above description of Shor's algorithm skipped over the implementation of the main quantum ingredient: the Quantum Fourier Transform algorithm. In this section we discuss the ideas behind this algorithm. We will be rather brief and imprecise. Section 23.13 contains references to sources of more information about this topic.

To understand the Quantum Fourier Transform, we need to better understand the Fourier Transform itself. In particular, we will need to understand how it applies not just to functions whose input is a real number but to functions whose domain can be any arbitrary commutative group. Therefore we now take a short detour to (very basic) group theory, and define the notion of periodic functions over groups.

R Remark 23.14 — Group theory. While we define the concepts we use, some background in group or number theory will be very helpful for fully understanding this section. In particular we will use the notion of finite commutative (a.k.a. Abelian) groups. These are defined as follows.

• A finite group $\mathbb{G}$ is a pair $(G, \star)$ where $G$ is a finite set of elements and $\star$ is a binary operation mapping a pair of elements $g, h$ in $G$ to the element $g \star h$ in $G$. We often identify $\mathbb{G}$ with the set of its elements, and so use notation such as $g \in \mathbb{G}$ to indicate that $g$ is an element of $\mathbb{G}$ and $|\mathbb{G}|$ to denote the number of elements in $\mathbb{G}$.

• The operation $\star$ satisfies the sort of properties that a product operation does, namely, it is associative (i.e., $(g \star h) \star f = g \star (h \star f)$) and there is some element $1$ such that $g \star 1 = g$ for all $g$, and for every $g \in \mathbb{G}$ there exists an element $g^{-1}$ such that $g \star g^{-1} = 1$.

• A group $\mathbb{G}$ is called commutative (also known as Abelian) if $g \star h = h \star g$ for all $g, h \in \mathbb{G}$.

The Fourier transform is a deep and vast topic, which we will barely touch upon here. Over the real numbers, the Fourier transform of a function $f$ is obtained by expressing $f$ in the form $\sum \hat{f}(\alpha)\chi_\alpha$ where the $\chi_\alpha$'s are "wave functions" (e.g. sines and cosines). However, it turns out that the same notion exists for every Abelian group $\mathbb{G}$. Specifically, for every such group $\mathbb{G}$, if $f$ is a function mapping $\mathbb{G}$ to $\mathbb{C}$, then we can write $f$ as

$$f = \sum_{g\in\mathbb{G}} \hat{f}(g)\chi_g \qquad (23.15)$$

where the $\chi_g$'s are functions mapping $\mathbb{G}$ to $\mathbb{C}$ that are analogs of the "wave functions" for the group $\mathbb{G}$ and for every $g \in \mathbb{G}$, $\hat{f}(g)$ is a complex number known as the Fourier coefficient of $f$ corresponding to $g$. Specifically, the equation (23.15) means that if we think of $f$ as a $|\mathbb{G}|$ dimensional vector over the complex numbers, then we can write this vector as a sum (with certain coefficients) of the vectors $\{\chi_g\}_{g\in\mathbb{G}}$.

The representation (23.15) is known as the Fourier expansion or Fourier transform of $f$, the numbers $(\hat{f}(g))_{g\in\mathbb{G}}$ are known as the Fourier coefficients of $f$ and the functions $(\chi_g)_{g\in\mathbb{G}}$ are known as the Fourier characters. The central property of the Fourier characters is that they are homomorphisms of the group into the complex numbers, in the sense that for every $x, x' \in \mathbb{G}$, $\chi_g(x \star x') = \chi_g(x)\chi_g(x')$, where $\star$ is the group operation. One corollary of this property is that if $\chi_g(h) = 1$ then $\chi_g$ is $h$-periodic in the sense that $\chi_g(x \star h) = \chi_g(x)$ for every $x$. It turns out that if $f$ is periodic with minimal period $h$, then the only Fourier characters that have non zero coefficients in the expression (23.15) are those that are $h$-periodic as well. This can be used to recover the period of $f$ from its Fourier expansion.

23.11.1 Quantum Fourier Transform over the Boolean Cube: Simon's Algorithm
We now describe the simplest setting of the Quantum Fourier Transform: the group $\{0,1\}^n$ with the XOR operation, which we'll denote by $(\{0,1\}^n, \oplus)$. (Since XOR is equal to addition modulo two, this group is also often denoted as $(\mathbb{Z}_2)^n$.) It can be shown that the Fourier transform over $(\{0,1\}^n, \oplus)$ corresponds to expressing $f : \{0,1\}^n \to \mathbb{C}$ as

$$f = \sum_{y\in\{0,1\}^n} \hat{f}(y)\chi_y \qquad (23.16)$$

where $\chi_y : \{0,1\}^n \to \mathbb{C}$ is defined as $\chi_y(x) = (-1)^{\sum_i y_i x_i}$ and $\hat{f}(y) = \frac{1}{\sqrt{2^n}}\sum_{x\in\{0,1\}^n} f(x)(-1)^{\sum_i y_i x_i}$.

The Quantum Fourier Transform over $(\{0,1\}^n, \oplus)$ is actually quite simple:

Theorem 23.15 — QFT Over the Boolean Cube. Let $\rho = \sum_{x\in\{0,1\}^n} f(x)|x\rangle$ be a quantum state where $f : \{0,1\}^n \to \mathbb{C}$ is some function satisfying $\sum_{x\in\{0,1\}^n} |f(x)|^2 = 1$. Then we can use $n$ gates to transform $\rho$ to the state

$$\sum_{y\in\{0,1\}^n} \hat{f}(y)|y\rangle \qquad (23.17)$$

where $f = \sum_y \hat{f}(y)\chi_y$ and $\chi_y : \{0,1\}^n \to \mathbb{C}$ is the function $\chi_y(x) = (-1)^{\sum_i x_i y_i}$.

Proof Idea: The idea behind the proof is that the Hadamard operation corresponds to the Fourier transform over the group $\{0,1\}^n$ (with the XOR operations). To show this, we just need to do the calculations. ⋆


Proof of Theorem 23.15. We can express the Hadamard operation HAD as follows:

$$HAD|a\rangle = \frac{1}{\sqrt{2}}(|0\rangle + (-1)^a|1\rangle). \qquad (23.18)$$

We are given the state

$$\rho = \sum_{x\in\{0,1\}^n} f(x)|x\rangle. \qquad (23.19)$$

Now suppose that we apply the HAD operation to each of the $n$ qubits. We can see that we get the state

$$2^{-n/2}\sum_{x\in\{0,1\}^n} f(x)\prod_{i=0}^{n-1}\left(|0\rangle + (-1)^{x_i}|1\rangle\right). \qquad (23.20)$$

We can now use the distributive law and open up a term of the form

$$f(x)(|0\rangle + (-1)^{x_0}|1\rangle)\cdots(|0\rangle + (-1)^{x_{n-1}}|1\rangle) \qquad (23.21)$$

to the following sum over $2^n$ terms:

$$f(x)\sum_{y\in\{0,1\}^n}(-1)^{\sum y_i x_i}|y\rangle. \qquad (23.22)$$

(If you find the above confusing, try to work out explicitly this calculation for $n = 3$; namely show that $(|0\rangle + (-1)^{x_0}|1\rangle)(|0\rangle + (-1)^{x_1}|1\rangle)(|0\rangle + (-1)^{x_2}|1\rangle)$ is the same as the sum over $2^3$ terms $|000\rangle + (-1)^{x_2}|001\rangle + \cdots + (-1)^{x_0 + x_1 + x_2}|111\rangle$.)

By changing the order of summations, we see that the final state is

$$\sum_{y\in\{0,1\}^n} 2^{-n/2}\left(\sum_{x\in\{0,1\}^n} f(x)(-1)^{\sum x_i y_i}\right)|y\rangle \qquad (23.23)$$

which exactly corresponds to $\hat{\rho}$. ■
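A small numerical sanity check of Theorem 23.15 (a sketch using numpy; the seed and $n = 3$ are arbitrary choices): the matrix of HAD applied to every qubit is $H^{\otimes n}$, whose $(y, x)$ entry is $2^{-n/2}(-1)^{\sum_i x_i y_i}$, i.e., exactly the Fourier transform over $(\{0,1\}^n, \oplus)$:

```python
import numpy as np

n = 3
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hn = H
for _ in range(n - 1):
    Hn = np.kron(Hn, H)                       # H applied to each of the n qubits

rng = np.random.default_rng(1)
f = rng.normal(size=2 ** n)
f /= np.linalg.norm(f)                        # a legal quantum state

# fhat(y) = 2^{-n/2} * sum_x f(x) * (-1)^{<x, y>}, computed directly:
fhat = np.array([sum(f[x] * (-1) ** bin(x & y).count("1")
                     for x in range(2 ** n))
                 for y in range(2 ** n)]) / 2 ** (n / 2)

print(np.allclose(Hn @ f, fhat))              # True
```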

23.11.2 From Fourier to Period finding: Simon's Algorithm (advanced, optional)
Using Theorem 23.15 it is not hard to get an algorithm that can recover a string $h^* \in \{0,1\}^n$ given a circuit that computes a function $F : \{0,1\}^n \to \{0,1\}^*$ that is $h^*$-periodic in the sense that for distinct $x, x'$, $F(x) = F(x')$ if and only if $x' = x \oplus h^*$. The key observation is that if we compute the state $\sum_{x\in\{0,1\}^n} |x\rangle|F(x)\rangle$, and perform the Quantum Fourier transform on the first $n$ qubits, then we would get a state such that the only basis elements with nonzero coefficients would be of the form $|y\rangle$ where

$$\sum_i y_i h^*_i = 0 \pmod{2}. \qquad (23.24)$$

So, by measuring the state, we can obtain a sample of a random $y$ satisfying (23.24). But since (23.24) is a linear equation modulo $2$ about the $n$ unknown variables $h^*_0, \ldots, h^*_{n-1}$, if we repeat this procedure to get $n$ such equations, we will have at least as many equations as variables and (it can be shown that) this will suffice to recover $h^*$. This result is known as Simon's Algorithm, and it preceded and inspired Shor's algorithm.

23.11.3 From Simon to Shor (advanced, optional)
Theorem 23.15 seemed to really use the special bit-wise structure of the group $\{0,1\}^n$, and so one could wonder if it can be extended to other groups. However, it turns out that we can in fact achieve such a generalization.

The key step in Shor's algorithm is to implement the Fourier transform for the group $\mathbb{Z}_L$ which is the set of numbers $\{0, \ldots, L-1\}$ with the operation being addition modulo $L$. In this case it turns out that the Fourier characters are the functions $\chi_y(x) = \omega^{yx}$ where $\omega = e^{2\pi i/L}$ (here $i$ denotes the complex number $\sqrt{-1}$). The $y$-th Fourier coefficient of a function $f : \mathbb{Z}_L \to \mathbb{C}$ is

$$\hat{f}(y) = \frac{1}{\sqrt{L}}\sum_{x\in\mathbb{Z}_L} f(x)\omega^{xy}. \qquad (23.25)$$

The key to implementing the Quantum Fourier Transform for such groups is to use the same recursive equations that enable the classical Fast Fourier Transform (FFT) algorithm. Specifically, consider the case that $L = 2^\ell$. We can separate the sum over $x$ in (23.25) to the terms corresponding to even $x$'s (of the form $x = 2z$) and odd $x$'s (of the form $x = 2z + 1$) to obtain

$$\hat{f}(y) = \frac{1}{\sqrt{L}}\sum_{z\in\mathbb{Z}_{L/2}} f(2z)(\omega^2)^{yz} + \frac{\omega^y}{\sqrt{L}}\sum_{z\in\mathbb{Z}_{L/2}} f(2z+1)(\omega^2)^{yz} \qquad (23.26)$$

which reduces computing the Fourier transform of $f$ over the group $\mathbb{Z}_{2^\ell}$ to computing the Fourier transform of the functions $f_{even}$ and $f_{odd}$ (corresponding to applying $f$ to only the even and odd $x$'s respectively) which have $2^{\ell-1}$ inputs that we can identify with the group $\mathbb{Z}_{2^{\ell-1}} = \mathbb{Z}_{L/2}$.

Specifically, the Fourier characters of the group $\mathbb{Z}_{L/2}$ are the functions $\chi_y(x) = e^{2\pi i/(L/2) \cdot yx} = (\omega^2)^{yx}$ for every $x, y \in \mathbb{Z}_{L/2}$. Moreover, since $\omega^L = 1$, $(\omega^2)^y = (\omega^2)^{y \bmod L/2}$ for every $y \in \mathbb{N}$. Thus (23.26) translates into

$$\hat{f}(y) = \hat{f}_{even}(y \bmod L/2) + \omega^y \hat{f}_{odd}(y \bmod L/2). \qquad (23.27)$$

This observation is usually used to obtain a fast (e.g., $O(L \log L)$ time) algorithm to compute the Fourier transform in a classical setting, but it can

be used to obtain a quantum circuit of $poly(\log L)$ gates to transform a state of the form $\sum_{x\in\mathbb{Z}_L} f(x)|x\rangle$ to a state of the form $\sum_{y\in\mathbb{Z}_L} \hat{f}(y)|y\rangle$.

The case that $L$ is not an exact power of two causes some complications in both the classical case of the Fast Fourier Transform and the quantum setting of Shor's algorithm. However, it is possible to handle these. The idea is that we can embed $\mathbb{Z}_L$ in the group $\mathbb{Z}_{A\cdot L}$ for any integer $A$, and we can find an integer $A$ such that $A \cdot L$ will be close enough to a power of $2$ (i.e., a number of the form $2^m$ for some $m$), so that if we do the Fourier transform over the group $\mathbb{Z}_{2^m}$ then we will not introduce too many errors.
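For reference, here is the recursion (23.26)/(23.27) as a short classical Python sketch, with an explicit $1/\sqrt{2}$ factor at each level so that the output matches the $1/\sqrt{L}$ normalization of (23.25); the quantum circuit implements the same recursion with $poly(\log L)$ gates:

```python
import cmath

def fourier(f):
    """Fourier transform over Z_L for L = len(f) a power of two, following
    (23.25)-(23.27): returns fhat with fhat[y] = (1/sqrt(L)) * sum_x f[x] *
    omega^(x*y) where omega = e^(2*pi*i/L)."""
    L = len(f)
    if L == 1:
        return list(f)
    even = fourier(f[0::2])                  # transform of z -> f(2z)
    odd = fourier(f[1::2])                   # transform of z -> f(2z+1)
    omega = cmath.exp(2j * cmath.pi / L)
    return [(even[y % (L // 2)] + omega ** y * odd[y % (L // 2)])
            / cmath.sqrt(2) for y in range(L)]

# Example: the transform of a "delta at 0" is the uniform vector 1/sqrt(L).
print(fourier([1, 0, 0, 0]))                 # four entries, each ~0.5
```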

[Figure 23.7: Conjectured status of BQP with respect to other complexity classes. We know that P ⊆ BPP ⊆ BQP and BQP ⊆ PSPACE ⊆ EXP. It is not known if any of these inclusions are strict, though it is believed that they are. The relation between BQP and NP is unknown, but they are believed to be incomparable, and that NP-complete problems cannot be solved in polynomial time by quantum computers. However, it is possible that BQP contains NP and even PSPACE, and it is also possible that quantum computers offer no super-polynomial speedups and that P = BQP. The class "NISQ" above is not a well defined complexity class, but rather captures the current status of quantum devices, which seem to be able to solve a set of computational tasks that is incomparable with the set of tasks solvable by classical computers. The diagram is also inaccurate in the sense that at the moment the "quantum supremacy" tasks on which such devices seem to offer exponential speedups do not correspond to Boolean functions/decision problems.]

Chapter Recap

• The state of an $n$-qubit quantum system can be modeled as a $2^n$ dimensional vector.
• An operation on the state corresponds to applying a unitary matrix to this vector.
• Quantum circuits are obtained by composing basic operations such as HAD and $U_{NAND}$.
• We can use quantum circuits to define the classes BQP$_{/poly}$ and BQP which are the quantum analogs of P$_{/poly}$ and BPP respectively.
• There are some problems for which the best known quantum algorithm is exponentially faster than the best known classical one, but quantum computing is not a panacea. In particular, as far as we know, quantum computers could still require exponential time to solve NP-complete problems such as SAT.

23.12 EXERCISES

Exercise 23.1 — Quantum and classical complexity class relations. Prove the following relations between quantum complexity classes and classical ones:

1. P$_{/poly}$ ⊆ BQP$_{/poly}$. (Hint: you can use $U_{NAND}$ to simulate NAND gates.)
2. P ⊆ BQP. (Hint: use the alternative characterization of P as in Solved Exercise 13.4.)
3. BPP ⊆ BQP. (Hint: you can use the HAD gate to simulate a coin toss.)
4. BQP ⊆ EXP. (Hint: in exponential time, simulating quantum computation boils down to matrix multiplication.)
5. If SAT ∈ BQP then NP ⊆ BQP. (Hint: if a reduction can be implemented in P, it can be implemented in BQP as well.)
■

Exercise 23.2 — Discrete logarithm from order finding. Show a probabilistic polynomial time classical algorithm that given an Abelian finite group $\mathbb{G}$ (in the form of an algorithm that computes the group operation), a generator $g$ for the group, and an element $h \in \mathbb{G}$, as well as access to a black box that on input $f \in \mathbb{G}$ outputs the order of $f$ (the smallest $a$ such that $f^a = 1$), computes the discrete logarithm of $h$ with respect to $g$. That is, the algorithm should output a number $x$ such that $g^x = h$. (Hint: we are given $h = g^x$ and need to recover $x$. To do so we can compute the order of various elements of the form $h^a g^b$. The order of such an element is a number $c$ satisfying $c(xa + b) = 0 \pmod{|\mathbb{G}|}$. With a few random examples we will get a non trivial equation on $x$ (where $c$ is not zero modulo $|\mathbb{G}|$) and then we can use our knowledge of $a, b, c$ to recover $x$.)
■

23.13 BIBLIOGRAPHICAL NOTES

An excellent gentle introduction to quantum computation is given in Mermin's book [Mer07]. In particular the first 100 pages (Chapters 1 to 4) of [Mer07] cover all the material of this chapter in a much more comprehensive way. This material is also covered in the first 5 chapters of De-Wolf's online lecture notes. For a more condensed exposition, the chapter on quantum computation in my book with Arora (see draft here) is one relatively short source that contains full descriptions of Grover's, Simon's and Shor's algorithms. This blog post of Aaronson contains a high level explanation of Shor's algorithm which ends with links to several more detailed expositions. Chapters 9 and 10 in Aaronson's book [Aar13] give an informal but highly informative introduction to the topics of this chapter and much more. Chapter 10 in Avi Wigderson's book also provides a high level overview of quantum computing. Other recommended resources include Andrew Childs' lecture notes on quantum algorithms, as well as the lecture notes of Umesh Vazirani, John Preskill, and John Watrous.

There are many excellent videos available online covering some of these materials. The videos of Umesh Vazirani's EdX course are an

accessible and recommended introduction to quantum computing. Regarding quantum mechanics in general, this video illustrates the double slit experiment, and this Scientific American video is a nice exposition of Bell's Theorem. This talk and panel moderated by Brian Greene discusses some of the philosophical and technical issues around quantum mechanics and its so called "measurement problem". The Feynman lecture on the Fourier Transform and quantum mechanics in general are very much worth reading. The Fourier transform is covered in these videos of Dr. Chris Geoscience, Clare Zhang and Vi Hart. See also Kelsey Houston-Edwards's video on Shor's Algorithm.

The form of Bell's game we discuss in Section 23.3 was given by Clauser, Horne, Shimony, and Holt.

The Fast Fourier Transform, used as a component in Shor's algorithm, is one of the most widely used algorithms invented. The stories of its discovery by Gauss in trying to calculate asteroid orbits and rediscovery by Tukey during the cold war are fascinating as well.

The image in Fig. 23.2 is taken from Wikipedia.

Thanks to Scott Aaronson for many helpful comments about this chapter.

VI APPENDICES

Bibliography

[Aar05] Scott Aaronson. "NP-complete problems and physical reality". In: ACM Sigact News 36.1 (2005). Available on https://arxiv.org/abs/quant-ph/0502072, pp. 30–52.
[Aar13] Scott Aaronson. Quantum computing since Democritus. Cambridge University Press, 2013.
[Aar16] Scott Aaronson. "P =? NP". In: Open problems in mathematics. Available on https://www.scottaaronson.com/papers/pnp.pdf. Springer, 2016, pp. 1–122.
[Aar20] Scott Aaronson. "The Busy Beaver Frontier". In: SIGACT News (2020). Available on https://www.scottaaronson.com/papers/bb.pdf.
[AKS04] Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. "PRIMES is in P". In: Annals of mathematics (2004), pp. 781–793.
[Asp18] James Aspnes. Notes on Discrete Mathematics. Online textbook for CS 202. Available on http://www.cs.yale.edu/homes/aspnes/classes/202/notes.pdf. 2018.
[Bar84] H. P. Barendregt. The lambda calculus: its syntax and semantics. Amsterdam New York New York, N.Y: North-Holland Sole distributors for the U.S.A. and Canada, Elsevier Science Pub. Co, 1984. isbn: 0444875085.
[Bjo14] Andreas Bjorklund. "Determinant sums for undirected hamiltonicity". In: SIAM Journal on Computing 43.1 (2014), pp. 280–299.
[Blä13] Markus Bläser. Fast Matrix Multiplication. Graduate Surveys 5. Theory of Computing Library, 2013, pp. 1–60. doi: 10.4086/toc.gs.2013.005. url: http://www.theoryofcomputing.org/library.html.
[Boo47] George Boole. The mathematical analysis of logic. Philosophical Library, 1847.


[BU11] David Buchfuhrer and Christopher Umans. "The complexity of Boolean formula minimization". In: Journal of Computer and System Sciences 77.1 (2011), pp. 142–153.
[Bur78] Arthur W Burks. "Book review: Charles S. Peirce, The new elements of mathematics". In: Bulletin of the American Mathematical Society 84.5 (1978), pp. 913–918.
[Cho56] Noam Chomsky. "Three models for the description of language". In: IRE Transactions on information theory 2.3 (1956), pp. 113–124.
[Chu41] Alonzo Church. The calculi of lambda-conversion. Princeton London: Princeton University Press H. Milford, Oxford University Press, 1941. isbn: 978-0-691-08394-0.
[CM00] Bruce Collier and James MacLachlan. Charles Babbage: And the engines of perfection. Oxford University Press, 2000.
[Coh08] Paul Cohen. Set theory and the continuum hypothesis. Mineola, N.Y: Dover Publications, 2008. isbn: 0486469212.
[Coh81] Danny Cohen. "On holy wars and a plea for peace". In: Computer 14.10 (1981), pp. 48–54.
[Coo66] SA Cook. "On the minimum computation time for multiplication". In: Doctoral dissertation, Harvard University Department of Mathematics, Cambridge, Mass. 1 (1966).
[Coo87] James W Cooley. "The re-discovery of the fast Fourier transform algorithm". In: Microchimica Acta 93.1-6 (1987), pp. 33–45.
[Cor+09] Thomas H. Cormen et al. Introduction to Algorithms, 3rd Edition. MIT Press, 2009. isbn: 978-0-262-03384-8. url: http://mitpress.mit.edu/books/introduction-algorithms.
[CR73] Stephen A Cook and Robert A Reckhow. "Time bounded random access machines". In: Journal of Computer and System Sciences 7.4 (1973), pp. 354–375.
[CRT06] Emmanuel J Candes, Justin K Romberg, and Terence Tao. "Stable signal recovery from incomplete and inaccurate measurements". In: Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 59.8 (2006), pp. 1207–1223.
[CT06] Thomas M. Cover and Joy A. Thomas. "Elements of information theory 2nd edition". In: Willey-Interscience: NJ (2006).
[Dau90] Joseph Warren Dauben. Georg Cantor: His mathematics and philosophy of the infinite. Princeton University Press, 1990.

[De 47] Augustus De Morgan. Formal logic: or, the calculus of inference, necessary and probable. Taylor and Walton, 1847.
[Don06] David L Donoho. "Compressed sensing". In: IEEE Transactions on information theory 52.4 (2006), pp. 1289–1306.
[DPV08] Sanjoy Dasgupta, Christos H. Papadimitriou, and Umesh V. Vazirani. Algorithms. McGraw-Hill, 2008. isbn: 978-0-07-352340-8.
[Ell10] Jordan Ellenberg. "Fill in the blanks: Using math to turn lo-res datasets into hi-res samples". In: Wired Magazine 18.3 (2010), pp. 501–509.
[Ern09] Elizabeth Ann Ernst. "Optimal combinational multi-level logic synthesis". Available on https://deepblue.lib.umich.edu/handle/2027.42/62373. PhD thesis. University of Michigan, 2009.
[Fle18] Margaret M. Fleck. Building Blocks for Theoretical Computer Science. Online book, available at http://mfleck.cs.illinois.edu/building-blocks/. 2018.
[Für07] Martin Fürer. "Faster integer multiplication". In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007. 2007, pp. 57–66.
[FW93] Michael L Fredman and Dan E Willard. "Surpassing the information theoretic bound with fusion trees". In: Journal of computer and system sciences 47.3 (1993), pp. 424–436.
[GKS17] Daniel Günther, Ágnes Kiss, and Thomas Schneider. "More efficient universal circuit constructions". In: International Conference on the Theory and Application of Cryptology and Information Security. Springer. 2017, pp. 443–470.
[Gom+08] Carla P Gomes et al. "Satisfiability solvers". In: Foundations of Artificial Intelligence 3 (2008), pp. 89–134.
[Gra05] Judith Grabiner. The origins of Cauchy's rigorous calculus. Mineola, N.Y: Dover Publications, 2005. isbn: 9780486438153.
[Gra83] Judith V Grabiner. "Who gave you the epsilon? Cauchy and the origins of rigorous calculus". In: The American Mathematical Monthly 90.3 (1983), pp. 185–194.
[Hag98] Torben Hagerup. "Sorting and searching on the word RAM". In: Annual Symposium on Theoretical Aspects of Computer Science. Springer. 1998, pp. 366–398.

[Hal60] Paul R Halmos. Naive set theory. Republished in 2017 by Courier Dover Publications. 1960.
[Har41] GH Hardy. "A Mathematician's Apology". In: (1941).
[HJB85] Michael T Heideman, Don H Johnson, and C Sidney Burrus. "Gauss and the history of the fast Fourier transform". In: Archive for history of exact sciences 34.3 (1985), pp. 265–277.
[HMU14] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation. Harlow, Essex: Pearson Education, 2014. isbn: 1292039051.
[Hof99] Douglas Hofstadter. Gödel, Escher, Bach: an eternal golden braid. New York: Basic Books, 1999. isbn: 0465026567.
[Hol01] Jim Holt. "The Ada perplex: how Byron's daughter came to be celebrated as a cybervisionary". In: The New Yorker 5 (2001), pp. 88–93.
[Hol18] Jim Holt. When Einstein walked with Gödel: excursions to the edge of thought. New York: Farrar, Straus and Giroux, 2018. isbn: 0374146705.
[HU69] John E Hopcroft and Jeffrey D Ullman. "Formal languages and their relation to automata". In: (1969).
[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979. isbn: 0-201-02988-X.
[HV19] David Harvey and Joris Van Der Hoeven. "Integer multiplication in time O(n log n)". Working paper or preprint. Mar. 2019. url: https://hal.archives-ouvertes.fr/hal-02070778.
[Joh12] David S Johnson. "A brief history of NP-completeness, 1954–2012". In: Documenta Mathematica (2012), pp. 359–376.
[Juk12] Stasys Jukna. Boolean function complexity: advances and frontiers. Vol. 27. Springer Science & Business Media, 2012.
[Kar+97] David R. Karger et al. "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". In: Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, El Paso, Texas, USA, May 4-6, 1997. 1997, pp. 654–663. doi: 10.1145/258533.258660. url: https://doi.org/10.1145/258533.258660.

[Kar95] Anatolii Alexeevich Karatsuba. "The complexity of computations". In: Proceedings of the Steklov Institute of Mathematics-Interperiodica Translation 211 (1995), pp. 169–183.
[Kle91] Israel Kleiner. "Rigor and Proof in Mathematics: A Historical Perspective". In: Mathematics Magazine 64.5 (1991), pp. 291–314. issn: 0025570X, 19300980. url: http://www.jstor.org/stable/2690647.
[Kle99] Jon M Kleinberg. "Authoritative sources in a hyperlinked environment". In: Journal of the ACM (JACM) 46.5 (1999), pp. 604–632.
[Koz97] Dexter Kozen. Automata and computability. New York: Springer, 1997. isbn: 978-3-642-85706-5.
[KT06] Jon M. Kleinberg and Éva Tardos. Algorithm design. Addison-Wesley, 2006. isbn: 978-0-321-37291-8.
[Kun18] Jeremy Kun. A programmer's introduction to mathematics. Middletown, DE: CreateSpace Independent Publishing Platform, 2018. isbn: 1727125452.
[Liv05] Mario Livio. The equation that couldn't be solved: how mathematical genius discovered the language of symmetry. New York: Simon & Schuster, 2005. isbn: 0743258207.
[LLM18] Eric Lehman, Thomson F. Leighton, and Albert R. Meyer. Mathematics for Computer Science. 2018.
[LMS16] Helger Lipmaa, Payman Mohassel, and Seyed Saeed Sadeghian. "Valiant's Universal Circuit: Improvements, Implementation, and Applications." In: IACR Cryptology ePrint Archive 2016 (2016), p. 17.
[Lup58] O Lupanov. "A circuit synthesis method". In: Izv. Vuzov, Radiofizika 1.1 (1958), pp. 120–130.
[Lup84] Oleg B. Lupanov. Asymptotic complexity bounds for control circuits. In Russian. MSU, 1984.
[Lus+08] Michael Lustig et al. "Compressed sensing MRI". In: IEEE signal processing magazine 25.2 (2008), pp. 72–82.
[Lüt02] Jesper Lützen. "Between Rigor and Applications: Developments in the Concept of Function in Mathematical Analysis". In: The Cambridge History of Science. Ed. by Mary Jo Nye. Vol. 5. The Cambridge History of Science. Cambridge University Press, 2002, pp. 468–487. doi: 10.1017/CHOL9780521571999.026.

[LZ19] Harry Lewis and Rachel Zax. Essential Discrete Mathematics for Computer Science. Princeton University Press, 2019. isbn: 9780691190617.
[Maa85] Wolfgang Maass. “Combinatorial lower bound arguments for deterministic and nondeterministic Turing machines”. In: Transactions of the American Mathematical Society 292.2 (1985), pp. 675–693.
[Mer07] N. David Mermin. Quantum computer science: an introduction. Cambridge University Press, 2007.
[MM11] Cristopher Moore and Stephan Mertens. The nature of computation. Oxford University Press, 2011.
[MP43] Warren S. McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity”. In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, 1995.
[MU17] Michael Mitzenmacher and Eli Upfal. Probability and computing: randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
[Neu45] John von Neumann. “First Draft of a Report on the EDVAC”. In: (1945). Reprinted in the IEEE Annals of the History of Computing, 1993.
[NS05] Noam Nisan and Shimon Schocken. The elements of computing systems: building a modern computer from first principles. MIT Press, 2005.
[Pag+99] Lawrence Page et al. The PageRank citation ranking: Bringing order to the web. Tech. rep. Stanford InfoLab, 1999.
[Pie02] Benjamin Pierce. Types and programming languages. Cambridge, Mass.: MIT Press, 2002. isbn: 0262162091.
[Ric53] Henry Gordon Rice. “Classes of recursively enumerable sets and their decision problems”. In: Transactions of the American Mathematical Society 74.2 (1953), pp. 358–366.
[Rog96] Yurii Rogozhin. “Small universal Turing machines”. In: Theoretical Computer Science 168.2 (1996), pp. 215–240.
[Ros19] Kenneth Rosen. Discrete mathematics and its applications. New York, NY: McGraw-Hill, 2019. isbn: 125967651X.
[RS59] Michael O. Rabin and Dana Scott. “Finite automata and their decision problems”. In: IBM Journal of Research and Development 3.2 (1959), pp. 114–125.

[Sav98] John E. Savage. Models of computation. Vol. 136. Available electronically at http://cs.brown.edu/people/jsavage/book/. Addison-Wesley, Reading, MA, 1998.
[Sch05] Alexander Schrijver. “On the history of combinatorial optimization (till 1960)”. In: Handbooks in Operations Research and Management Science 12 (2005), pp. 1–68.
[Sha38] Claude E. Shannon. “A symbolic analysis of relay and switching circuits”. In: Transactions of the American Institute of Electrical Engineers 57.12 (1938), pp. 713–723.
[Sha79] Adi Shamir. “Factoring numbers in O(log n) arithmetic steps”. In: Information Processing Letters 8.1 (1979), pp. 28–31.
[She03] Saharon Shelah. “Logical dreams”. In: Bulletin of the American Mathematical Society 40.2 (2003), pp. 203–228.
[She13] Henry Maurice Sheffer. “A set of five independent postulates for Boolean algebras, with application to logical constants”. In: Transactions of the American Mathematical Society 14.4 (1913), pp. 481–488.
[She16] Margot Shetterly. Hidden figures: the American dream and the untold story of the Black women mathematicians who helped win the space race. New York, NY: William Morrow, 2016. isbn: 9780062363602.
[Sin97] Simon Singh. Fermat’s enigma: the quest to solve the world’s greatest mathematical problem. New York: Walker, 1997. isbn: 0385493622.
[Sip97] Michael Sipser. Introduction to the theory of computation. PWS Publishing Company, 1997. isbn: 978-0-534-94728-6.
[Sob17] Dava Sobel. The Glass Universe: How the Ladies of the Harvard Observatory Took the Measure of the Stars. New York: Penguin Books, 2017. isbn: 9780143111344.
[Sol14] Daniel Solow. How to read and do proofs: an introduction to mathematical thought processes. Hoboken, New Jersey: John Wiley & Sons, Inc., 2014. isbn: 9781118164020.
[SS71] Arnold Schönhage and Volker Strassen. “Schnelle Multiplikation grosser Zahlen”. In: Computing 7.3-4 (1971), pp. 281–292.
[Ste87] Dorothy Stein. Ada: a life and a legacy. Cambridge, Mass.: MIT Press, 1987. isbn: 0262691167.
[Str69] Volker Strassen. “Gaussian elimination is not optimal”. In: Numerische Mathematik 13.4 (1969), pp. 354–356.

[Swa02] Doron Swade. The difference engine: Charles Babbage and the quest to build the first computer. New York: Penguin Books, 2002. isbn: 0142001449.
[Tho84] Ken Thompson. “Reflections on trusting trust”. In: Communications of the ACM 27.8 (1984), pp. 761–763.
[Too63] Andrei L. Toom. “The complexity of a scheme of functional elements realizing the multiplication of integers”. In: Soviet Mathematics Doklady. Vol. 3. 4. 1963, pp. 714–716.
[Tur37] Alan M. Turing. “On computable numbers, with an application to the Entscheidungsproblem”. In: Proceedings of the London Mathematical Society 2.1 (1937), pp. 230–265.
[Vad+12] Salil P. Vadhan et al. “Pseudorandomness”. In: Foundations and Trends® in Theoretical Computer Science 7.1–3 (2012), pp. 1–336.
[Val76] Leslie G. Valiant. “Universal circuits (preliminary report)”. In: Proceedings of the eighth annual ACM symposium on Theory of computing. ACM. 1976, pp. 196–203.
[Wan57] Hao Wang. “A variant to Turing’s theory of computing machines”. In: Journal of the ACM (JACM) 4.1 (1957), pp. 63–92.
[Weg87] Ingo Wegener. The complexity of Boolean functions. Vol. 1. B. G. Teubner, Stuttgart, 1987.
[Wer74] Paul J. Werbos. “Beyond regression: New tools for prediction and analysis in the behavioral sciences”. Ph.D. thesis. Harvard University, Cambridge, MA, 1974.
[Wig19] Avi Wigderson. Mathematics and Computation. Draft available on https://www.math.ias.edu/avi/book. Princeton University Press, 2019.
[Wil09] Ryan Williams. “Finding paths of length k in O*(2^k) time”. In: Information Processing Letters 109.6 (2009), pp. 315–318.
[WN09] Damien Woods and Turlough Neary. “The complexity of small universal Turing machines: A survey”. In: Theoretical Computer Science 410.4-5 (2009), pp. 443–450.
[WR12] Alfred North Whitehead and Bertrand Russell. Principia mathematica. Vol. 2. Cambridge University Press, 1912.