Distributed computing problems
From Wikipedia, the free encyclopedia

Contents

1 Atomic broadcast
  1.1 References

2 Atomic commit
  2.1 Usage
  2.2 Database systems
  2.3 Revision control
  2.4 Atomic commit convention
  2.5 See also
  2.6 References

3 Automatic vectorization
  3.1 Background
  3.2 Guarantees
    3.2.1 Data dependencies
    3.2.2 Data precision
  3.3 Theory
    3.3.1 Building the dependency graph
    3.3.2 Clustering
    3.3.3 Detecting idioms
  3.4 General framework
  3.5 Run-time vs. compile-time
  3.6 Techniques
    3.6.1 Loop-level automatic vectorization
    3.6.2 Basic block level automatic vectorization
    3.6.3 In the presence of control flow
    3.6.4 Reducing vectorization overhead in the presence of control flow
  3.7 See also
  3.8 References

4 Big data
  4.1 Definition
  4.2 Characteristics
  4.3 Architecture
  4.4 Technologies
  4.5 Applications
    4.5.1 Government
    4.5.2 International development
    4.5.3 Manufacturing
    4.5.4 Media
    4.5.5 Private sector
    4.5.6 Science
  4.6 Research activities
    4.6.1 Sampling Big Data
  4.7 Critique
    4.7.1 Critiques of the big data paradigm
    4.7.2 Critiques of big data execution
  4.8 See also
  4.9 References
  4.10 Further reading
  4.11 External links

5 Big Memory
  5.1 References
  5.2 External links

6 Brooks–Iyengar algorithm
  6.1 Background
  6.2 Algorithm
  6.3 Algorithm characteristics
  6.4 See also
  6.5 References

7 Byzantine fault tolerance
  7.1 Origin
  7.2 Known examples of Byzantine failures
  7.3 Early solutions
  7.4 Practical Byzantine fault tolerance
  7.5 Byzantine fault tolerance software
  7.6 Byzantine fault tolerance in practice
  7.7 See also
  7.8 References
  7.9 External links

8 Clock synchronization
  8.1 Problems
  8.2 Solutions
    8.2.1 Cristian's algorithm
    8.2.2 Berkeley algorithm
    8.2.3 Network Time Protocol
    8.2.4 Clock Sampling Mutual Network Synchronization
    8.2.5 Precision Time Protocol
    8.2.6 Reference broadcast synchronization
    8.2.7 Reference Broadcast Infrastructure Synchronization
    8.2.8 Global Positioning System
  8.3 See also
  8.4 External Links
  8.5 References

9 Consensus (computer science)
  9.1 Problem description
  9.2 Models of computation
  9.3 Equivalency of agreement problems
    9.3.1 Terminating Reliable Broadcast
    9.3.2 Consensus
    9.3.3 Weak Interactive Consistency
  9.4 Solvability results for some agreement problems
  9.5 Some consensus protocols
  9.6 Applications of consensus protocols
  9.7 See also
  9.8 References
  9.9 Further reading

10 Data lineage
  10.1 Case for Data Lineage
    10.1.1 Big Data Debugging
    10.1.2 Challenges in Big Data Debugging
    10.1.3 Proposed Solution
  10.2 Data Provenance
  10.3 Lineage Capture
  10.4 Active vs Lazy Lineage
  10.5 Actors
  10.6 Associations
  10.7 Architecture
  10.8 Data flow Reconstruction
    10.8.1 Association tables
    10.8.2 Association Graph
    10.8.3 Topological Sorting
  10.9 Tracing & Replay
  10.10 Challenges
    10.10.1 Scalability
    10.10.2 Fault tolerance
    10.10.3 Black-box operators
    10.10.4 Efficient tracing
    10.10.5 Sophisticated replay
    10.10.6 Anomaly detection
  10.11 See also
  10.12 References

11 Deadlock
  11.1 Examples
  11.2 Necessary conditions
  11.3 Avoiding database deadlocks
  11.4 Deadlock handling
    11.4.1 Ignoring deadlock
    11.4.2 Detection
    11.4.3 Prevention
    11.4.4 Avoidance
  11.5 Livelock
  11.6 Distributed deadlock
  11.7 See also
  11.8 References
  11.9 Further reading
  11.10 External links

12 Distributed
  12.1 See also
  12.2 References

13 Distributed
  13.1 History
  13.2 Definition and terminology
    13.2.1 Vertex coloring
    13.2.2 Chromatic polynomial
    13.2.3 Edge coloring
    13.2.4 Total coloring
    13.2.5 Unlabeled coloring
  13.3 Properties
    13.3.1 Bounds on the chromatic number
    13.3.2 Lower bounds on the chromatic number
    13.3.3 Graphs with high chromatic number
    13.3.4 Bounds on the chromatic index
    13.3.5 Other properties
    13.3.6 Open problems
  13.4 Algorithms
    13.4.1 Polynomial time
    13.4.2 Exact algorithms
    13.4.3 Contraction
    13.4.4 Greedy coloring
    13.4.5 Parallel and distributed algorithms
    13.4.6 Decentralized algorithms
    13.4.7 Computational complexity
  13.5 Applications
    13.5.1 Scheduling
    13.5.2 Register allocation
    13.5.3 Other applications
  13.6 Other colorings
    13.6.1 Ramsey theory
    13.6.2 Other colorings
  13.7 See also
  13.8 Notes
  13.9 References
  13.10 External links

14 Embarrassingly parallel
  14.1 Etymology of the term
  14.2 Examples
  14.3 Implementations
  14.4 See also
  14.5 References
  14.6 External links

15 Failure semantics
  15.1 Types of errors
  15.2 References

16 Fallacies of distributed computing
  16.1 The fallacies
  16.2 Effects of the fallacies
  16.3 History
  16.4 See also
  16.5 References
  16.6 External links

17 Global concurrency control
  17.1 See also

18 Happened-before
  18.1 See also
  18.2 References

19 Leader election
  19.1 Definition
  19.2 Algorithms
    19.2.1 Leader election in rings
    19.2.2 Leader Election in Mesh
    19.2.3 Election in Hypercubes
    19.2.4 Election in complete networks
    19.2.5 Universal leader election techniques
  19.3 Applications
    19.3.1 Radio networks
  19.4 See also
  19.5 References

20 Quantum Byzantine agreement
  20.1 Introduction
  20.2 Byzantine Failure and Resilience
  20.3 Sketch of the Algorithm
    20.3.1 Verifiable secret sharing
    20.3.2 The Fail-stop protocol
    20.3.3 The Byzantine protocol
  20.4 Remarks
  20.5 References

21 Race condition
  21.1 Electronics
    21.1.1 Critical and non-critical race conditions
    21.1.2 Static, dynamic, and essential race conditions
  21.2 Software
    21.2.1 Example
    21.2.2 File systems
    21.2.3 Networking
    21.2.4 Life-critical systems
    21.2.5 Computer security
  21.3 Examples outside of Computing
    21.3.1 Biology
  21.4 See also
  21.5 References
  21.6 External links

22 Self-stabilization
  22.1 History
  22.2 Overview
    22.2.1 Efficiency improvements
    22.2.2 Time complexity
  22.3 Definition
  22.4 Related work
  22.5 References
  22.6 External links

23 Serializability
  23.1 Database transaction
  23.2 Correctness
    23.2.1 Correctness - serializability
    23.2.2 Relaxing serializability
  23.3 View and conflict serializability
  23.4 Enforcing conflict serializability
    23.4.1 Testing conflict serializability
    23.4.2 Common mechanism - SS2PL
    23.4.3 Other enforcing techniques
  23.5 Distributed serializability
    23.5.1 Overview
  23.6 See also
  23.7 Notes
  23.8 References

24 Shared register
  24.1 Classification
  24.2 Constructions
    24.2.1 Implementing an atomic SWSR register in a message passing system
    24.2.2 Implementing a SWMR register from SWSR registers
    24.2.3 Implementing a MWMR register from a SW Snapshot object
  24.3 See also
  24.4 References

25 Shared snapshot objects
  25.1 General
  25.2 Implementation
    25.2.1 Basic swmr snapshot algorithm
    25.2.2 Single-Writer Multi-Reader implementation by Afek et al.
    25.2.3 Multi-Writer Multi-Reader implementation by Afek et al.
  25.3 Complexity
  25.4 Applications
  25.5 See also
  25.6 References

26 State machine replication
  26.1 Problem definition
    26.1.1 Distributed services
    26.1.2 State machine
    26.1.3 Fault Tolerance
  26.2 The State Machine Approach
    26.2.1 Ordering Inputs
    26.2.2 Sending Outputs
    26.2.3 System Failure
    26.2.4 Auditing and Failure Detection
  26.3 Appendix: Extensions
    26.3.1 Input Log
    26.3.2 Checkpoints
    26.3.3 Reconfiguration
    26.3.4 Quitting
    26.3.5 Joining
    26.3.6 State Transfer
    26.3.7 Leader Election (for Paxos)
  26.4 Historical background
  26.5 References
  26.6 External links

27 Superstabilization
  27.1 Definitions
  27.2 References

28 Terminating Reliable Broadcast
  28.1 Problem description
  28.2 References

29 Timing failure

30 Transitive data skew

31 Two Generals' Problem
  31.1 Definition
  31.2 Illustrating the problem
  31.3 Proof
    31.3.1 For deterministic protocols with a fixed number of messages
    31.3.2 For nondeterministic and variable-length protocols
  31.4 Engineering approaches
    31.4.1 Using Bitcoin
  31.5 History
  31.6 References

32 Uniform consensus
  32.1 References

33 Version vector
  33.1 Other Mechanisms
  33.2 References

34 Weak coloring
  34.1 Properties
  34.2 Applications
  34.3 References
  34.4 Text and image sources, contributors, and licenses
    34.4.1 Text
    34.4.2 Images
    34.4.3 Content license

Chapter 1

Atomic broadcast

In distributed systems, atomic broadcast or total order broadcast is a broadcast messaging protocol that ensures that messages are received reliably and in the same order by all participants.[1] Distributed systems are ones where each computer runs independently toward a common goal, and as a result, designing a successful atomic broadcast system is a significant challenge.[2]

Atomic broadcast is a fundamental problem in distributed computing. A successful system must be a reliable broadcast. In addition, such a system must satisfy the total order property: if computer A sends message 1 first and message 2 second, then every computer B that receives both messages must receive message 1 before message 2. Atomic broadcasts are simple when computers are correct, meaning that they never fail. However, real computers are faulty and do fail, and even when failures are only temporary, this is where the challenge lies.[2]

The following properties are usually required from an atomic broadcast protocol. Validity means that if a correct participant broadcasts a message, then all correct participants will eventually receive it. Uniform agreement means that if a participant delivers a message, then all correct participants will eventually deliver it as well. Uniform integrity means that any given message is delivered by each participant at most once, and only if it was previously broadcast. The definitions of validity and integrity are sometimes formulated differently; for example, Michel Raynal et al.[3] and Schiper et al.[4] define the validity property of atomic broadcast slightly differently, but the main requirement that messages are delivered in the correct order remains.

A number of protocols have been proposed for performing atomic broadcast, under various assumptions about the network, failure models, availability of hardware support for multicast, and so forth.[1] One widely popular technology in which atomic broadcast is available as a primitive is virtual synchrony, a computing model used for fault tolerance and data replication in many real-world systems and products.
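As a concrete, if drastically simplified, illustration of the total order property, the following C sketch simulates a sequencer-based total order broadcast inside a single process: one counter stamps every message, and each node delivers messages in stamp order. The names (node_t, tob_broadcast) are invented for this sketch, which deliberately ignores everything that makes the real problem hard, such as message loss and a crashing sequencer.

#include <stdio.h>

#define NODES 3
#define MAX_MSGS 16

/* Each node records the messages it has delivered, in delivery order.
   Total order holds if all nodes end up with identical sequences. */
typedef struct {
    int delivered[MAX_MSGS];
    int count;
} node_t;

static node_t nodes[NODES];
static int next_seq = 0;   /* the "sequencer": assigns one global position per message */

/* Broadcast: stamp the message, then let every node deliver it at that position. */
static void tob_broadcast(int msg)
{
    int seq = next_seq++;
    for (int n = 0; n < NODES; n++) {
        nodes[n].delivered[seq] = msg;
        nodes[n].count = seq + 1;
    }
}

int main(void)
{
    tob_broadcast(101);   /* message 1 */
    tob_broadcast(202);   /* message 2 */

    /* Every node prints 101 before 202: the total order property. */
    for (int n = 0; n < NODES; n++) {
        printf("node %d:", n);
        for (int i = 0; i < nodes[n].count; i++)
            printf(" %d", nodes[n].delivered[i]);
        printf("\n");
    }
    return 0;
}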

1.1 References

• Défago, X., Schiper, A., and Urbán, P. 2004. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Comput. Surv. 36, 4 (Dec. 2004), 372-421. DOI=10.1145/1041680.1041682 (alternate source)

[1] Défago et al. 2004

[2] Kshemkalyani, Ajay; Singhal, Mukesh (2008). Distributed Computing: Principles, Algorithms, and Systems (Google eBook). Cambridge University Press. pp. 583–585. ISBN 9781139470315.

[3] Rodrigues L., Raynal M.: Atomic Broadcast in Asynchronous Crash-Recovery Distributed Systems, ICDCS '00: Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000)

[4] Ekwall, R.; Schiper, A.: Solving Atomic Broadcast with Indirect Consensus. Dependable Systems and Networks, 2006. DSN 2006. International Conference on 2006.

Chapter 2

Atomic commit

In the field of computer science, an atomic commit is an operation that applies a set of distinct changes as a single operation. If the changes are applied, then the atomic commit is said to have succeeded. If there is a failure before the atomic commit can be completed, then all of the changes completed in the atomic commit are reversed. This ensures that the system is always left in a consistent state. The other key property, isolation, comes from their nature as atomic operations: isolation ensures that only one atomic commit is processed at a time. The most common uses of atomic commits are in database systems and revision control systems.

The problem with atomic commits is that they require coordination between multiple systems.[1] As computer networks are unreliable services, no algorithm can guarantee coordination with all systems, as proven in the Two Generals' Problem. As databases become more and more distributed, this coordination will increase the difficulty of making truly atomic commits.[2]

2.1 Usage

Atomic commits are essential for multi-step updates to data. This can be clearly shown in a simple example of a money transfer between two checking accounts.[3] This example is complicated by a transaction to check the balance of account Y during a transaction for transferring 100 dollars from account X to Y. To start, first 100 dollars is removed from account X. Second, 100 dollars is added to account Y. If the entire operation is not completed as one atomic commit, then several problems could occur. If the system fails in the middle of the operation, after removing the money from X and before adding it to Y, then 100 dollars has just disappeared. Another issue is that if the balance of Y is checked before the 100 dollars is added, the wrong balance for Y will be reported. With atomic commits neither of these cases can happen: in the first case, that of system failure, the atomic commit would be rolled back and the money returned to X; in the second case, the request for the balance of Y cannot occur until the atomic commit is fully completed. A sketch of this transfer appears below.
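The following minimal C sketch, assuming an in-memory account structure and a failure flag injected purely for illustration, shows the intent of the example above: either both steps take effect or neither does. The names (account_t, transfer_atomic) are illustrative and not taken from the article.

#include <stdio.h>
#include <stdbool.h>

typedef struct { const char *name; int balance; } account_t;

/* Apply both steps, or roll back to the snapshot taken at the start.
   'fail_midway' stands in for a crash or error between the two steps. */
static bool transfer_atomic(account_t *x, account_t *y, int amount, bool fail_midway)
{
    int old_x = x->balance;   /* snapshot for rollback */
    int old_y = y->balance;

    x->balance -= amount;     /* step 1: withdraw from X */
    if (fail_midway) {        /* failure before step 2 completes */
        x->balance = old_x;   /* roll back: no partial effect is visible */
        y->balance = old_y;
        return false;
    }
    y->balance += amount;     /* step 2: deposit to Y */
    return true;              /* commit succeeded */
}

int main(void)
{
    account_t x = { "X", 500 }, y = { "Y", 200 };

    transfer_atomic(&x, &y, 100, true);    /* simulated failure: balances unchanged */
    printf("after failed commit:  X=%d Y=%d\n", x.balance, y.balance);

    transfer_atomic(&x, &y, 100, false);   /* successful commit: both steps applied */
    printf("after committed xfer: X=%d Y=%d\n", x.balance, y.balance);
    return 0;
}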

2.2 Database systems

Atomic commits in database systems fulfil two of the key properties of ACID,[4] atomicity and consistency. Consistency is only achieved if each change in the atomic commit is consistent. As shown in the example, atomic commits are critical to multistep operations in databases.

Due to the modern hardware design of the physical disk on which the database resides, true atomic commits cannot exist. The smallest area that can be written to on disk is known as a sector. A single database entry may span several different sectors. Only one sector can be written at a time. This writing limit is why true atomic commits are not possible. After the database entries in memory have been modified they are queued up to be written to disk. This means the same problems identified in the example reappear. Any algorithmic solution to this problem will still encounter the Two Generals' Problem. The two-phase commit protocol and three-phase commit protocol attempt to solve this and some of the other problems associated with atomic commits.


The two-phase commit protocol requires a coordinator to maintain all the information needed to recover the original state of the database if something goes wrong. As the name indicates there are two phases, voting and commit.

During the voting phase each node writes the changes in the atomic commit to its own disk. The nodes then report their status to the coordinator. If any node does not report to the coordinator, or its status message is lost, the coordinator assumes the node's write failed. Once all of the nodes have reported to the coordinator the second phase begins. During the commit phase the coordinator sends a commit message to each of the nodes to record in their individual logs. Until this message is added to a node's log, any changes made will be recorded as incomplete. If any of the nodes reported a failure the coordinator will instead send a rollback message. This will remove any changes the nodes have written to disk.[5][6]

The three-phase commit protocol seeks to remove the main problem with the two-phase commit protocol: if a coordinator and another node fail at the same time during the commit phase, neither can tell what action should occur. To solve this problem a third phase is added to the protocol. The prepare-to-commit phase occurs after the voting phase and before the commit phase.

In the voting phase, similar to the two-phase commit, the coordinator requests that each node is ready to commit. If any node fails, the coordinator will time out while waiting for the failed node. If this happens the coordinator sends an abort message to every node. The same action will be undertaken if any of the nodes return a failure message. Upon receiving success messages from each node in the voting phase, the prepare-to-commit phase begins. During this phase the coordinator sends a prepare message to each node. Each node must acknowledge the prepare message and reply. If any reply is missed, or any node returns that it is not prepared, the coordinator sends an abort message. Any node that does not receive a prepare message before the timeout expires aborts the commit. After all nodes have replied to the prepare message the commit phase begins. In this phase the coordinator sends a commit message to each node. When each node receives this message it performs the actual commit. If the commit message does not reach a node because the message is lost or the coordinator fails, the node will perform the commit when its timeout expires. If the coordinator fails, upon recovery it will send a commit message to each node.[7]
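As a rough sketch of just the coordinator's decision rule in the two-phase commit protocol described above, the fragment below tallies simulated votes and chooses commit only if every participant voted yes; a missing or lost vote is treated as a failed write. The enum values and function names are invented for illustration, and this is not a real 2PC implementation (there is no logging, networking, or recovery).

#include <stdio.h>

enum vote { VOTE_MISSING = 0, VOTE_YES, VOTE_NO };
enum decision { DECIDE_ROLLBACK, DECIDE_COMMIT };

/* Voting-phase result -> coordinator decision.
   Any missing or negative vote forces a rollback of the whole commit. */
static enum decision coordinator_decide(const enum vote votes[], int n)
{
    for (int i = 0; i < n; i++)
        if (votes[i] != VOTE_YES)
            return DECIDE_ROLLBACK;
    return DECIDE_COMMIT;
}

int main(void)
{
    enum vote all_yes[3] = { VOTE_YES, VOTE_YES, VOTE_YES };
    enum vote one_lost[3] = { VOTE_YES, VOTE_MISSING, VOTE_YES };

    printf("all yes  -> %s\n",
           coordinator_decide(all_yes, 3) == DECIDE_COMMIT ? "commit" : "rollback");
    printf("one lost -> %s\n",
           coordinator_decide(one_lost, 3) == DECIDE_COMMIT ? "commit" : "rollback");
    return 0;
}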

2.3 Revision control

The other area where atomic commits are employed is revision control systems. This allows multiple modified files to be uploaded and merged into the source. Most revision control systems support atomic commits (CVS, VSS and IBM Rational ClearCase (when in UCM mode)[8] are the major exceptions). Like database systems, commits may fail due to a problem in applying the changes on disk. Unlike a database system, which overwrites any existing data with the data from the changeset, revision control systems merge the modification in the changeset into the existing data. If the system cannot complete the merge then the commit will be rejected. If a merge cannot be resolved by the revision control software it is up to the user to merge the changes. For revision control systems that support atomic commits, this failure in merging would result in a failed commit. Atomic commits are crucial for maintaining a consistent state in the repository. Without atomic commits some changes a developer has made may be applied but other changes may not. If these changes have any kind of coupling this will result in errors. Atomic commits prevent this by not applying partial changes that would create these errors. Note that if the changes already contain errors, atomic commits offer no fix.

2.4 Atomic commit convention

When using a revision control system a common convention is to use small commits. These are sometimes referred to as atomic commits as they (ideally) only affect a single aspect of the system. These atomic commits allow for greater understandability, less effort to roll back changes, and easier bug identification.[9]

The greater understandability comes from the small size and focused nature of the commit. It is much easier to understand what is changed and the reasoning behind the changes if you are only looking for one kind of change. This becomes especially important when making format changes to the source code. If format and functional changes are combined it becomes very difficult to identify useful changes. Imagine if the spacing in a file is changed from using tabs to three spaces: every tab in the file will show as having been changed. This becomes critical if some functional changes are also made, as a reviewer may simply not see the functional changes.[10][11]

If only atomic commits are made then commits that introduce errors become much simpler to identify. You are not required to look through every commit to see if it was the cause of the error; only the commits dealing with that functionality need to be examined. If the error is to be rolled back, atomic commits again make the job much simpler. Instead of having to revert to the offending revision and remove the changes manually before integrating any later changes, the developer can simply revert any changes in the identified commit. This also reduces the risk of a developer accidentally removing unrelated changes that happened to be in the same commit.

Atomic commits also allow bug fixes to be easily reviewed if only a single bug fix is committed at a time. Instead of having to check multiple potentially unrelated files the reviewer must only check files and changes that directly impact the bug being fixed. This also means that bug fixes can be easily packaged for testing as only the changes that fix the bug are in the commit.

2.5 See also

• Two-phase commit protocol

• Three-phase commit protocol

• Commit (data management)

• Atomic operation

2.6 References

[1] Bocchi, Wischik (2004). A Process Calculus of Atomic Commit.

[2] Garcia-Molina, Hector; Ullman, Jeff; Widom, Jennifer (2009). Database Systems The Complete Book. Prentice Hall. pp. 1008–1009.

[3] Garcia-Molina, Hector; Ullman, Jeff; Widom, Jennifer (2009). Database Systems The Complete Book. Prentice Hall. p. 299.

[4] Elmasri, Ramez (2006). Fundamentals of Database Systems 5th Edition. Addison Wesley. p. 620.

[5] Elmasri, Ramez (2006). Fundamentals of Database Systems 5th Edition. Addison Wesley. p. 688.

[6] Bernstein, Philip A.; Hadzilacos, Vassos; Goodman, Nathan (1987). “Chapter 7”. Concurrency Control and Recovery in Database Systems. Addison Wesley Publishing Company.

[7] Gaddam, Srinivas R. Three-Phase Commit Protocol.

[8] http://pic.dhe.ibm.com/infocenter/cchelp/v8r0m0/topic/com.ibm.rational.clearcase.ccrc.help.doc/topics/u_checkin.htm? resultof=%22%61%74%6f%6d%69%63%22%20%22%61%74%6f%6d%22%20%22%63%6f%6d%6d%69%74%22% 20

[9] “Subversion Best Practices”. Apache.

[10] Barney, Boisvert. Atomic Commits to Version Control.

[11] “The Benefits of Small Commits”. Conifer Systems.

Chapter 3

Automatic vectorization

For vectorization as a programming idiom, see Array programming.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions:

c1 = a1 + b1 c2 = a2 + b2 c3 = a3 + b3 c4 = a4 + b4

However, in most programming languages one typically writes loops that sequentially perform additions of many numbers. Here is an example of such a loop, written in C:

for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];

A vectorizing compiler transforms such loops into sequences of vector operations. These vector operations perform additions on length-four (in our example) blocks of elements from the arrays a, b and c. Automatic vectorization is a major research topic in computer science.
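As an illustration of what such a transformation can produce, the hand-written fragment below uses x86 SSE intrinsics to add four single-precision floats per instruction. It is only a sketch of the idea, assuming the arrays hold floats, n is a multiple of 4, and the regions do not overlap; it is not the output of any particular compiler.

#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* c[i] = a[i] + b[i], four elements per iteration. */
void add_vectorized(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load a[i..i+3] */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load b[i..i+3] */
        __m128 vc = _mm_add_ps(va, vb);    /* four additions at once */
        _mm_storeu_ps(&c[i], vc);          /* store c[i..i+3] */
    }
}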

3.1 Background

Early computers generally had one logic unit that sequentially executed one instruction on one operand pair at a time. Computer programs and programming languages were accordingly designed to execute sequentially. Modern computers can do many things at once. Many optimizing compilers feature auto-vectorization, a compiler feature where particular parts of sequential programs are transformed into equivalent parallel ones, to produce code which makes good use of a vector processor. It would be much simpler for a compiler to produce such efficient code for a programming language intended for use on a vector processor, but, as much real-world code is sequential, the optimization is of great utility.

Loop vectorization converts procedural loops that iterate over multiple pairs of data items and assigns a separate processing unit to each pair. Most programs spend most of their execution times within such loops. Vectorizing loops can lead to significant performance gains without programmer intervention, especially on large data sets. Vectorization can sometimes instead slow execution because of pipeline synchronization, data movement timing and other issues. Intel's MMX, SSE, AVX and Power Architecture's AltiVec and ARM's NEON instruction sets support such vectorized loops.

Many constraints prevent or hinder vectorization. Loop dependence analysis identifies loops that can be vectorized,


relying on the data dependence of the instructions inside loops.

3.2 Guarantees

Automatic vectorization, like any loop optimization or other compile-time optimization, must exactly preserve program behavior.

3.2.1 Data dependencies

All dependencies must be respected during execution to prevent incorrect results. In general, loop invariant dependencies and lexically forward dependencies can be easily vectorized, and lexically backward dependencies can be transformed into lexically forward dependencies. However, these transformations must be done safely, in order to ensure that the dependence between all statements remains true to the original. Cyclic dependencies must be processed independently of the vectorized instructions.
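The two loops below, written purely for illustration, show the distinction: the first carries only a lexically forward dependence within an iteration and can be vectorized, while the second has a cyclic, loop-carried dependence on the previous iteration and cannot be vectorized directly.

/* Vectorizable: b[i] is read and written only within iteration i,
   and statement 2 only uses the value statement 1 produced above it. */
void forward_dep(float *a, float *b, int n)
{
    for (int i = 0; i < n; i++) {
        b[i] = b[i] * 2.0f;   /* statement 1 */
        a[i] = b[i] + 1.0f;   /* statement 2: lexically forward dependence */
    }
}

/* Not directly vectorizable: iteration i needs the a[i-1] produced
   by iteration i-1 (a cyclic, loop-carried dependence). */
void cyclic_dep(float *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0f;
}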

3.2.2 Data precision

Integer precision (bit-size) must be kept during vector instruction execution. The correct vector instruction must be chosen based on the size and behavior of the internal integers. Also, with mixed integer types, extra care must be taken to promote/demote them correctly without losing precision. Special care must be taken with sign extension (because multiple integers are packed inside the same register) and during shift operations, or operations with carry bits that would otherwise be taken into account.

Floating-point precision must be kept as well, unless IEEE-754 compliance is turned off, in which case operations will be faster but the results may vary slightly. Large variations, even ignoring IEEE-754, usually indicate programmer error. The programmer can also force constants and loop variables to single precision (default is normally double) to execute twice as many operations per instruction.
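The fragment below, an assumed illustration rather than compiler output, shows the kind of widening the text describes: adding 16-bit values into 32-bit accumulators forces the vectorizer to sign-extend each 16-bit lane before the 32-bit vector add, here done explicitly with the SSE4.1 intrinsic _mm_cvtepi16_epi32. The function name accumulate_widen is invented.

#include <smmintrin.h>   /* SSE4.1: _mm_cvtepi16_epi32 */
#include <stdint.h>

/* acc[i] += src[i], where src is int16_t and acc is int32_t.
   Four lanes per step; assumes n is a multiple of 4. */
void accumulate_widen(int32_t *acc, const int16_t *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i s16 = _mm_loadl_epi64((const __m128i *)&src[i]); /* 4 x int16 */
        __m128i s32 = _mm_cvtepi16_epi32(s16);      /* sign-extend to 4 x int32 */
        __m128i a32 = _mm_loadu_si128((__m128i *)&acc[i]);
        _mm_storeu_si128((__m128i *)&acc[i], _mm_add_epi32(a32, s32));
    }
}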

3.3 Theory

To vectorize a program, the compiler's optimizer must first understand the dependencies between statements and realign them, if necessary. Once the dependencies are mapped, the optimizer must properly arrange the implementing instructions, changing appropriate candidates into vector instructions, which operate on multiple data items.

3.3.1 Building the dependency graph

The first step is to build the dependency graph, identifying which statements depend on which other statements. This involves examining each statement and identifying every data item that the statement accesses, mapping array access modifiers to functions and checking every access's dependency to all others in all statements. Alias analysis can be used to certify whether different variables access (or intersect) the same region of memory.

The dependency graph contains all local dependencies with distance not greater than the vector size. So, if the vector register is 128 bits, and the array type is 32 bits, the vector size is 128/32 = 4. All other non-cyclic dependencies should not invalidate vectorization, since there won't be any concurrent access in the same vector instruction.

Suppose the vector size is the same as 4 ints:

for (i = 0; i < 128; i++) {
    a[i] = a[i-16];   // 16 > 4, safe to ignore
    a[i] = a[i-1];    // 1 < 4, stays on dependency graph
}

3.3.2 Clustering

Using the graph, the optimizer can then cluster the strongly connected components (SCC) and separate vectorizable statements from the rest.

For example, consider a program fragment containing three statement groups inside a loop: (SCC1+SCC2), SCC3 and SCC4, in that order, in which only the second group (SCC3) can be vectorized. The final program will then contain three loops, one for each group, with only the middle one vectorized. The optimizer cannot join the first with the last without violating statement execution order, which would invalidate the necessary guarantees.

3.3.3 Detecting idioms

Some non-obvious dependencies can be further optimized based on specific idioms. For instance, the following self-data-dependency can be vectorized because the values on the right-hand side (RHS) are fetched and then stored into the left-hand value, so there is no way the data can change within the assignment.

a[i] = a[i] + a[i+1];

Self-dependence by scalars can be vectorized by variable elimination.

3.4 General framework

The general framework for loop vectorization is split into four stages:

• Prelude: Where the loop-independent variables are prepared to be used inside the loop. This normally involves moving them to vector registers with specific patterns that will be used in vector instructions. This is also the place to insert the run-time dependence check. If the check decides vectorization is not possible, branch to Cleanup.

• Loop(s): All vectorized (or not) loops, separated by SCCs clusters in order of appearance in the original code.

• Postlude: Return all loop-independent variables, inductions and reductions.

• Cleanup: Implement plain (non-vectorized) loops for iterations at the end of a loop that are not a multiple of the vector size or for when run-time checks prohibit vector processing.

3.5 Run-time vs. compile-time

Some vectorizations cannot be fully checked at compile time. Compile-time optimization requires an explicit array index. Library functions can also defeat optimization if the data they process is supplied by the caller. Even in these cases, run-time optimization can still vectorize loops on-the-fly. This run-time check is made in the prelude stage and directs the flow to vectorized instructions if possible, otherwise reverting to standard processing, depending on the variables that are being passed in registers or as scalar variables.

The following code can easily be vectorized at compile time, as it doesn't have any dependence on external parameters. Also, the language guarantees that neither array will occupy the same region in memory as any other variable, as they are local variables and live only on the execution stack.

int a[128];
int b[128];
// initialize b
for (i = 0; i < 128; i++)
    a[i] = b[i] + 5;

On the other hand, the code below has no information on memory positions, because the references are pointers and the memory they point to lives in the heap.

int *a = malloc(128*sizeof(int));
int *b = malloc(128*sizeof(int));
int *pa = a, *pb = b;   /* keep the original pointers so they can be freed */
// initialize b
for (i = 0; i < 128; i++, pa++, pb++)
    *pa = *pb + 5;
// ...
free(b);
free(a);

A quick run-time check on the addresses of both a and b, plus the loop iteration space (128), is enough to tell if the arrays overlap or not, thus revealing any dependencies. There exist some tools to dynamically analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.[1]
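The run-time guard for this case can be pictured as the hand-written check below: if the two 128-element regions are disjoint, the vector path is taken, otherwise the plain scalar loop runs. This is only a sketch of the idea with invented helper names; real compilers emit their own, more general overlap tests.

#include <stdint.h>

/* Stand-ins for the two code paths the compiler would generate. */
static void add5_scalar(int *a, const int *b)
{
    for (int i = 0; i < 128; i++)
        a[i] = b[i] + 5;
}

static void add5_vector(int *a, const int *b)
{
    /* In real compiler output this body would use vector instructions;
       the observable effect is the same as the scalar loop. */
    for (int i = 0; i < 128; i++)
        a[i] = b[i] + 5;
}

/* The run-time guard: take the vector path only if the two
   128-int regions provably do not overlap. */
void add5_checked(int *a, const int *b)
{
    uintptr_t lo_a = (uintptr_t)a, hi_a = (uintptr_t)(a + 128);
    uintptr_t lo_b = (uintptr_t)b, hi_b = (uintptr_t)(b + 128);

    if (hi_a <= lo_b || hi_b <= lo_a)
        add5_vector(a, b);   /* disjoint: vectorization is safe */
    else
        add5_scalar(a, b);   /* possible aliasing: keep original semantics */
}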

3.6 Techniques

An example would be a program to multiply two vectors of numeric data. A scalar approach would be something like:

for (i = 0; i < 1024; i++)
    C[i] = A[i]*B[i];

This could be vectorized to look something like:

for (i = 0; i < 1024; i += 4)
    C[i:i+3] = A[i:i+3]*B[i:i+3];

Here, C[i:i+3] represents the four array elements from C[i] to C[i+3] and the vector processor can perform four operations for a single vector instruction. Since the four vector operations complete in roughly the same time as one scalar instruction, the vector approach can run up to four times faster than the original code. There are two distinct compiler approaches: one based on the conventional vectorization technique and the other based on loop unrolling.

3.6.1 Loop-level automatic vectorization

This technique, used for conventional vector machines, tries to find and exploit SIMD parallelism at the loop level. It consists of two major steps as follows.

1. Find an innermost loop that can be vectorized
2. Transform the loop and generate vector codes

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts. Once the loop is determined to be vectorizable, the loop is stripmined by the vector length and each scalar instruction within the loop body is replaced with the corresponding vector instruction. Below, the component transformations for this step are shown using the above example.

• After stripmining

for (i = 0; i < 1024; i += 4)
    for (ii = 0; ii < 4; ii++)
        C[i+ii] = A[i+ii]*B[i+ii];

• After loop distribution using temporary arrays

for (i = 0; i < 1024; i += 4) {
    for (ii = 0; ii < 4; ii++) tA[ii] = A[i+ii];
    for (ii = 0; ii < 4; ii++) tB[ii] = B[i+ii];
    for (ii = 0; ii < 4; ii++) tC[ii] = tA[ii]*tB[ii];
    for (ii = 0; ii < 4; ii++) C[i+ii] = tC[ii];
}

• After replacing with vector codes

for (i = 0; i < 1024; i += 4) {
    vA = vec_ld( &A[i] );
    vB = vec_ld( &B[i] );
    vC = vec_mul( vA, vB );
    vec_st( vC, &C[i] );
}

3.6.2 Basic block level automatic vectorization

This relatively new technique specifically targets modern SIMD architectures with short vector lengths.[2] Although loops can be unrolled to increase the amount of SIMD parallelism in basic blocks, this technique exploits SIMD parallelism within basic blocks rather than loops. The two major steps are as follows.

1. The innermost loop is unrolled by a factor of the vector length to form a large loop body.

2. Isomorphic scalar instructions (that perform the same operation) are packed into a vector instruction if dependencies do not prevent doing so.

To show step-by-step transformations for this approach, the same example is used again.

• After loop unrolling (by the vector length, assumed to be 4 in this case)

for (i = 0; i < 1024; i += 4) {
    sA0 = ld( &A[i+0] );
    sB0 = ld( &B[i+0] );
    sC0 = sA0 * sB0;
    st( sC0, &C[i+0] );
    ...
    sA3 = ld( &A[i+3] );
    sB3 = ld( &B[i+3] );
    sC3 = sA3 * sB3;
    st( sC3, &C[i+3] );
}

• After packing

for (i = 0; i < 1024; i += 4) {
    (sA0,sA1,sA2,sA3) = ld( &A[i+0:i+3] );
    (sB0,sB1,sB2,sB3) = ld( &B[i+0:i+3] );
    (sC0,sC1,sC2,sC3) = (sA0,sA1,sA2,sA3) * (sB0,sB1,sB2,sB3);
    st( (sC0,sC1,sC2,sC3), &C[i+0:i+3] );
}

• After code generation

for (i = 0; i < 1024; i += 4) {
    vA = vec_ld( &A[i] );
    vB = vec_ld( &B[i] );
    vC = vec_mul( vA, vB );
    vec_st( vC, &C[i] );
}

Here, sA1, sB1, ... represent scalar variables and vA, vB, and vC represent vector variables. Most automatically vectorizing commercial compilers use the conventional loop-level approach except the IBM XL Compiler,[3] which uses both.

3.6.3 In the presence of control flow

The presence of if-statements in the loop body requires the execution of instructions in all control paths to merge the multiple values of a variable. One general approach is to go through a sequence of code transformations: predication → vectorization (using one of the above methods) → remove vector predicates → remove scalar predicates.[4] The following code is used as an example to show these transformations:

for (i = 0; i < 1024; i++)
    if (A[i] > 0)
        C[i] = B[i];
    else
        D[i] = D[i-1];

• After predication

for (i = 0; i < 1024; i++) {
    P = A[i] > 0;
    NP = !P;
    C[i] = B[i];     (P)
    D[i] = D[i-1];   (NP)
}

where (P) denotes a predicate guarding the statement.

• After vectorization

for (i = 0; i < 1024; i += 4) {
    vP = A[i:i+3] > (0,0,0,0);
    vNP = vec_not(vP);
    C[i:i+3] = B[i:i+3];   (vP)
    (NP1,NP2,NP3,NP4) = vNP;
    D[i+3] = D[i+2];       (NP4)
    D[i+2] = D[i+1];       (NP3)
    D[i+1] = D[i];         (NP2)
    D[i] = D[i-1];         (NP1)
}

• After removing vector predicates

for (i = 0; i < 1024; i += 4) {
    vP = A[i:i+3] > (0,0,0,0);
    vNP = vec_not(vP);
    C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vP);
    (NP1,NP2,NP3,NP4) = vNP;
    D[i+3] = D[i+2];   (NP4)
    D[i+2] = D[i+1];   (NP3)
    D[i+1] = D[i];     (NP2)
    D[i] = D[i-1];     (NP1)
}

• After removing scalar predicates

for (i = 0; i < 1024; i += 4) {
    vP = A[i:i+3] > (0,0,0,0);
    vNP = vec_not(vP);
    C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vP);
    (NP1,NP2,NP3,NP4) = vNP;
    if (NP4) D[i+3] = D[i+2];
    if (NP3) D[i+2] = D[i+1];
    if (NP2) D[i+1] = D[i];
    if (NP1) D[i] = D[i-1];
}

3.6.4 Reducing vectorization overhead in the presence of control flow

Having to execute the instructions in all control paths in vector code has been one of the major factors that slow down the vector code with respect to the scalar baseline. The more complex the control flow becomes and the more instructions are bypassed in the scalar code the larger the vectorization overhead grows. To reduce this vectorization overhead, vector branches can be inserted to bypass vector instructions similar to the way scalar branches bypass scalar instructions.[5] Below, AltiVec predicates are used to show how this can be achieved.

• Scalar baseline (original code)

for (i = 0; i < 1024; i++) {
    if (A[i] > 0) {
        C[i] = B[i];
        if (B[i] < 0)
            D[i] = E[i];
    }
}

• After vectorization in the presence of control flow

for (i = 0; i < 1024; i += 4) {
    vPA = A[i:i+3] > (0,0,0,0);
    C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vPA);
    vT = B[i:i+3] < (0,0,0,0);
    vPB = vec_sel((0,0,0,0), vT, vPA);
    D[i:i+3] = vec_sel(D[i:i+3],E[i:i+3],vPB);
}

• After inserting vector branches

for (i = 0; i < 1024; i += 4)
    if (vec_any_gt(A[i:i+3],(0,0,0,0))) {
        vPA = A[i:i+3] > (0,0,0,0);
        C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vPA);
        vT = B[i:i+3] < (0,0,0,0);
        vPB = vec_sel((0,0,0,0), vT, vPA);
        if (vec_any_ne(vPB,(0,0,0,0)))
            D[i:i+3] = vec_sel(D[i:i+3],E[i:i+3],vPB);
    }

There are two things to note in the final code with vector branches. First, the predicate-defining instruction for vPA is also included within the body of the outer vector branch by using vec_any_gt. Second, the profitability of the inner vector branch for vPB depends on the conditional probability of vPB having false values in all fields given that vPA has false values in all fields.

Consider an example where the outer branch in the scalar baseline is always taken, bypassing most instructions in the loop body. The intermediate case above, without vector branches, executes all vector instructions. The final code, with vector branches, executes both the comparison and the branch in vector mode, potentially gaining performance over the scalar baseline.

3.7 See also

• Chaining (vector processing)

3.8 References

[1]

[2] Larsen, S.; Amarasinghe, S. (2000). “Proceedings of the ACM SIGPLAN conference on Programming language design and implementation”. ACM SIGPLAN Notices 35 (5): 145–156. doi:10.1145/358438.349320.

[3]

[4] Shin, J.; Hall, M. W.; Chame, J. (2005). “Proceedings of the international symposium on Code generation and optimization”. pp. 165–175. doi:10.1109/CGO.2005.33. ISBN 0-7695-2298-X.

[5] Shin, J. (2007). “Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques”. pp. 280–291. doi:10.1109/PACT.2007.41.

Chapter 4

Big data

This article is about large collections of data. For the graph database, see Graph database. For the band, see Big Data (band). Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.

Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.

Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reduction and reduced risk. Analysis of data sets can find new correlations, to “spot business trends, prevent diseases, combat crime and so on.”[1] Scientists, business executives, practitioners of media and advertising and governments alike regularly meet difficulties with large data sets in areas including search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4]

Growth of and Digitization of Global Information Storage Capacity

Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.[5][6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created.[9] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[10]

Work with big data is necessarily uncommon; most analysis is of “PC size” data, on a desktop PC or notebook[11] that can handle the available data set.

Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires “massively parallel software running on tens, hundreds, or even thousands of servers”.[12] What is considered “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. Thus, what is considered “big” one year becomes ordinary later. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”[13]

4.1 Definition

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[14] Big data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.[15]

In a 2001 research report[16] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this “3Vs” model for describing big data.[17] In 2012, Gartner updated its definition as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”[18] Additionally, a new V “Veracity” is added by some organizations to describe it.[19] Gartner’s definition of the 3Vs is still widely used, and in agreement with a consensual definition that states that “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”.[20] The 3Vs have been expanded to other complementary characteristics of big data:[21][22]

• Volume: big data doesn't sample. It just observes and tracks what happens • Velocity: big data is often available in real-time • Variety: big data draws from text, images, audio, video; plus it completes missing pieces through data fusion • Machine Learning: big data often doesn't ask why and simply detects patterns[23] • Digital footprint: big data is often a cost-free byproduct of digital interaction[22]

The growing maturity of the concept fosters a more sound difference between big data and Business Intelligence, regarding data and their use:[24]

• Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends etc.; • Big data uses inductive statistics and concepts from nonlinear system identification [25] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[26] to reveal relationships, dependencies and perform predictions of outcomes and behaviors.[25][27]

4.2 Characteristics

Big data can be described by the following characteristics:[21][22]

Volume The quantity of generated data is important in this context. The size of the data determines the value and potential of the data under consideration, and whether it can actually be considered big data or not. The name ‘big data’ itself contains a term related to size, and hence the characteristic.

Variety The range of types and sources the data is drawn from, and an essential fact that data analysts must know. Knowing the variety helps the people who analyze the data, and are associated with it, use it effectively to their advantage and thus uphold the importance of the big data.

Velocity ‘Velocity’ in this context means how fast the data is generated and processed to meet the demands and the challenges that lie in the path of growth and development.

Variability This refers to inconsistency the data can show at times—which hampers the process of handling and managing the data effectively.

Veracity The quality of captured data can vary greatly. Accurate analysis depends on the veracity of source data.

Complexity Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey.

Factory work and Cyber-physical systems may have a 6C system:

• Connection (sensor and networks)

• Cloud (computing and data on demand)

• Cyber (model and memory)

• Content/context (meaning and correlation)

• Community (sharing and collaboration)

• Customization (personalization and value)

Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. Considering visible and invisible issues in, for example, a factory, the information generation algorithm must detect and address invisible issues such as machine degradation, component wear, etc. on the factory floor.[28][29]

4.3 Architecture

In 2000, Seisint Inc. developed a C++ based distributed file sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a modified C++ called ECL. ECL uses an “apply schema on read” method to infer the structure of stored data at the time of the query. In 2004, LexisNexis acquired Seisint Inc.[30] and in 2008 acquired ChoicePoint, Inc.[31] and their high speed parallel processing platform. The two platforms were merged into HPCC Systems, which in 2011 was open-sourced under the Apache v2.0 License. Currently HPCC and Quantcast[32] are the only publicly available platforms capable of analyzing multiple exabytes of data.

In 2004, Google published a paper on a process called MapReduce that used such an architecture. The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful,[33] so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open source project named Hadoop.[34]

MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications in an article titled “Big Data Solution Offering”.[35] The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.[36]

Recent studies show that the use of a multiple layer architecture is an option for dealing with big data. The Distributed Parallel architecture distributes data across multiple processing units, and parallel processing units provide data much faster by improving processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front end application server.[37]

Big Data Analytics for Manufacturing Applications can be based on a 5C architecture (connection, conversion, cyber, cognition, and configuration).[38]

The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.[39]
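To make the Map and Reduce steps described above concrete, the toy C sketch below splits an array into chunks, applies a map function to each chunk (standing in for the parallel nodes), and then reduces the partial results into one value. It is a single-process illustration of the programming model only; the names are invented and nothing here reflects Hadoop's actual API.

#include <stdio.h>

#define N 16
#define CHUNKS 4
#define CHUNK_SIZE (N / CHUNKS)

/* "Map": each (conceptual) node sums the squares of its own chunk. */
static long map_chunk(const int *data, int len)
{
    long partial = 0;
    for (int i = 0; i < len; i++)
        partial += (long)data[i] * data[i];
    return partial;
}

/* "Reduce": the partial results are gathered and combined. */
static long reduce(const long *partials, int count)
{
    long total = 0;
    for (int i = 0; i < count; i++)
        total += partials[i];
    return total;
}

int main(void)
{
    int data[N];
    for (int i = 0; i < N; i++)
        data[i] = i + 1;

    long partials[CHUNKS];
    /* In a real deployment each chunk would be processed on a different node. */
    for (int c = 0; c < CHUNKS; c++)
        partials[c] = map_chunk(&data[c * CHUNK_SIZE], CHUNK_SIZE);

    printf("sum of squares = %ld\n", reduce(partials, CHUNKS));
    return 0;
}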

4.4 Technologies

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[40] suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualisation. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation,[41] such as multilinear subspace learning.[42] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.

Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[43]

DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called Ayasdi.[44]

The practitioners of big data analytics processes are generally hostile to slower shared storage,[45] preferring direct-attached storage (DAS) in its various forms, from solid state drive (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, Storage area network (SAN) and Network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.

Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of a FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques. There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.[46]

4.5 Applications

Bus wrapped with SAP Big data parked outside IDF13.

Big data has increased the demand for information management specialists, in that Software AG, Oracle Corporation, IBM, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.[1]

4.5.1 Government

The use and adoption of big data within governmental processes is beneficial and allows efficiencies in terms of cost, productivity, and innovation. That said, this process does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. Below are leading examples within the governmental big data space.

United States of America

• In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government.[49] The initiative is composed of 84 different big data programs spread across six departments.[50]

• Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.[51]

• The United States Federal Government owns six of the ten most powerful supercomputers in the world.[52]

• The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes.[53][54][55]

India

• Big data analysis was, in part, responsible for the BJP and its allies winning the highly successful Indian general election of 2014.[56]

• The Indian Government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

United Kingdom

Examples of uses of big data in public services:

• Data on prescription drugs: by connecting the origin, location and time of each prescription, a research unit was able to demonstrate the considerable delay between the release of any given drug and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or recently updated drugs take some time to filter through to the general patient population.

• Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as 'meals on wheels'. The connection of data allowed the local authority to avoid any weather-related delay.

4.5.2 International development

Research on the effective usage of information and communication technologies for development (also known as ICT4D) suggests that big data technology can make important contributions but also present unique challenges to international development.[57][58] Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management.[59][60][61] However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.[59]

4.5.3 Manufacturing

Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing.[62] Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools for the systematic processing of data into useful information.[63] A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available to acquire, such as acoustics, vibration, pressure, current, voltage and controller data. Vast amounts of sensory data, in addition to historical data, construct the big data in manufacturing. The generated big data acts as the input into predictive tools and preventive strategies such as Prognostics and Health Management (PHM).[64]

Cyber-Physical Models

Current PHM implementations mostly utilize data gathered during actual usage, while analytical algorithms can perform more accurately when more information throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included. There is a need to systematically integrate, manage and analyze machinery or process data during different stages of the machine's life cycle to handle data and information more efficiently and further achieve better transparency of machine health condition for the manufacturing industry.

With such motivation, a cyber-physical (coupled) model scheme has been developed (see http://www.imscenter.net/cyber-physical-platform). The coupled model is a digital twin of the real machine that operates in the cloud platform and simulates the health condition with integrated knowledge from both data-driven analytical algorithms and other available physical knowledge. It can also be described as a 5S systematic approach consisting of Sensing, Storage, Synchronization, Synthesis and Service. The coupled model first constructs a digital image from the early design stage. System information and physical knowledge are logged during product design, based on which a simulation model is built as a reference for future analysis. Initial parameters may be statistically generalized and can be tuned using data from testing or the manufacturing process via parameter estimation. After that step, the simulation model can be considered a mirrored image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the increased connectivity offered by cloud computing technology, the coupled model also provides better accessibility of machine condition for factory managers in cases where physical access to actual equipment or machine data is limited.[29][65]

4.5.4 Media

Internet of Things (IoT)

Main article: Internet of Things

To understand how the media utilises big data, it is first necessary to provide some context into the mechanisms used for the media process. It has been suggested by Nick Couldry and Joseph Turow that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve, or convey, a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various data-mining activities.[66]

• Targeting of consumers (for advertising by marketers)

• Data-capture

Big data and the IoT work in conjunction. From a media perspective, data is the key derivative of device interconnectivity and allows accurate targeting. The Internet of Things, with the help of big data, therefore transforms the media industry, companies and even governments, opening up a new era of economic growth and competitiveness. The intersection of people, data and intelligent algorithms has far-reaching impacts on media efficiency. The wealth of data generated allows an elaborate layer on the present targeting mechanisms of the industry.

Technology

• eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising (see “Inside eBay's 90PB data warehouse”).

• Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.[67]

• Facebook handles 50 billion photos from its user base.[68]

• As of August 2012, Google was handling roughly 100 billion searches per month.[69]

• Oracle NoSQL Database has been tested past the 1 million ops/sec mark with 8 shards and proceeded to hit 1.2 million ops/sec with 10 shards.[70]

4.5.5 Private sector

Retail

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.[1]

Retail banking

• FICO Card Detection System protects accounts world-wide.[71]

• The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.[72][73]

Real estate

• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.[74]

4.5.6 Science

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% [75] of these streams, there are 100 collisions of interest per second.[76][77][78]

• As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25 petabyte annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.

• If all sensor data were recorded in the LHC, the data flow would be extremely hard to work with: it would exceed a 150 million petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world (a quick back-of-the-envelope check of these figures follows below).
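The unit conversions behind these figures can be checked with a short calculation. The sketch below uses decimal (SI) prefixes and a 365-day year; it reproduces the order of magnitude quoted above rather than an exact value.

# Back-of-the-envelope check of the hypothetical "record everything" LHC data rate.
PB = 10**15          # petabyte in bytes (SI)
EB = 10**18          # exabyte in bytes (SI)

annual_rate_pb = 150e6                  # 150 million petabytes per year
bytes_per_year = annual_rate_pb * PB
bytes_per_day = bytes_per_year / 365

print(f"{bytes_per_day:.2e} bytes per day")          # ~4.1e+20, i.e. roughly 5x10^20
print(f"{bytes_per_day / EB:.0f} exabytes per day")  # ~411, i.e. "nearly 500 exabytes"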

The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.[79][80] It is considered one of the most ambitious scientific projects ever undertaken.

Science and research

• When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, its designers expect it to acquire that amount of data every five days.[1]

• Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. DNA sequencers have divided the sequencing cost by 10,000 over the last ten years, which is 100 times cheaper than the reduction in cost predicted by Moore's law.[81]

• The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.[82]

• Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate 'friction points', or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast resources of Google's search servers to scale social experiments, which would usually take years, almost instantly.

4.6 Research activities

Encrypted search and cluster formation in big data was demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach (MIT Computer Science and Artificial Intelligence Laboratory) and Dr. Amir Esmailpour (UNH Research Group) investigated the key features of big data, namely the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form at the cloud interface, providing raw definitions and real-time examples within the technology. Moreover, they proposed an approach for identifying the encoding technique to advance towards an expedited search over encrypted text, leading to security enhancements in big data.[83]

In March 2012, the White House announced a national “Big Data Initiative” that consisted of six federal departments and agencies committing more than $200 million to big data research projects.[84] The initiative included a National Science Foundation “Expeditions in Computing” grant of $10 million over 5 years to the AMPLab[85] at the University of California, Berkeley.[86] The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems from predicting traffic congestion[87] to fighting cancer.[88]

The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute,[89] led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers.

The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions.[90] The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.[91]

The European Commission is funding the 2-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program.[92]

The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyse large data sets.[93]

At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.[94]

To make manufacturing more competitive in the United States (and globally), there is a need to integrate more American ingenuity and innovation into manufacturing; therefore, the National Science Foundation has granted the Industry–University Cooperative Research Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati funding to focus on developing advanced predictive tools and techniques applicable in a big data environment.[64][95] In May 2013, the IMS Center held an industry advisory board meeting focusing on big data, where presenters from various industrial companies discussed their concerns, issues and future goals in the big data environment.

Computational social sciences: anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences.[96] Often these APIs are provided for free.[96] Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators.[97][98][99] The authors of the study examined Google query logs by computing the ratio of the volume of searches for the coming year ('2011') to the volume of searches for the previous year ('2009'), which they call the 'future orientation index'.[100] They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.

Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.[101] Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports,[102] suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.[103][104][105][106][107][108][109][110]
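The 'future orientation index' described above is simple to reproduce once search-volume counts are available. The sketch below uses invented volume numbers (the country names and counts are placeholders, not data from the published study) for an analysis year of 2010, so the index compares searches for '2011' with searches for '2009'.

# Toy "future orientation index": for each country, the ratio of search volume
# for the coming year ("2011") to the volume for the previous year ("2009").
# All numbers below are invented placeholders.
search_volume = {
    "Country A": {"2009": 120_000, "2011": 180_000},
    "Country B": {"2009": 200_000, "2011": 150_000},
    "Country C": {"2009": 90_000,  "2011": 90_000},
}

def future_orientation_index(volumes):
    return volumes["2011"] / volumes["2009"]

for country, volumes in search_volume.items():
    print(country, round(future_orientation_index(volumes), 2))
# An index above 1 means users searched more about the coming year than the
# previous one; the study correlated this index with per capita GDP.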

4.6.1 Sampling Big Data

An important research question that can be asked about big data sets is whether you need to look at the full data to draw certain conclusions about the properties of the data, or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But sampling (statistics) enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. For example, there are about 600 million tweets produced every day. Is it necessary to look at all of them to determine the topics that are discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on each of the topics? In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage and controller data are available at short time intervals. To predict downtime it may not be necessary to look at all the data, but a sample may be sufficient. There has been some work done on sampling algorithms for big data; a theoretical formulation for sampling Twitter data has been developed.[111]
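One standard way to draw a fixed-size uniform sample from a stream whose length is unknown in advance (such as a day's tweets) is reservoir sampling. The sketch below is a generic illustration of that idea, not the specific formulation cited above; the stream of strings stands in for tweets.

import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # keep the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 "tweets" from a simulated stream of 600,000.
stream = (f"tweet #{n}" for n in range(600_000))
print(reservoir_sample(stream, k=5, seed=42))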

4.7 Critique

Critiques of the big data paradigm come in two flavours: those that question the implications of the approach itself, and those that question the way it is currently done.

Cartoon critical of big data application, by T. Gregorius

4.7.1 Critiques of the big data paradigm

“A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data”.[14] In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory, focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts.[112]

Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, “big data”, no matter how comprehensive or well analyzed, must be complemented by “big judgment”, according to an article in the Harvard Business Review.[113]

Much in the same line, it has been pointed out that the decisions based on the analysis of big data are inevitably “informed by the world as it was in the past, or, at best, as it currently is”.[59] Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past. If the system dynamics of the future change, the past can say little about the future. For this, it would be necessary to have a thorough understanding of the system dynamics, which implies theory.[114] As a response to this critique it has been suggested to combine big data approaches with computer simulations, such as agent-based models[59] and complex systems.[115] Agent-based models are increasingly getting better at predicting the outcome of social complexities of even unknown future scenarios through computer simulations that are based on a collection of mutually interdependent algorithms.[116][117] In addition, the use of multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, has proven useful as an analytic approach that goes well beyond the bivariate approaches (cross-tabs) typically employed with smaller data sets.

In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis.[118] A new postulate is accepted now in biosciences: the information provided by the data in huge volumes (omics) without a prior hypothesis is complementary and sometimes necessary to conventional approaches based on experimentation. In the massive approaches it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed and the limits of induction (“Glory of Science and Philosophy scandal”, C. D. Broad, 1926) are to be considered.

Privacy advocates are concerned about the threat to privacy represented by increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy.[119][120][121]

4.7.2 Critiques of big data execution

Big data has been called a “fad” in scientific research, and its use was even made fun of as an absurd practice in a satirical example on “pig data”.[96] Researcher danah boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample by being too concerned about actually handling the huge amounts of data.[122] This approach may lead to results that are biased in one way or another. Integration across heterogeneous data resources (some that might be considered “big data” and others not) presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.[123]

In the provocative article “Critical Questions for Big Data”,[124] the authors call big data a part of mythology: “large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy”. Users of big data are often “lost in the sheer volume of numbers”, and “working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth”.[124] Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data, through automated filtering of non-useful data and correlations.[125]

Big data analysis is often shallow compared to analysis of smaller data sets.[126] In many big data projects, there is no large data analysis happening; the challenge is the extract, transform, load part of data preprocessing.[126]

Big data is a buzzword and a “vague term”,[127] but at the same time an “obsession”[127] with entrepreneurs, consultants, scientists and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions solely based on Twitter were more often off than on target. Big data often poses the same challenges as small data, and adding more data does not solve problems of bias, but may emphasize other problems. In particular, data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. Google Translate, which is based on big data statistical analysis of text, does a good job at translating web pages; however, results from specialized domains may be dramatically skewed. On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant. Ioannidis argued that “most published research findings are false”[128] due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a big amount of scientific data, although not with big data technology), the likelihood of a “significant” result being actually false grows fast, even more so when only positive results are published. This multiple comparisons effect is easy to demonstrate with a small simulation (see the sketch below).
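A minimal simulation of the multiple comparisons problem: the data below are pure noise, yet testing many hypotheses at the conventional 5% significance level still yields a number of "significant" findings by chance alone. The numbers of hypotheses and samples are arbitrary choices for illustration.

import random

random.seed(0)

ALPHA = 0.05          # conventional significance threshold
N_HYPOTHESES = 1000   # number of independent tests run on pure noise
N_SAMPLES = 50        # observations per test

false_positives = 0
for _ in range(N_HYPOTHESES):
    # The null hypothesis is true by construction: the data are standard normal noise.
    sample = [random.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]
    mean = sum(sample) / N_SAMPLES
    # Approximate z-test of "mean differs from zero" (standard error = 1/sqrt(n)).
    z = mean / (1.0 / N_SAMPLES ** 0.5)
    if abs(z) > 1.96:                 # two-sided critical value for ALPHA = 0.05
        false_positives += 1

print(false_positives, "of", N_HYPOTHESES, "tests look 'significant'")  # roughly 5%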

4.8 See also

• Apache Accumulo


• Big Memory
• Big Data to Knowledge
• Data Defined Storage
• Data mining
• Cask (company)
• Cloudera
• HPCC Systems
• Intelligent Maintenance Systems
• Internet of Things
• MapReduce
• Hortonworks
• Oracle NoSQL Database
• Nonlinear system identification
• Operations research
• Programming with Big Data in R (a series of R packages)
• Sqrrl
• Supercomputer
• Talend
• Transreality gaming
• Tuple space
• Unstructured data
• Spark

4.9 References

[1] “Data, data everywhere”. The Economist. 25 February 2010. Retrieved 9 December 2012.

[2] “Community cleverness required”. Nature 455 (7209): 1. 4 September 2008. doi:10.1038/455001a.

[3] “Sandia sees data management challenges spiral”. HPC Projects. 4 August 2009.

[4] Reichman, O.J.; Jones, M.B.; Schildhauer, M.P. (2011). “Challenges and Opportunities of Open Data in Ecology”. Science 331 (6018): 703–5. doi:10.1126/science.1197962. PMID 21311007.

[5] “Data Crush by Christopher Surdak”. Retrieved 14 February 2014.

[6] Hellerstein, Joe (9 November 2008). “Parallel Programming in the Age of Big Data”. Gigaom Blog.

[7] Segaran, Toby; Hammerbacher, Jeff (2009). Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media. p. 257. ISBN 978-0-596-15711-1.

[8] Hilbert & López 2011

[9] “IBM What is big data? — Bringing big data to the enterprise”. www.ibm.com. Retrieved 2013-08-26.

[10] Oracle and FSN, “Mastering Big Data: CFO Strategies to Transform Insight into Opportunity”, December 2012

[11] “Computing Platforms for Analytics, Data Mining, Data Science”. kdnuggets.com. Retrieved 15 April 2015.

[12] Jacobs, A. (6 July 2009). “The Pathologies of Big Data”. ACMQueue.

[13] Magoulas, Roger; Lorica, Ben (February 2009). “Introduction to Big Data”. Release 2.0 (Sebastopol CA: O’Reilly Media) (11).

[14] Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of Internet”. International Journal of Internet Science 7: 1–5.

[15] Hashem, Ibrahim Abaker Targio; Yaqoob, Ibrar; Badrul Anuar, Nor; Mokhtar, Salimah; Gani, Abdullah; Ullah Khan, Samee (2015). “The rise of “big data” on cloud computing: Review and open research issues”. Information Systems 47: 98–115. doi:10.1016/j.is.2014.07.006.

[16] Laney, Douglas. “3D Data Management: Controlling Data Volume, Velocity and Variety” (PDF). Gartner. Retrieved 6 February 2001.

[17] Beyer, Mark. “Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data”. Gartner. Archived from the original on 10 July 2011. Retrieved 13 July 2011.

[18] Laney, Douglas. “The Importance of 'Big Data': A Definition”. Gartner. Retrieved 21 June 2012.

[19] “What is Big Data?". Villanova University.

[20] De Mauro, Andrea; Greco, Marco; Grimaldi, Michele (2015). “What is big data? A consensual definition and a review of key research topics”. AIP Conference Proceedings 1644: 97–104. doi:10.1063/1.4907823.

[21] Hilbert, M. Big Data for Development: A Review of Promises and Challenges. Development Policy Review. accessible at martinhilbert.net/big-data-for-development

[22] Hilbert, M. (2015). Digital Technology and Social Change [Open Online Course at the University of California] (freely available). https://www.youtube.com/watch?v=XRVIh1h47sA&index=51&list=PLtjBSCvWCU3rNm46D3R85efM0hrzjuAIg Retrieved from https://canvas.instructure.com/courses/949415

[23] Mayer-Schönberger, V., & Cukier, K. (2013). Big data: a revolution that will transform how we live, work and think. London: John Murray.

[24] http://www.bigdataparis.com/presentation/mercredi/PDelort.pdf?PHPSESSID=tv7k70pcr3egpi2r6fi3qbjtj6#page=4

[25] Billings S.A. “Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains”. Wiley, 2013

[26] Delort P., Big data Paris 2013 http://www.andsi.fr/tag/dsi-big-data/

[27] Delort P., Big Data car Low-Density Data ? La faible densité en information comme facteur discriminant http://lecercle.lesechos.fr/entrepreneur/tendances-innovation/221169222/big-data-low-density-data-faible-densite-information-com

[28] Lee, Jay; Bagheri, Behrad; Kao, Hung-An (2014). “Recent Advances and Trends of Cyber-Physical Systems and Big Data Analytics in Industrial Informatics”. IEEE Int. Conference on Industrial Informatics (INDIN) 2014.

[29] Lee, Jay; Lapira, Edzel; Bagheri, Behrad; Kao, Hung-an. “Recent advances and trends in predictive manufacturing systems in big data environment”. Manufacturing Letters 1 (1): 38–41. doi:10.1016/j.mfglet.2013.09.005.

[30] “LexisNexis To Buy Seisint For $775 Million”. Washington Post. Retrieved 15 July 2004.

[31] “LexisNexis Parent Set to Buy ChoicePoint”. Washington Post. Retrieved 22 February 2008.

[32] “Quantcast Opens Exabyte-Ready File System”. www.datanami.com. Retrieved 1 October 2012.

[33] Bertolucci, Jeff “Hadoop: From Experiment To Leading Big Data Platform”, “Information Week”, 2013. Retrieved on 14 November 2013.

[34] Webster, John. “MapReduce: Simplified Data Processing on Large Clusters”, “Search Storage”, 2004. Retrieved on 25 March 2013.

[35] “Big Data Solution Offering”. MIKE2.0. Retrieved 8 Dec 2013.

[36] “Big Data Definition”. MIKE2.0. Retrieved 9 March 2013.

[37] Boja, C; Pocovnicu, A; Bătăgan, L. (2012). “Distributed Parallel Architecture for Big Data”. Informatica Economica 16 (2): 116–127.

[38] Intelligent Maintenance System

[39] http://www.hcltech.com/sites/default/files/solving_key_businesschallenges_with_big_data_lake_0.pdf

[40] Manyika, James; Chui, Michael; Bughin, Jaques; Brown, Brad; Dobbs, Richard; Roxburgh, Charles; Byers, Angela Hung (May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

[41] “Future Directions in Tensor-Based Computation and Modeling” (PDF). May 2009.

[42] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). “A Survey of Multilinear Subspace Learning for Tensor Data” (PDF). Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004.

[43] Monash, Curt (30 April 2009). “eBay’s two enormous data warehouses”. Monash, Curt (6 October 2010). “eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more”.

[44] “Resources on how Topological Data Analysis is used to analyze big data”. Ayasdi.

[45] CNET News (1 April 2011). “Storage area networks need not apply”.

[46] “How New Analytic Systems will Impact Storage”. September 2011.

[47] “What Is the Content of the World’s Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?", Martin Hilbert (2014), The Information Society; free access to the article through this link: martinhilbert.net/WhatsTheContent_Hilbert.pdf

[48] Rajpurohit, Anmol (2014-07-11). “Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools”. KDnuggets. Retrieved 2014-07-14. Dr. Amy Gershkoff: “Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions.”

[49] Kalil, Tom. “Big Data is a Big Deal”. White House. Retrieved 26 September 2012.

[50] Executive Office of the President (March 2012). “Big Data Across the Federal Government” (PDF). White House. Retrieved 26 September 2012.

[51] Lampitt, Andrew. “The real story of how big data analytics helped Obama win”. Infoworld. Retrieved 31 May 2014.

[52] Hoover, J. Nicholas. “Government's 10 Most Powerful Supercomputers”. Information Week. UBM. Retrieved 26 September 2012.

[53] Bamford, James (15 March 2012). “The NSA Is Building the Country’s Biggest Spy Center (Watch What You Say)". Wired Magazine. Retrieved 2013-03-18.

[54] “Groundbreaking Ceremony Held for $1.2 Billion Utah Data Center”. National Security Agency Central Security Service. Retrieved 2013-03-18.

[55] Hill, Kashmir. “Blueprints Of NSA's Ridiculously Expensive Data Center In Utah Suggest It Holds Less Info Than Thought”. Forbes. Retrieved 2013-10-31.

[56] “News: Live Mint”. Are Indian companies making enough sense of Big Data?. Live Mint - http://www.livemint.com/. 2014-06-23. Retrieved 2014-11-22.

[57] UN Global Pulse (2012). Big Data for Development: Opportunities and Challenges (White paper by Letouzé, E.). New York: United Nations. Retrieved from http://www.unglobalpulse.org/projects/BigDataforDevelopment

[58] WEF (World Economic Forum), & Vital Wave Consulting. (2012). Big Data, Big Impact: New Possibilities for International Development. World Economic Forum. Retrieved 24 August 2012, from http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development

[59] “Big Data for Development: From Information- to Knowledge Societies”, Martin Hilbert (2013), SSRN Scholarly Paper No. ID 2205145). Rochester, NY: Social Science Research Network; http://papers.ssrn.com/abstract=2205145

[60] “Elena Kvochko, Four Ways To talk About Big Data (Information Communication Technologies for Development Series)". worldbank.org. Retrieved 2012-05-30.

[61] “Daniele Medri: Big Data & Business: An on-going revolution”. Statistics Views. 21 Oct 2013.

[62] “Manufacturing: Big Data Benefits and Challenges”. TCS Big Data Study. Mumbai, India: Tata Consultancy Services Limited. Retrieved 2014-06-03.

[63] Lee, Jay; Wu, F.; Zhao, W.; Ghaffari, M.; Liao, L (Jan 2013). “Prognostics and health management design for rotary machinery systems—Reviews, methodology and applications”. Mechanical Systems and Signal Processing 42 (1).

[64] “Center for Intelligent Maintenance Systems (IMS Center)".

[65] Predictive manufacturing system

[66] Couldry, Nick; Turow, Joseph (2014). “Advertising, Big Data, and the Clearance of the Public Realm: Marketers’ New Approaches to the Content Subsidy”. International Journal of Communication 8: 1710–1726.

[67] Layton, Julia. “Amazon Technology”. Money.howstuffworks.com. Retrieved 2013-03-05.

[68] “Scaling Facebook to 500 Million Users and Beyond”. Facebook.com. Retrieved 2013-07-21.

[69] “Google Still Doing At Least 1 Trillion Searches Per Year”. Search Engine Land. 16 January 2015. Retrieved 15 April 2015.

[70] Lamb, Charles. “Oracle NoSQL Database Exceeds 1 Million Mixed YCSB Ops/Sec”.

[71] “FICO® Falcon® Fraud Manager”. Fico.com. Retrieved 2013-07-21.

[72] “eBay Study: How to Build Trust and Improve the Shopping Experience”. Knowwpcarey.com. 2012-05-08. Retrieved 2013-03-05.

[73] Leading Priorities for Big Data for Business and IT. eMarketer. October 2013. Retrieved January 2014.

[74] Wingfield, Nick (2013-03-12). “Predicting Commutes More Accurately for Would-Be Home Buyers - NYTimes.com”. Bits.blogs.nytimes.com. Retrieved 2013-07-21.

[75] Alexandru, Dan. “Prof” (PDF). cds.cern.ch. CERN. Retrieved 24 March 2015.

[76] “LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public.”. CERN-Brochure-2010-006-Eng. LHC Brochure, English version. CERN. Retrieved 20 January 2013.

[77] “LHC Guide, English version. A collection of facts and figures about the Large Hadron Collider (LHC) in the form of questions and answers.”. CERN-Brochure-2008-001-Eng. LHC Guide, English version. CERN. Retrieved 20 January 2013.

[78] Brumfiel, Geoff (19 January 2011). “High-energy physics: Down the petabyte highway”. Nature 469. pp. 282–83. doi:10.1038/469282a.

[79] http://www.zurich.ibm.com/pdf/astron/CeBIT%202013%20Background%20DOME.pdf

[80] “Future telescope array drives development of exabyte processing”. Ars Technica. Retrieved 15 April 2015.

[81] Delort P., OECD ICCP Technology Foresight Forum, 2012. http://www.oecd.org/sti/ieconomy/Session_3_Delort.pdf#page=6

[82] Webster, Phil. “Supercomputing the Climate: NASA’s Big Data Mission”. CSC World. Computer Sciences Corporation. Retrieved 2013-01-18.

[83] Siwach, Gautam; Esmailpour, Amir (March 2014). Encrypted Search & Cluster Formation in Big Data (PDF). ASEE 2014 Zone I Conference. University of Bridgeport, Bridgeport, Connecticut, USA.

[84] “Obama Administration Unveils “Big Data” Initiative:Announces $200 Million In New R&D Investments” (PDF). The White House.

[85] “AMPLab at the University of California, Berkeley”. Amplab.cs.berkeley.edu. Retrieved 2013-03-05.

[86] “NSF Leads Federal Efforts In Big Data”. National Science Foundation (NSF). 29 March 2012.

[87] Timothy Hunter; Teodor Moldovan; Matei Zaharia; Justin Ma; Michael Franklin; Pieter Abbeel; Alexandre Bayen (October 2011). Scaling the Mobile Millennium System in the Cloud.

[88] David Patterson (5 December 2011). “Computer Scientists May Have What It Takes to Help Cure Cancer”. The New York Times.

[89] “Secretary Chu Announces New Institute to Help Scientists Improve Massive Data Set Research on DOE Supercomputers”. “energy.gov”.

[90] “Governor Patrick announces new initiative to strengthen Massachusetts' position as a World leader in Big Data”. Commonwealth of Massachusetts.

[91] “Big Data @ CSAIL”. Bigdata.csail.mit.edu. 2013-02-22. Retrieved 2013-03-05.

[92] “Big Data Public Private Forum”. Cordis.europa.eu. 2012-09-01. Retrieved 2013-03-05.

[93] “Alan Turing Institute to be set up to research big data”. BBC News. 19 March 2014. Retrieved 2014-03-19.

[94] “Inspiration day at University of Waterloo, Stratford Campus”. http://www.betakit.com/. Retrieved 2014-02-28.

[95] Lee, Jay; Lapira, Edzel; Bagheri, Behrad; Kao, Hung-An (2013). “Recent Advances and Trends in Predictive Manufac- turing Systems in Big Data Environment”. Manufacturing Letters 1 (1): 38–41. doi:10.1016/j.mfglet.2013.09.005.

[96] Reips, Ulf-Dietrich; Matzat, Uwe (2014). “Mining “Big Data” using Big Data Services”. International Journal of Internet Science 1 (1): 1–8.

[97] Preis, Tobias; Moat, Helen Susannah; Stanley, H. Eugene; Bishop, Steven R. (2012). “Quantifying the Advantage of Looking Forward”. Scientific Reports 2: 350. doi:10.1038/srep00350. PMC 3320057. PMID 22482034.

[98] Marks, Paul (5 April 2012). “Online searches for future linked to economic success”. New Scientist. Retrieved 9 April 2012.

[99] Johnston, Casey (6 April 2012). “Google Trends reveals clues about the mentality of richer nations”. Ars Technica. Retrieved 9 April 2012.

[100] Tobias Preis (2012-05-24). “Supplementary Information: The Future Orientation Index is available for download” (PDF). Retrieved 2012-05-24.

[101] Philip Ball (26 April 2013). “Counting Google searches predicts market movements”. Nature. Retrieved 9 August 2013.

[102] Tobias Preis, Helen Susannah Moat and H. Eugene Stanley (2013). “Quantifying Trading Behavior in Financial Markets Using Google Trends”. Scientific Reports 3: 1684. doi:10.1038/srep01684.

[103] Nick Bilton (26 April 2013). “Google Search Terms Can Predict Stock Market, Study Finds”. New York Times. Retrieved 9 August 2013.

[104] Christopher Matthews (26 April 2013). “Trouble With Your Investment Portfolio? Google It!". TIME Magazine. Retrieved 9 August 2013.

[105] Philip Ball (26 April 2013). “Counting Google searches predicts market movements”. Nature. Retrieved 9 August 2013.

[106] Bernhard Warner (25 April 2013). "'Big Data' Researchers Turn to Google to Beat the Markets”. Bloomberg Businessweek. Retrieved 9 August 2013.

[107] Hamish McRae (28 April 2013). “Hamish McRae: Need a valuable handle on investor sentiment? Google it”. The Independent (London). Retrieved 9 August 2013.

[108] Richard Waters (25 April 2013). “Google search proves to be new word in stock market prediction”. Financial Times. Retrieved 9 August 2013.

[109] David Leinweber (26 April 2013). “Big Data Gets Bigger: Now Google Trends Can Predict The Market”. Forbes. Retrieved 9 August 2013.

[110] Jason Palmer (25 April 2013). “Google searches predict market moves”. BBC. Retrieved 9 August 2013.

[111] Deepan Palguna, Vikas Joshi, Venkatesan Chakaravarthy, Ravi Kothari and L. V. Subramaniam (2015). Analysis of Sampling Algorithms for Twitter. International Joint Conference on Artificial Intelligence.

[112] Graham M. (9 March 2012). “Big data and the end of theory?". The Guardian (London).

[113] “Good Data Won't Guarantee Good Decisions. Harvard Business Review”. Shah, Shvetank; Horne, Andrew; Capellá, Jaime;. HBR.org. Retrieved 8 September 2012.

[114] Anderson, C. (2008, 23 June). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine, (Science: Discoveries). http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

[115] Braha, D.; Stacey, B.; Bar-Yam, Y. (2011). “Corporate Competition: A Self-organized Network”. Social Networks 33: 219–230.

[116] Rauch, J. (2002). Seeing Around Corners. The Atlantic, (April), 35–48. http://www.theatlantic.com/magazine/archive/ 2002/04/seeing-around-corners/302471/

[117] Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book.

[118] Delort P., Big data in Biosciences, Big Data Paris, 2012 http://www.bigdataparis.com/documents/Pierre-Delort-INSERM.pdf#page=5

[119] Ohm, Paul. “Don't Build a Database of Ruin”. Harvard Business Review.

[120] Darwin Bond-Graham, Iron Cagebook - The Logical End of Facebook’s Patents, Counterpunch.org, 2013.12.03

[121] Darwin Bond-Graham, Inside the Tech industry’s Startup Conference, Counterpunch.org, 2013.09.11

[122] danah boyd (2010-04-29). “Privacy and Publicity in the Context of Big Data”. WWW 2010 conference. Retrieved 2011- 04-18.

[123] Jones, MB; Schildhauer, MP; Reichman, OJ; Bowers, S (2006). “The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere” (PDF). Annual Review of Ecology, Evolution, and Systematics 37 (1): 519–544. doi:10.1146/annurev.ecolsys.37.091305.110031.

[124] Boyd, D.; Crawford, K. (2012). “Critical Questions for Big Data”. Information, Communication & Society 15 (5): 662. doi:10.1080/1369118X.2012.678878.

[125] Failure to Launch: From Big Data to Big Decisions, Forte Wares.

[126] Gregory Piatetsky (2014-08-12). “Interview: Michael Berthold, KNIME Founder, on Research, Creativity, Big Data, and Privacy, Part 2”. KDnuggets. Retrieved 2014-08-13.

[127] Harford, Tim (2014-03-28). “Big data: are we making a big mistake?". Financial Times. Financial Times. Retrieved 2014-04-07.

[128] Ioannidis, J. P. A. (2005). “Why Most Published Research Findings Are False”. PLoS Medicine 2 (8): e124. doi:10.1371/journal.pmed.0020124. PMC 1182327. PMID 16060722.

4.10 Further reading

• Sharma, Sugam; Tim, Udoyara S; Wong, Johnny; Gadia, Shashi; Sharma, Subhash (2014). “A BRIEF REVIEW ON LEADING BIG DATA MODELS”. Data Science Journal 13.

• Big Data Computing and Clouds: Challenges, Solutions, and Future Directions. Marcos D. Assuncao, Rodrigo N. Calheiros, Silvia Bianchi, Marco A. S. Netto, Rajkumar Buyya. Technical Report CLOUDS-TR-2013-1, Cloud Computing and Distributed Systems Laboratory, The University of Melbourne, 17 Dec. 2013.

• Encrypted search & cluster formation in Big Data. Gautam Siwach, Dr. A. Esmailpour. American Society for Engineering Education, Conference at the University of Bridgeport, Bridgeport, Connecticut 3–5 April 2014.

• “Big Data for Good” (PDF). ODBMS.org. 5 June 2012. Retrieved 2013-11-12.

• Hilbert, Martin; López, Priscila (2011). “The World's Technological Capacity to Store, Communicate, and Compute Information”. Science 332 (6025): 60–65. doi:10.1126/science.1200970. PMID 21310967.

• “The Rise of Industrial Big Data”. GE Intelligent Platforms. Retrieved 2013-11-12.

• History of Big Data Timeline. A visual history of Big Data with links to supporting articles.

4.11 External links

• Media related to Big data at Wikimedia Commons

• The dictionary definition of big data at Wiktionary

• MIT Big Data Initiative / Tackling the Challenges of Big Data

Chapter 5

Big Memory

Big Memory is a software and hardware approach that facilitates the storing, retrieval and processing of large data sets (terabytes and higher). The term is akin to Big Data, and in some instances Big Memory is a form of Big Data processing architecture implemented in memory rather than on disks/storage. Caches are one common use of Big Memory. Computer memory (RAM) works orders of magnitude faster than spinning disks or even solid-state drives. This is usually due to higher raw data throughput because of the tighter coupling of CPU and RAM chips (a wider bus; CPU and RAM are usually installed on the same motherboard). Locality of reference is another important characteristic for caches and fast access.

The price of computer memory chips declined significantly in the late 2000s. As of 2015 it is affordable to have 256 gigabytes of RAM on a server.[1] Currently, not many vendors have solid software Big Memory solutions, while there are plentiful hardware options (i.e. inexpensive RAM modules). Terracotta has developed an “in-memory data management suite”.[2]

Contemporary software platforms based on garbage-collected models, such as the .NET CLR, Java and others, cannot usually store hundreds of millions of resident objects (objects staying in RAM for minutes or longer and getting promoted to older generations) directly, as this results in GC stalls that significantly affect performance.[3][4] As of 2015 there are still no efficient simple GC solutions that allow storing 16 GB+ of objects in language-native heaps, so hybrid approaches are emerging for memory-hungry apps in managed environments. The NFX Unistack framework provides the concept of a large managed-memory 'Pile' on the .NET CLR platform, based on pre-allocated large byte[] buffers that do not slow the GC down.[5] The solution allows easily storing 300,000,000 business objects on a machine with 64 GB of RAM while sustaining millions of object put/get transactions a second. It was purposely designed for processing Big Memory data sets without going to disk or the network.[6][7]
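The 'pile' idea described above (keeping many small objects out of the collector's view by packing them into a few large pre-allocated buffers and addressing them by integer handles) is not tied to .NET. The sketch below is a language-agnostic illustration in Python, not the NFX API: objects are serialized into one large bytearray and referenced by offset, so the runtime tracks one buffer instead of millions of boxed objects. Capacity management and deletion are omitted to keep the toy short.

import pickle

class Pile:
    """Toy object pile: serialize objects into one large pre-allocated buffer
    and hand out integer offsets instead of object references."""

    def __init__(self, capacity=64 * 1024 * 1024):
        self._buf = bytearray(capacity)   # one big allocation up front
        self._end = 0                     # next free byte

    def put(self, obj):
        blob = pickle.dumps(obj)
        start = self._end
        self._buf[start:start + 4] = len(blob).to_bytes(4, "big")   # length header
        self._buf[start + 4:start + 4 + len(blob)] = blob
        self._end = start + 4 + len(blob)
        return start                      # the "pointer" is just an offset

    def get(self, offset):
        size = int.from_bytes(self._buf[offset:offset + 4], "big")
        return pickle.loads(bytes(self._buf[offset + 4:offset + 4 + size]))

pile = Pile()
handle = pile.put({"id": 42, "name": "business object"})
print(pile.get(handle))   # {'id': 42, 'name': 'business object'}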

5.1 References

[1] 128 Gb ram chip on NewEgg - http://www.newegg.com/Product/Product.aspx?Item=9SIA7S634M7975

[2] Terracotta, Inc. - http://terracotta.org/products/bigmemorymax

[3] Understanding GC pauses in JVM, HotSpot's minor GC. - http://blog.ragozin.info/2011/06/understanding-gc-pauses-in-jvm-hotspots.

[4] Long GC pauses in application - http://stackoverflow.com/questions/15696585/long-gc-pauses-in-application

[5] NFX GitHub - http://github.com/aumcode/nfx

[6] About Managed Object Pile - https://www.youtube.com/watch?v=WFA1XirINB0

[7] .NET Heap with hundreds of millions of objects - https://www.youtube.com/watch?v=Dz_7hukyejQ

5.2 External links

• Media related to Big data at Wikimedia Commons


• The dictionary definition of big data at Wiktionary

Chapter 6

Brooks–Iyengar algorithm

The Brooks–Iyengar algorithm or Brooks–Iyengar hybrid algorithm[1] is a distributed algorithm that improves both the precision and accuracy of the measurements taken by a distributed sensor network, even in the presence of faulty sensors.[2] The sensor network does this by exchanging the measured value and accuracy value at every node with every other node, and it computes the accuracy range and a measured value for the whole network from all of the values collected. Even if some of the data from some of the sensors is faulty, the sensor network will not malfunction.

6.1 Background

The Brooks–Iyengar hybrid algorithm for distributed control in the presence of noisy data combines Byzantine agreement with sensor fusion. It bridges the gap between sensor fusion and Byzantine fault tolerance.[3] This seminal algorithm unified these disparate fields for the first time. Essentially, it combines Dolev's[4] algorithm for approximate agreement with Mahaney and Schneider's fast convergence algorithm (FCA). The algorithm assumes N processing elements (PEs), t of which are faulty and can behave maliciously. It takes as input either real values with inherent inaccuracy or noise (which can be unknown), or a real value with a priori defined uncertainty, or an interval. The output of the algorithm is a real value with an explicitly specified accuracy. The algorithm runs in O(N log N) time, where N is the number of PEs: see Big O notation. It is possible to modify this algorithm to correspond to Crusader's Convergence Algorithm (CCA),[5] however the bandwidth requirement will also increase. The algorithm has applications in distributed control, software reliability, high-performance computing, etc.[6]

6.2 Algorithm

The Brooks–Iyengar algorithm is executed in every sensor node of a distributed sensor network. Each sensor exchanges its measured value and accuracy value with all other sensors in the network. The accuracy range the algorithm finds is bounded by the lowest lower bound and the highest upper bound returned from all the sensors. The “fused” measurement is a weighted average of the midpoints of the regions found.[7]

STEP 1: Each processing element receives the values from all other processing elements and forms a set V.

STEP 2: Perform the optimal region algorithm on V and return a set A consisting of the ranges of values where at least N − T processing elements intersect.

STEP 3: Output the range defined by the lowest lower bound and the largest upper bound in A. These are the accuracy bounds of the answer.

STEP 4: The answer is the weighted average of the midpoints of the ranges in A, where each midpoint is weighted by the number of sensors whose readings intersect its range.
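A minimal sketch of these four steps in Python, under the assumption that each sensor's report has already been turned into an interval (measurement ± accuracy). It uses an endpoint sweep to find the regions where at least N − T intervals overlap; as a simplification, adjacent regions with equal support are not merged and intervals touching only at an endpoint are not counted as overlapping.

def brooks_iyengar(intervals, t):
    """Fuse interval estimates from N sensors, tolerating up to t faulty ones.
    intervals: list of (low, high) pairs, one per processing element.
    Returns (fused_value, (lower_bound, upper_bound))."""
    n = len(intervals)
    # STEP 1: the set V is simply the collected intervals.
    events = []                      # endpoint sweep: +1 opens an interval, -1 closes it
    for lo, hi in intervals:
        events.append((lo, +1))
        events.append((hi, -1))
    events.sort()

    # STEP 2: regions where at least n - t intervals intersect, with their support.
    regions, coverage, prev = [], 0, None
    for point, delta in events:
        if prev is not None and point > prev and coverage >= n - t:
            regions.append((prev, point, coverage))
        coverage += delta
        prev = point
    if not regions:
        raise ValueError("no region supported by at least n - t sensors")

    # STEP 3: accuracy bounds of the answer.
    bounds = (min(lo for lo, _, _ in regions), max(hi for _, hi, _ in regions))

    # STEP 4: weighted average of region midpoints, weighted by their support.
    weight = sum(c for _, _, c in regions)
    fused = sum((lo + hi) / 2 * c for lo, hi, c in regions) / weight
    return fused, bounds

# Five sensors reporting value ± accuracy as intervals; the first one is faulty (skewed high).
readings = [(2.7, 6.7), (0.0, 3.2), (1.5, 4.5), (0.8, 2.8), (1.4, 4.6)]
print(brooks_iyengar(readings, t=1))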


6.3 Algorithm characteristics

1. Faulty PEs tolerated < N/3
2. Maximum faulty PEs < 2N/3
3. Complexity = O(N log N)
4. Order of network bandwidth = O(N)
5. Convergence = 2t/N
6. Accuracy = limited by input
7. Iterates for precision = often
8. Precision over accuracy = no
9. Accuracy over precision = no

6.4 See also

• Distributed sensor network

6.5 References

[1] Richard R. Brooks and S. Sitharama Iyengar (June 1996). “Robust Distributed Computing and Sensing Algorithm”. Computer (IEEE) 29 (6): pp. 53–60. doi:10.1109/2.507632. ISSN 0018-9162. Retrieved 2010-03-22.

[2] Mohammad Ilyas, Imad Mahgoub (July 28, 2004). Handbook of sensor networks: compact wireless and wired sensing systems (PDF). CRC Press. p. 864. ISBN 978-0-8493-1968-6. Retrieved 2010-03-22.

[3] D. Dolev (Jan 1982). “The Byzantine Generals Strike Again” (PDF). J. Algorithms 3 (1): pp. 14–30. doi:10.1016/0196-6774(82)90004-9. Retrieved 2010-03-22.

[4] L. Lamport, R. Shostak, M. Pease (July 1982). “The Byzantine Generals Problem”. Transactions on Programming Languages and Systems (ACM) 4 (3): pp. 382–401. doi:10.1145/357172.357176. Retrieved 2010-03-22.

[5] D. Dolev et al. (July 1986). “Reaching Approximate Agreement in the Presence of Faults” (PDF). Journal of the ACM (JACM) (ACM Press) 33 (3): pp. 499–516. doi:10.1145/5925.5931. ISSN 0004-5411. Retrieved 2010-03-23.

[6] S. Mahaney and F. Schneider (1985). “Inexact Agreement: Accuracy, Precision, and Graceful Degradation”. Proc. Fourth ACM Symp. Principles of Distributed Computing (ACM Press, New York): pp. 237–249. CiteSeerX: 10.1.1.20.6337.

[7] Sartaj Sahni and Xiaochun Xu (September 7, 2004). “Algorithms For Wireless Sensor Networks” (PDF). University of Florida, Gainesville. Retrieved 2010-03-23.

Chapter 7

Byzantine fault tolerance

“Byzantine generals” redirects here. For actual military generals of the Byzantine empire, see Category:Byzantine generals.

In fault-tolerant computer systems, and in particular distributed computing systems, Byzantine fault tolerance is the characteristic of a system that tolerates the class of failures known as the Byzantine Generals’ Problem,[1] which is a generalized version of the Two Generals’ Problem. The phrases interactive consistency or source congruency have been used to refer to Byzantine fault tolerance, particularly among the members of some early implementation teams.[2] The objective of Byzantine fault tolerance is to be able to defend against Byzantine failures, in which components of a system fail with symptoms that prevent some components of the system from reaching agreement among themselves, where such agreement is needed for the correct operation of the system. Correctly functioning components of a Byzantine fault tolerant system will be able to provide the system’s service assuming there are not too many faulty components. The following practical, concise definitions are helpful in understanding Byzantine fault tolerance:[3] [4]

Byzantine fault Any fault presenting different symptoms to different observers

Byzantine failure The loss of a system service due to a Byzantine fault in systems that require consensus

The terms fault and failure are used here according to the standard definitions[5] originally created by a joint committee on “Fundamental Concepts and Terminology” formed by the IEEE Computer Society's Technical Committee on Dependable Computing and Fault-Tolerance and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance.[6] A version of these definitions is also described in the Dependability Wikipedia page. Note that the type of system services which Byzantine faults affect are agreement (a.k.a. consensus) services.

7.1 Origin

Byzantine refers to the Byzantine Generals' Problem, an agreement problem (described by Leslie Lamport, Robert Shostak and Marshall Pease in their 1982 paper, “The Byzantine Generals Problem”)[1] in which a group of generals, each commanding a portion of the Byzantine army, encircle a city. These generals wish to formulate a plan for attacking the city. In its simplest form, the generals must only decide whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The important thing is that every general agrees on a common decision, for a halfhearted attack by a few generals would become a rout and be worse than a coordinated attack or a coordinated retreat.

The problem is complicated by the presence of traitorous generals who may not only cast a vote for a suboptimal strategy, but may do so selectively. For instance, if nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general may send a vote of retreat to those generals in favor of retreat, and a vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack (which may not go well for the attackers). The problem is complicated further by the generals being physically separated and having to send their votes via messengers who may fail to deliver votes or may forge false votes.

Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals have a unanimous agreement on their strategy. Note that there can be a default vote value given to missing messages; for example, a missing message can be treated as a null vote. Further, if the agreement is that the null votes are in the majority, a pre-assigned default strategy can be used (e.g., retreat).

The typical mapping of this story onto computer systems is that the computers are the generals and their digital communication system links are the messengers.
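The nine-general scenario above is easy to simulate. The sketch below is a toy illustration (the tallying rule and the way the traitor chooses its messages are choices made for this example): the traitorous ninth general sends a different vote to each camp, and loyal generals who simply act on the majority of the votes they personally received end up split.

# Toy simulation of the nine-general example: generals 1-4 vote attack,
# generals 5-8 vote retreat, and the traitorous general 9 tells each loyal
# general whatever that general already prefers.
honest_votes = {g: "attack" for g in range(1, 5)}
honest_votes.update({g: "retreat" for g in range(5, 9)})

def traitor_vote_to(recipient):
    return honest_votes[recipient]        # echo the recipient's own preference

decisions = {}
for g, own_vote in honest_votes.items():
    received = [v for other, v in honest_votes.items() if other != g]
    received.append(traitor_vote_to(g))   # the forged vote from general 9
    received.append(own_vote)             # count one's own vote as well
    decisions[g] = max(("attack", "retreat"), key=received.count)

print(decisions)   # generals 1-4 attack while 5-8 retreat: no agreement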

7.2 Known examples of Byzantine failures

Several examples of Byzantine failures that have occurred are given in two equivalent journal papers.[3][4] These and other examples are described on the NASA DASHlink web pages.[7] These web pages also describe some phenomenology that can cause Byzantine faults. Byzantine errors were observed infrequently and at irregular points during endurance testing for the New Virginia Class submarine.[8]

7.3 Early solutions

Several solutions were described by Lamport, Shostak, and Pease in 1982.[1] They began by noting that the Generals’ Problem can be reduced to solving a “Commander and Lieutenants” problem where Loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is Loyal.

• One solution considers scenarios in which messages may be forged, but which will be Byzantine-fault-tolerant as long as the number of traitorous generals does not equal or exceed one third of the generals. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the one Commander and two Lieutenants problem cannot be solved if the Commander is traitorous. To see this, suppose we have a traitorous Commander A and two Lieutenants, B and C: when A tells B to attack and C to retreat, and B and C send messages to each other forwarding A's message, neither B nor C can figure out who the traitor is, since it is not necessarily A; another Commander could have forged the message purportedly from A. It can be shown that if n is the number of generals in total, and t is the number of traitors in that n, then there are solutions to the problem only when n > 3t and the communication is synchronous (bounded delay)[9] (see the sketch after this list).

• A second solution requires unforgeable message signatures. For security-critical systems, digital signatures (in modern computer systems, this may be achieved in practice using public-key cryptography) can provide Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals. However, for safety-critical systems, simple error detecting codes, such as CRCs, provide the same or better coverage at a much lower cost. This is true for both Byzantine and non-Byzantine faults. Thus, cryptographic digital signature methods are not a good choice for safety-critical systems, unless there is also a specific security threat.[10] While error detecting codes, such as CRCs, are better than cryptographic techniques, neither provides adequate coverage for active electronics in safety-critical systems. This is illustrated by the Schrödinger CRC scenario, where a CRC-protected message with a single Byzantine faulty bit presents different data to different observers and each observer sees a valid CRC.[3][4]

• Also presented is a variation on the first two solutions allowing Byzantine-fault-tolerant behavior in some situations where not all generals can communicate directly with each other.
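
A minimal Python sketch of the recursive oral-messages idea behind the first solution follows. The value names, the tie-breaking default, and the traitor behaviour (a traitor sends conflicting orders depending on which receiver is listening) are assumptions of this sketch, not details from the original paper.

from collections import Counter

DEFAULT = "RETREAT"   # pre-assigned default strategy for ties and missing values

def majority(values):
    # Majority of the received values; fall back to the default on a tie.
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return DEFAULT
    return ranked[0][0]

def om(m, commander, lieutenants, order, traitors):
    # Recursive oral-messages round OM(m); returns each lieutenant's decision.
    def send(src, dst, value):
        # Toy traitor model: a traitor sends conflicting orders depending on
        # which lieutenant is listening.
        if src in traitors:
            return "ATTACK" if dst % 2 == 0 else "RETREAT"
        return value

    received = {l: send(commander, l, order) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant relays the order it received to the others via OM(m-1).
    relayed = {l: om(m - 1, l, [x for x in lieutenants if x != l],
                     received[l], traitors) for l in lieutenants}
    return {l: majority([received[l]] +
                        [relayed[other][l] for other in lieutenants if other != l])
            for l in lieutenants}

# Four generals, one traitorous commander (id 0): n > 3t, so the loyal
# lieutenants 1, 2 and 3 all reach the same decision despite the conflicting orders.
print(om(1, commander=0, lieutenants=[1, 2, 3], order="ATTACK", traitors={0}))

With only three generals (one commander and two lieutenants), the same construction can leave a loyal lieutenant unable to follow a loyal commander's order, which is the n > 3t boundary described above.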

Several system architectures were designed c. 1980 that implemented Byzantine fault tolerance. These include: Draper’s FTMP,[11] Honeywell’s MMFCS,[12] and SRI’s SIFT.[13]

7.4 Practical Byzantine fault tolerance

In 1999, Miguel Castro and Barbara Liskov introduced the “Practical Byzantine Fault Tolerance” (PBFT) algorithm,[14] which provides high-performance Byzantine state machine replication, processing thousands of requests per second with sub-millisecond increases in latency. PBFT triggered a renaissance in Byzantine fault tolerant replication research, with protocols like Q/U,[15] HQ,[16] Zyzzyva,[17] and ABsTRACTs [18] working to lower costs and improve performance and protocols like Aardvark[19] and RBFT[20] working to improve robustness.

7.5 Byzantine fault tolerance software

UpRight[21] is an open source library for constructing services that tolerate both crashes (“up”) and Byzantine behaviors (“right”) and that incorporates many of these protocols’ innovations. In addition to PBFT and UpRight, there is the BFT-SMaRt library,[22] a high-performance Byzantine fault-tolerant state machine replication library developed in Java. This library implements a protocol very similar to PBFT’s, plus complementary protocols which offer state transfer and on-the-fly reconfiguration of hosts. BFT-SMaRt is the most recent effort to implement state machine replication and is still actively maintained. Archistar[23] utilizes a slim BFT layer[24] for communication. It prototypes a secure multi-cloud storage system in Java, licensed under LGPLv2. Its focus is on simplicity and readability; it aims to be the foundation for further research projects.

7.6 Byzantine fault tolerance in practice

One example of BFT in use is Bitcoin, a peer-to-peer digital currency system. The Bitcoin network works in parallel to generate a chain of Hashcash style proof-of-work. The proof-of-work chain is the key to overcome Byzantine failures and to reach a coherent global view of the system state.

Some aircraft systems, such as the Boeing 777 Aircraft Information Management System (via its ARINC 659 SAFEbus® network),[25][26] the Boeing 777 flight control system,[27] and the Boeing 787 flight control systems, use Byzantine fault tolerance. Because these are real-time systems, their Byzantine fault tolerance solutions must have very low latency. For example, SAFEbus can achieve Byzantine fault tolerance with on the order of a microsecond of added latency.

Some spacecraft such as the SpaceX Dragon flight system and the NASA Crew Exploration Vehicle consider Byzantine fault tolerance in their design.

Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms. For systems that have a high degree of safety or security criticality, these assumptions must be proven to be true to an acceptable level of fault coverage. When providing proof through testing, one difficulty is creating a sufficiently wide range of signals with Byzantine symptoms.[28] Such testing likely will require specialized fault injectors.[29][30]

7.7 See also

• Atomic commit

• Brooks–Iyengar algorithm

• Byzantine Paxos

• Consensus (computer science)

• Quantum Byzantine agreement

7.8 References

[1] Lamport, L.; Shostak, R.; Pease, M. (1982). “The Byzantine Generals Problem” (PDF). ACM Transactions on Programming Languages and Systems 4 (3): 382–401. doi:10.1145/357172.357176.

[2] Kirrmann, Hubert (n.d.). “Fault Tolerant Computing in Industrial Automation” (PDF). CH-5405 Baden, Switzerland: ABB Research Center. p. 94. Retrieved 2015-03-02.

[3] Driscoll, K.; Hall, B.; Paulitsch, M.; Zumsteg, P.; Sivencrona, H. (2004). “The Real Byzantine Generals”. pp. 6.D.4–61–11. doi:10.1109/DASC.2004.1390734.

[4] Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). “Byzantine Fault Tolerance, from Theory to Reality” 2788. pp. 235–248. doi:10.1007/978-3-540-39878-3_19. ISSN 0302-9743.

[5] Avizienis, A.; Laprie, J.-C.; Randell, B.; Landwehr, C. (2004). “Basic concepts and taxonomy of dependable and secure computing”. IEEE Transactions on Dependable and Secure Computing 1 (1): 11–33. doi:10.1109/TDSC.2004.2. ISSN 1545-5971.

[6] “Dependable Computing and Fault Tolerance”. Retrieved 2015-03-02.

[7] Driscoll, Kevin (2012-12-11). “Real System Failures”. DASHlink. NASA. Retrieved 2015-03-02.

[8] Walter, C.; Ellis, P.; LaValley, B. (2005). “The Reliable Platform Service: A Property-Based Fault Tolerant Service Architecture”. pp. 34–43. doi:10.1109/HASE.2005.23.

[9] Feldman, P.; Micali, S. (1997). “An optimal probabilistic protocol for synchronous Byzantine agreement” (PDF). SIAM J. Computing 26 (4): 873–933. doi:10.1137/s0097539790187084.

[10] Paulitsch, M.; Morris, J.; Hall, B.; Driscoll, K.; Latronico, E.; Koopman, P. (2005). “Coverage and the Use of Cyclic Redundancy Codes in Ultra-Dependable Systems”. pp. 346–355. doi:10.1109/DSN.2005.31.

[11] Hopkins, Albert L.; Lala, Jaynarayan H.; Smith, T. Basil (1987). “The Evolution of Fault Tolerant Computing at the Charles Stark Draper Laboratory, 1955–85” 1. pp. 121–140. doi:10.1007/978-3-7091-8871-2_6. ISSN 0932-5581.

[12] Driscoll, Kevin; Papadopoulos, Gregory; Nelson, Scott; Hartmann, Gary; Ramohalli, Gautham (1984), Multi-Microprocessor Flight Control System (Technical Report), Wright-Patterson Air Force Base, OH 45433, USA: AFWAL/FIGL U.S. Air Force Systems Command, AFWAL-TR-84-3076

[13] “SIFT: design and analysis of a fault-tolerant computer for aircraft control”. Microelectronics Reliability 19 (3): 190. 1979. doi:10.1016/0026-2714(79)90211-7. ISSN 0026-2714.

[14] Castro, M.; Liskov, B. (2002). “Practical Byzantine Fault Tolerance and Proactive Recovery”. ACM Transactions on Computer Systems (Association for Computing Machinery) 20 (4): 398–461. doi:10.1145/571637.571640. CiteSeerX: 10.1.1.127.6130.

[15] Abd-El-Malek; Ganger, G.; Goodson, G.; Reiter, M.; Wylie, J. (2005). “Fault-scalable Byzantine Fault-Tolerant Services”. Association for Computing Machinery. doi:10.1145/1095809.1095817.

[16] Cowling, James; Myers, Daniel; Liskov, Barbara; Rodrigues, Rodrigo; Shrira, Liuba (2006). HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance. Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. pp. 177–190. ISBN 1-931971-47-1.

[17] Kotla, Ramakrishna; Alvisi, Lorenzo; Dahlin, Mike; Clement, Allen; Wong, Edmund (December 2009). “Zyzzyva: Speculative Byzantine Fault Tolerance”. ACM Transactions on Computer Systems (Association for Computing Machinery) 27 (4). doi:10.1145/1658357.1658358.

[18] Guerraoui, Rachid; Kneževic, Nikola; Vukolic, Marko; Quéma, Vivien (2010). The Next 700 BFT Protocols. Proceedings of the 5th European conference on Computer systems. EuroSys.

[19] Clement, A.; Wong, E.; Alvisi, L.; Dahlin, M.; Marchetti, M. (April 22–24, 2009). Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults (PDF). Symposium on Networked Systems Design and Implementation. USENIX.

[20] Aublin, P.-L.; Ben Mokhtar, S.; Quéma, V. (July 8–11, 2013). RBFT: Redundant Byzantine Fault Tolerance. 33rd IEEE International Conference on Distributed Computing Systems. International Conference on Distributed Computing Systems.

[21] UpRight. Google Code repository for the UpRight replication library.

[22] BFT-SMaRt. Google Code repository for the BFT-SMaRt replication library.

[23] Archistar. github repository for the Archistar project.

[24] Archistar-bft BFT state-machine. github repository for the Archistar project.

[25] Paulitsch, M.; Driscoll, K. (9 January 2015). “Chapter 48: SAFEbus”. In Zurawski, Richard. Industrial Communication Technology Handbook, Second Edition. CRC Press. pp. 48–1–48–26. ISBN 978-1-4822-0733-0.

[26] Thomas A. Henzinger; Christoph M. Kirsch (26 September 2001). Embedded Software: First International Workshop, EMSOFT 2001, Tahoe City, CA, USA, October 8-10, 2001. Proceedings (PDF). Springer Science & Business Media. pp. 307–. ISBN 978-3-540-42673-8.

[27] Yeh, Y.C. (2001). “Safety critical avionics for the 777 primary flight controls system” 1. pp. 1C2/1–1C2/11. doi:10.1109/DASC.2001.963311.

[28] Nanya, T.; Goosen, H.A. (1989). “The Byzantine hardware fault model”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 8 (11): 1226–1231. doi:10.1109/43.41508. ISSN 0278-0070.

[29] Martins, Rolando; Gandhi, Rajeev; Narasimhan, Priya; Pertet, Soila; Casimiro, António; Kreutz, Diego; Veríssimo, Paulo (2013). “Experiences with Fault-Injection in a Byzantine Fault-Tolerant Protocol” 8275. pp. 41–61. doi:10.1007/978-3-642-45065-5_3. ISSN 0302-9743.

[30] US patent 7475318, Kevin R. Driscoll, “Method for testing the sensitive input range of Byzantine filters”, issued 2009-01- 06, assigned to Honeywell International Inc.

7.9 External links

• Ocean Store replicates data with a Byzantine fault tolerant commit protocol.

• Practical Byzantine Fault Tolerance

• Byzantine Fault Tolerance in the RKBExplorer

• UpRight is an open source library for Crash-tolerant and Byzantine-tolerant state machine replication.

Chapter 8

Clock synchronization

Clock synchronization is a problem from computer science and engineering which deals with the idea that internal clocks of several computers may differ. Even when initially set accurately, real clocks will differ after some amount of time due to clock drift, caused by clocks counting time at slightly different rates. There are several problems that occur as a repercussion of clock rate differences and several solutions, some being more appropriate than others in certain contexts.[1] In serial communication, some people[2] use the term “clock synchronization” merely to discuss getting one metronome-like clock signal to pulse at the same frequency as another one – frequency synchronization (plesiochronous or isochronous operation), as opposed to full phase synchronization (synchronous operation). Such “clock synchronization” is used in synchronization in telecommunications and automatic baud rate detection.
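
As a rough illustration of how quickly drift accumulates, a real clock can be modeled as running at a slightly wrong rate; the drift figures below are illustrative, not measurements.

def clock_reading(true_elapsed_seconds, initial_offset, drift_ppm):
    # Simple clock model: reading = offset + (1 + drift) * elapsed true time,
    # with the rate error expressed in parts per million.
    return initial_offset + (1 + drift_ppm * 1e-6) * true_elapsed_seconds

# Two clocks set perfectly at t = 0 but running 50 ppm fast and 20 ppm slow
# disagree by roughly six seconds after a single day.
day = 86_400
print(clock_reading(day, 0.0, +50) - clock_reading(day, 0.0, -20))   # about 6.05 s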

8.1 Problems

Besides the incorrectness of the time itself, there are problems associated with clock skew that take on more complexity in a distributed system in which several computers will need to realize the same global time. For instance, in Unix systems the make command is used to compile new or modified code without the need to recompile unchanged code. The make command uses the clock of the machine it runs on to determine which source files need to be recompiled. If the sources reside on a separate file server and the two machines have unsynchronized clocks, the make program might not produce the correct results.[3]

8.2 Solutions

In a centralized system the solution is trivial; the centralized server will dictate the system time. Cristian’s algorithm and the Berkeley algorithm are some solutions to the clock synchronization problem in a centralized server environment. In a distributed system the problem takes on more complexity because a global time is not easily known. The most used clock synchronization solution on the Internet is the Network Time Protocol (NTP), which is a layered client-server architecture based on UDP message passing. Lamport timestamps and vector clocks are concepts of logical clocks in distributed systems.

8.2.1 Cristian’s algorithm

Main article: Cristian’s algorithm

Cristian’s algorithm relies on the existence of a time server.[4] The time server maintains its clock by using a radio clock or other accurate time source, then all other computers in the system stay synchronized with it. A time client will maintain its clock by making a procedure call to the time server. Variations of this algorithm make more precise time calculations by factoring in network radio propagation time.
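
A minimal sketch of the idea, assuming a hypothetical request_server_time() remote call that returns the time server's clock reading:

import time

def cristian_estimate(request_server_time):
    # Estimate the current server time, compensating for network delay.
    t_start = time.monotonic()
    server_time = request_server_time()      # remote call to the time server
    t_end = time.monotonic()
    round_trip = t_end - t_start
    # Assume the request and the reply spent roughly equal time in transit,
    # so the returned reading is about half a round trip old.
    return server_time + round_trip / 2.0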


8.2.2 Berkeley algorithm

Main article: Berkeley algorithm

This algorithm is more suitable for systems where a radio clock is not present; since such a system has no way of determining the actual time, it maintains a global average time as the global time. A time server will periodically fetch the time from all the time clients, average the results, and then report back to the clients the adjustment that needs to be made to their local clocks to achieve the average. This algorithm highlights the fact that internal clocks may vary not only in the time they contain but also in the clock rate. Often, any client whose clock differs by a value outside of a given tolerance is disregarded when averaging the results. This prevents the overall system time from being drastically skewed due to one erroneous clock.
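
A minimal sketch of one polling round; the participant names, readings and tolerance below are illustrative.

def berkeley_round(server_time, client_times, tolerance):
    # Return the clock adjustment each participant should apply. Readings that
    # differ from the server's by more than `tolerance` are ignored when averaging.
    readings = {"server": server_time, **client_times}
    usable = [t for t in readings.values() if abs(t - server_time) <= tolerance]
    average = sum(usable) / len(usable)
    return {node: average - reading for node, reading in readings.items()}

# The client that is 25 seconds off is excluded from the average, but still
# receives an adjustment that brings it to the agreed time.
print(berkeley_round(100.0, {"a": 102.0, "b": 99.0, "c": 125.0}, tolerance=10.0))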

8.2.3 Network Time Protocol

Main article: Network Time Protocol

This algorithm is a class of mutual network synchronization protocol that allows for user-selectable policy control in the design of the time synchronization and evidence model. NTP supports both operating models in which a clearly defined master source of time is used and ones in which no penultimate master or reference clocks are needed. In NTP service topologies based on peering, all clocks equally participate in the synchronization of the network by exchanging their timestamps using regular beacon packets. In addition, NTP supports a unicast type of time transfer which provides a higher level of security. NTP performance is tunable based on its application and environmental loading as well. NTP combines a number of algorithms to robustly select and compare clocks, together with a combination of linear and decision-based control loop feedback models that allows multiple time synchronization probes to be combined over long time periods to produce high quality timing and clock drift estimates. Because NTP allows arbitrary synchronization mesh topologies, and can withstand (up to a point) both the loss of connectivity to other nodes and “falsetickers” that do not give consistent time, it is also robust against failure and misconfiguration of other nodes in the synchronization mesh. NTP is highly robust, widely deployed throughout the Internet, and well tested over the years, and is generally regarded as the state of the art in distributed time synchronization protocols for unreliable networks. It can reduce synchronization offsets to times of the order of a few milliseconds over the public Internet, and to sub-millisecond levels over local area networks. A simplified version of the NTP protocol, SNTP, can also be used as a pure single-shot stateless master-slave synchronization protocol, but lacks the sophisticated features of NTP, and thus has much lower performance and reliability levels.
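
The core of an NTP-style exchange is the standard four-timestamp calculation of clock offset and round-trip delay; the helper name and example timestamps below are illustrative.

def ntp_offset_and_delay(t1, t2, t3, t4):
    # t1: client transmit, t2: server receive, t3: server transmit, t4: client receive.
    # Returns (estimated offset of the server clock relative to the client's,
    # round-trip network delay).
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

# A single probe: the server appears about 0.105 s ahead, with 0.01 s of delay.
print(ntp_offset_and_delay(10.000, 10.110, 10.112, 10.012))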

8.2.4 Clock Sampling Mutual Network Synchronization

CS-MNS is suitable for distributed and mobile applications. It has been shown to be scalable over mesh networks that include indirectly linked non-adjacent nodes, and is compatible with IEEE 802.11 and similar standards. It can be accurate to the order of a few microseconds, but requires direct physical wireless connectivity with negligible link delay (less than 1 microsecond) on links between adjacent nodes, limiting the distance between neighboring nodes to a few hundred meters.[5]

8.2.5 Precision Time Protocol

Main article: Precision Time Protocol

A master/slave protocol for delivery of highly accurate time over local area networks.

8.2.6 Reference broadcast synchronization

Main article: Reference Broadcast Time Synchronization

The Reference Broadcast Synchronization (RBS) algorithm is often used in wireless networks and sensor networks. In this scheme, an initiator broadcasts a reference message to urge the receivers to adjust their clocks.

8.2.7 Reference Broadcast Infrastructure Synchronization

Main article: Reference Broadcast Infrastructure Synchronization

The Reference Broadcast Infrastructure Synchronization (RBIS)[6] protocol is a master/slave synchronization protocol based on the receiver/receiver synchronization paradigm, like RBS. It is specifically tailored to be used in IEEE 802.11 Wi-Fi networks configured in infrastructure mode (i.e., coordinated by an access point). The protocol does not require any modification to the access point.

8.2.8 Global Positioning System

Main article: Global Positioning System

The Global Positioning System can also be used for clock synchronization. The accuracy of GPS time signals is ±10 ns[7] and is second only to the atomic clocks upon which they are based.

8.3 See also

• International Atomic Time

• Network Identity and Time Zone

• Network Time Protocol

• Precision Time Protocol

• Vector clocks

• Einstein synchronisation

8.4 External Links

• Accurate time vs. PC Clock Difference

8.5 References

[1] Tanenbaum, Andrew S.; Maarten van Steen (2002), Distributed Systems : Principles and Paradigms, Prentice Hall, ISBN 0-13-088893-1

[2] “Transmission on a Serial Line”

[3] GNU `make'

[4] Cristian, F. (1989), “Probabilistic clock synchronization”, Distributed Computing (Springer) 3 (3): 146–158, doi:10.1007/BF01784024

[5] Rentel, C.; Kunz, T. (March 2005), “A clock-sampling mutual network synchronization algorithm for wireless ad hoc networks”, IEEE Wireless Communications and Networking Conference (IEEE Press) 1: 638–644, doi:10.1109/WCNC.2005.1424575, ISBN 0-7803-8966-2

[6] Cena, G.; Scanzio, S.; Valenzano, A.; Zunino, C. (June 2015), “Implementation and Evaluation of the Reference Broadcast Infrastructure Synchronization Protocol”, IEEE Transactions on Industrial Informatics (IEEE Press) 11: 801–811, doi:10.1109/TII.2015.2396003

[7] “Common View GPS Time Transfer”. nist.gov. Archived from the original on 2012-10-28. Retrieved 2011-07-23.

Chapter 9

Consensus (computer science)

A fundamental problem in distributed computing is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation. Examples of applications of consensus include whether to commit a transaction to a database, agreeing on the identity of a leader, state machine replication, and atomic broadcasts.

9.1 Problem description

The consensus problem requires agreement among a number of processes on a single data value. Some of the processes may fail or be unreliable in other ways, so consensus protocols must be fault tolerant. The processes must somehow put forth their candidate values, communicate with one another, and agree on a single consensus value. One approach to generating consensus is for all processes to agree on a majority value. In this context, a majority requires at least one more than half of the available votes (where each process is given a vote). However, one or more faulty processes may skew the resultant outcome such that consensus may not be reached, or may be reached incorrectly. Protocols that solve consensus problems are designed to deal with limited numbers of faulty processes. These protocols must satisfy a number of requirements to be useful. For instance, a trivial protocol could have all processes output the binary value 1. This is not useful; thus the requirement is modified such that the output must somehow depend on the input. That is, the output value of a consensus protocol must be the input value of some process. Another requirement is that a process may decide upon and output a value only once, and this decision is irrevocable. A process is called correct in an execution if it does not experience a failure. A consensus protocol tolerating halting failures must satisfy the following properties.

Termination Every correct process decides some value.

Validity If all processes propose the same value v , then all correct processes decide v .

Integrity Every correct process decides at most one value, and if it decides some value v , then v must have been proposed by some process.

Agreement Every correct process must agree on the same value.

A protocol that can correctly guarantee consensus amongst n processes of which at most t fail is said to be t-resilient. In evaluating the performance of consensus protocols two factors of interest are running time and message complexity. Running time is given in Big O notation in the number of rounds of message exchange as a function of some input parameters (typically the number of processes and/or the size of the input domain). Message complexity refers to the amount of message traffic that is generated by the protocol. Other factors may include memory usage and the size of messages.
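
As a small illustration of these definitions, the following sketch checks a finished execution against the four properties above; the dictionary representation of proposals and decisions is an assumption of the sketch.

def check_consensus_run(proposed, decided):
    # `proposed` / `decided` map each correct process to its input value and to
    # the single value it decided; returns which properties the run satisfies.
    termination = set(decided) == set(proposed)              # everyone decided
    agreement = len(set(decided.values())) <= 1              # one common value
    integrity = all(v in proposed.values() for v in decided.values())
    all_same_input = len(set(proposed.values())) == 1
    validity = (not all_same_input) or set(decided.values()) <= set(proposed.values())
    return {"termination": termination, "agreement": agreement,
            "integrity": integrity, "validity": validity}

# A run in which every process proposed 1 and decided 1 satisfies all four properties.
print(check_consensus_run({1: 1, 2: 1, 3: 1}, {1: 1, 2: 1, 3: 1}))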


9.2 Models of computation

There are two types of failures a process may undergo: a crash failure or a Byzantine failure. A crash failure occurs when a process abruptly stops and does not resume. Byzantine failures are failures in which absolutely no conditions are imposed. For example, they may occur as a result of the malicious actions of an adversary. A process that experiences a Byzantine failure may send contradictory or conflicting data to other processes, or it may sleep and then resume activity after a lengthy delay. Of the two types of failures, Byzantine failures are far more disruptive. Thus a consensus protocol tolerating Byzantine failures must be resilient to every possible error that can occur. A stronger version of consensus tolerating Byzantine failures is given below.

Termination Every correct process decides some value.

Validity If all correct processes propose the same value v , then all correct processes decide v .

Integrity If a correct process decides v , then v must have been proposed by some correct process.

Agreement Every correct process must agree on the same value.

Varying models of computation may define a consensus problem. Some models may deal with fully connected graphs while others may deal with rings and trees. Asynchronous versus synchronous models for message passing may be considered. In some models message authentication is allowed whereas in others processes are completely anonymous. Shared memory models in which processes communicate by accessing objects in shared memory are also an important area of research. A special case of the consensus problem called binary consensus restricts the input and hence the output domain to a single binary digit {0,1}. When the input domain is large relative to the number of processes, for instance an input set of all the natural numbers, it can be shown that consensus is impossible in a synchronous message passing model. While real world communications are often inherently asynchronous it is more practical and useful to model synchronous systems.[1] In a fully asynchronous message-passing distributed system in which one process may have a halting failure, it has been proved that consensus is impossible.[2] However, this impossibility result derives from a worst-case scenario of a process schedule which is highly unlikely. In reality, process scheduling has a degree of randomness.[1] In an asynchronous model some forms of failures can be handled by a synchronous consensus protocol. For instance, the loss of a communication link may be modeled as a process which has suffered a Byzantine failure. In synchronous systems it is assumed that all communications proceed in rounds. In one round a process may send all the messages it requires while receiving all messages from other processes. In this manner no message from one round may influence any messages sent within the same round.

9.3 Equivalency of agreement problems

Three agreement problems of interest are as follows.

9.3.1 Terminating Reliable Broadcast

Main article: Terminating Reliable Broadcast

A collection of n processes, numbered from 0 to n - 1, communicate by sending messages to one another. Process 0 must transmit a value v to all processes such that:

1. if process 0 is correct, then every correct process receives v

2. for any two correct processes, each process receives the same value.

It is also known as The General’s Problem.

9.3.2 Consensus

Formal requirements for a consensus protocol may include:

• Agreement: All correct processes must agree on the same value.

• Weak validity: If all correct processes receive the same input value, then they must all output that value.

• Strong validity: For each correct process, its output must be the input of some correct process.

• Termination: All processes must eventually decide on an output value

9.3.3 Weak Interactive Consistency

For n processes in a partially synchronous system (the system alternates between good and bad periods of synchrony), each process chooses a private value. The processes communicate with each other by rounds to determine a public value and generate a consensus vector with the following requirements:[3]

1. if a correct process sends v , then all correct processes receive either v or nothing (integrity property)

2. all messages sent in a round by a correct process are received in the same round by all correct processes (consistency property).

It can be shown that variations of these problems are equivalent in that the solution for a problem in one type of model may be the solution for another problem in another type of model. For example, a solution to the Weak Byzantine General problem in a synchronous authenticated message passing model leads to a solution for Weak Interactive Consistency.[4] An interactive consistency algorithm can solve the consensus problem by having each process choose the majority value in its consensus vector as its consensus value.[5]

9.4 Solvability results for some agreement problems

There is a t-resilient anonymous synchronous protocol which solves the Byzantine Generals problem[6][7] and the Weak Byzantine Generals case[4] if and only if t/n < 1/3, where t is the number of failures and n is the number of processes. For a system of 3 processors with one of them Byzantine, there is no solution for the consensus problem in a synchronous message passing model with binary inputs.[8] In a fully asynchronous system there is no consensus solution that can tolerate one or more crash failures, even when only requiring the non-triviality property.[2] This result is sometimes called the FLP impossibility proof. The authors Michael J. Fischer, Nancy Lynch, and Mike Paterson were awarded a Dijkstra Prize for this significant work. The FLP result does not state that consensus can never be reached: merely that under the model’s assumptions, no algorithm can always reach consensus in bounded time. In practice such a schedule is highly unlikely to occur.

9.5 Some consensus protocols

An example of a polynomial time binary consensus protocol that tolerates Byzantine failures is the Phase King algorithm[9] by Garay and Berman. The algorithm solves consensus in a synchronous message passing model with n processes and up to f failures, provided n > 4f. In the phase king algorithm, there are f + 1 phases, with 2 rounds per phase. Each process keeps track of its preferred output (initially equal to the process’s own input value). In the first round of each phase each process broadcasts its own preferred value to all other processes. It then receives the values from all processes and determines which value is the majority value and its count. In the second round of the phase, the process whose id matches the current phase number is designated the king of the phase. The king broadcasts the majority value it observed in the first round and serves as a tie breaker. Each process then updates its preferred value as follows. If the count of the majority value the process observed in the first round is greater than n/2 + f, the process changes its preference to that majority value; otherwise it uses the phase king’s value. At the end of f + 1 phases the processes output their preferred values.
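
The following Python sketch simulates the algorithm as described above; the random misbehaviour of the Byzantine processes and the binary inputs are illustrative assumptions.

from collections import Counter
import random

def phase_king(inputs, byzantine, seed=0):
    # Simulate Phase King: `inputs` maps process id (0..n-1) to its binary input,
    # `byzantine` is the set of faulty ids (here they send random bits).
    # Correctness requires n > 4f.
    rng = random.Random(seed)
    n, f = len(inputs), len(byzantine)
    pref = dict(inputs)                                   # current preferences

    def sent(sender, true_value):
        return rng.randint(0, 1) if sender in byzantine else true_value

    for phase in range(f + 1):                            # f + 1 phases, 2 rounds each
        # Round 1: everyone broadcasts its preference; each process records
        # the majority value it saw and how many times it saw it.
        seen_value, seen_count = {}, {}
        for p in inputs:
            received = [sent(q, pref[q]) for q in inputs]
            (value, count), = Counter(received).most_common(1)
            seen_value[p], seen_count[p] = value, count
        # Round 2: the process whose id equals the phase number is the king
        # and broadcasts the majority value it observed as a tie breaker.
        king = phase
        for p in inputs:
            if p in byzantine:
                continue
            king_value = sent(king, seen_value[king])
            pref[p] = seen_value[p] if seen_count[p] > n / 2 + f else king_value
    return {p: v for p, v in pref.items() if p not in byzantine}

# n = 5 processes and f = 1 Byzantine process satisfies n > 4f; all four correct
# processes end up with the same bit.
print(phase_king({0: 1, 1: 1, 2: 0, 3: 0, 4: 1}, byzantine={2}))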

Google has implemented a distributed lock service library called Chubby.[10] Chubby maintains lock information in small files which are stored in a replicated database to achieve high availability in the face of failures. The database is implemented on top of a fault-tolerant log layer which is based on the Paxos consensus algorithm. In this scheme, Chubby clients communicate with the Paxos master in order to access/update the replicated log; i.e., read/write to the files.[11] Bitcoin uses proof of work to maintain consensus in its peer-to-peer network. Nodes in the bitcoin network attempt to solve a cryptographic proof-of-work problem, where the probability of finding the solution is proportional to the computational effort, in hashes per second, expended, and the node that solves the problem has its version of the block of transactions added to the peer-to-peer distributed timestamp server accepted by all of the other nodes. As any node in the network can attempt to solve the proof-of-work problem, a Sybil attack becomes infeasible unless the attacker has over 50% of the computational resources of the network.
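
As a rough illustration of the proof-of-work mechanism, here is a hashcash-style sketch; it uses a single SHA-256 and a leading-zero-bits target, which simplifies Bitcoin's actual double-hash and compact-target encoding.

import hashlib

def mine(header: bytes, difficulty_bits: int):
    # Search for a nonce whose hash falls below the target; the expected number
    # of attempts doubles with each extra difficulty bit, which is what ties the
    # probability of success to computational effort (hashes per second).
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1

nonce, digest = mine(b"prev-block-hash|merkle-root|timestamp", difficulty_bits=18)
print(nonce, digest)   # the digest starts with at least 18 zero bits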

• Chandra-Toueg consensus algorithm

• Randomized consensus

• Raft consensus algorithm

Many peer-to-peer online real-time strategy games use a modified lockstep protocol as a consensus protocol in order to manage game state between players in a game. Each game action results in a game state delta broadcast to all other players in the game along with a hash of the total game state. Each player validates the change by applying the delta to their own game state and comparing the game state hashes. If the hashes do not agree then a vote is cast, and those players whose game state is in the minority are disconnected and removed from the game (known as a desync).
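
A minimal sketch of the hash-comparison step described above; the JSON-serializable game state and the helper names are illustrative assumptions.

import hashlib
import json

def state_hash(game_state: dict) -> str:
    # Deterministic hash of the full game state, broadcast alongside each delta.
    canonical = json.dumps(game_state, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def apply_remote_action(game_state: dict, delta: dict, expected_hash: str) -> bool:
    # Apply a peer's state delta; a hash mismatch signals a desync.
    game_state.update(delta)
    return state_hash(game_state) == expected_hash

state = {"tick": 41, "units": {"p1": 12, "p2": 9}}
expected = state_hash({"tick": 42, "units": {"p1": 12, "p2": 9}})
print(apply_remote_action(state, {"tick": 42}, expected))   # True: the states agree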

9.6 Applications of consensus protocols

One important application of consensus protocols is to provide synchronization. Traditional methods of concurrent access to shared data objects implement some form of mutual exclusion through locks. However the drawback is that if a process dies while in its critical section, other correct processes may never acquire the lock. Thus mutual exclusion is poorly suited to asynchronous fault tolerant systems. A wait-free implementation of a data object supporting concurrent accesses guarantees that any process can complete its execution within a finite number of steps independent of the behavior of other processes. Atomic objects such as read/write registers have been proposed for the implementation of wait free synchronization. However it has been shown that such objects as well as traditional primitives such as test&set and fetch&add cannot be used for such an implementation.[12]

9.7 See also

• Uniform Consensus

• Quantum Byzantine agreement

• Byzantine fault tolerance

9.8 References

[1] Aguilera, M. K. (2010). “Stumbling over Consensus Research: Misunderstandings and Issues”. Lecture Notes in Computer Science 5959. pp. 59–72. doi:10.1007/978-3-642-11294-2_4. ISBN 978-3-642-11293-5.

[2] Fischer, M. J.; Lynch, N. A.; Paterson, M. S. (1985). “Impossibility of distributed consensus with one faulty process” (PDF). Journal of the ACM 32 (2): 374–382. doi:10.1145/3149.214121.

[3] Milosevic, Zarko; Martin Hutle; Andre Schiper (2009). “Unifying Byzantine Consensus Algorithms with Weak Interactive Consistency”. Principles of Distributed Systems, Lecture Notes in Computer Science 5293: 300–314. doi:10.1007/978-3-642-10877-8_24.

[4] Lamport, L. (1983). “The Weak Byzantine Generals Problem”. Journal of the ACM 30 (3): 668. doi:10.1145/2402.322398.

[5] Fischer, Michael J. “The Consensus Problem in Unreliable Distributed Systems (A Brief Survey)" (PDF). Retrieved 21 April 2014.

[6] Lamport, L.; Shostak, R.; Pease, M. (1982). “The Byzantine Generals Problem” (PDF). ACM Transactions on Programming Languages and Systems 4 (3): 382–401. doi:10.1145/357172.357176.

[7] Lamport, Leslie; Marshall Pease; Robert Shostak (April 1980). “Reaching Agreement in the Presence of Faults” (PDF). Journal of the ACM 27 (2): 228–234. doi:10.1145/322186.322188. Retrieved 2007-07-25.

[8] Attiya, Hagit (2004). Distributed Computing 2nd Ed. Wiley. pp. 101–103. ISBN 978-0-471-45324-6.

[9] Berman, Piotr; Juan A. Garay. “Cloture Votes: n/4-resilient Distributed Consensus in t + 1 rounds”. Theory of Computing Systems. 2 26: 3–19. doi:10.1007/BF01187072. Retrieved 19 December 2011.

[10] Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems (PDF). Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association Berkeley, CA, USA. pp. 335–350.

[11] C., Tushar; Griesemer, R; Redstone J. (2007). Paxos Made Live - An Engineering Perspective (PDF). Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing. Portland, Oregon, USA: ACM Press New York, NY, USA. pp. 398–407. doi:10.1145/1281100.1281103. Retrieved 2008-02-06.

[12] Herlihy, Maurice. “Wait-Free Synchronization” (PDF). Retrieved 19 December 2011.

9.9 Further reading

• Herlihy, M.; Shavit, N. (1999). “The topological structure of asynchronous computability”. Journal of the ACM 46 (6): 858. doi:10.1145/331524.331529.

• Saks, M.; Zaharoglou, F. (2000). “Wait-Free k-Set Agreement is Impossible: The Topology of Public Knowledge”. SIAM J. Comput. 29 (5): 1449–1483. doi:10.1137/S0097539796307698.

Chapter 10

Data lineage

"Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time.” [1] It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for step-wise debugging or regenerating lost output. In fact, database systems have used such information, called data provenance, to address similar validation and debugging challenges already.[2] Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. "Lineage is a simple type of why provenance.”[2]

10.1 Case for Data Lineage

The world of big data is changing dramatically right before our eyes. Statistics say that ninety percent (90%) of the world’s data has been created in the last two years alone.[3] This explosion of data has resulted in an ever-growing number of systems and automation at all levels in all sizes of organizations. Today, distributed systems like Google MapReduce,[4] Microsoft Dryad,[5] Apache Hadoop[6] (an open-source project) and Google Pregel[7] provide such platforms for businesses and users. However, even with these systems, big data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores.[8] “The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the largest genome sequencing houses in the world now store petabytes of data apiece.”[9] Due to the enormous size of big data, there could be features in the data that are not considered in the machine learning algorithm, possibly even outliers. It is very difficult for a data scientist to trace an unknown or an unanticipated result.

10.1.1 Big Data Debugging

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning algorithms and other transformations are applied to the data. Due to the enormous size of the data, there could be unknown features in the data, possibly even outliers. It is quite difficult for a data scientist to debug an unexpected result. The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for step-wise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems due to the growing ease of access to relevant data sources for use in experiments, sharing of data between scientific communities and use of third-party data in business enterprises.[10][11][12][13] These problems

will only become larger and more acute as these systems and data continue to grow. As such, more cost-efficient ways of analyzing DISC (data-intensive scalable computing) system analytics are crucial to their continued use.

10.1.2 Challenges in Big Data Debugging

Massive Scale

The past two decades have seen an explosion in the collection and storage of digital information. In 2012, 2.8 zettabytes (roughly 2.8 sextillion bytes, or the equivalent of 24 quintillion tweets) were created or replicated, according to the research firm IDC. There are hundreds or thousands of petabyte-scale databases today, whereas two decades ago databases of that size did not exist at all. Working with this scale of data has become very challenging.[14]

Unstructured Data

The phrase unstructured data usually refers to information that doesn't reside in a traditional row-column database. As you might expect, it is the opposite of structured data, the kind stored in fields in a database. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. “Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big data is unstructured data.”[15]

Long Runtime

In today’s hyper-competitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly. The challenge is going through the sheer volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly. Another method is putting data in-memory but using a distributed approach, where many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes. Even with this level of sophisticated hardware and software, some of the large-scale image processing tasks take a few days to a few weeks.[16] Debugging of the data processing is extremely hard due to the long run times.

Complex Platform

Big Data platforms have a very complicated structure. Data is distributed among several machines. Typically the jobs are mapped into several machines and results are later combined by reduce operations. Debugging of a big data pipeline becomes very challenging because of the very nature of the system. It will not be an easy task for the data scientist to figure out which machine’s data has the outliers and unknown features causing a particular algorithm to give unexpected results.

10.1.3 Proposed Solution

Data provenance or data lineage can be used to make the debugging of the big data pipeline easier. This necessitates the collection of data about data transformations. The section below explains data provenance in more detail.

10.2 Data Provenance

Data provenance provides a historical record of the data and its origins. The provenance of data which is generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes. The use of data provenance is proposed in distributed systems to trace records through a dataflow, replay the dataflow on a subset of its original inputs and debug data flows. To do so, one needs to keep track of the set of inputs to each operator, which were used to derive each of its outputs. Although there are several forms of provenance, such as copy-provenance and how-provenance,[13][17] the information we need is a simple form of why-provenance, or lineage, as defined by Cui et al.[18]

10.3 Lineage Capture

Intuitively, for an operator T producing output o, lineage consists of triplets of the form {I, T, o}, where I is the set of inputs to T used to derive o. Capturing lineage for each operator T in a dataflow enables users to ask questions such as “Which outputs were produced by an input i on operator T?” and “Which inputs produced output o in operator T?”[2] A query that finds the inputs deriving an output is called a backward tracing query, while one that finds the outputs produced by an input is called a forward tracing query.[19] Backward tracing is useful for debugging, while forward tracing is useful for tracking error propagation.[19] Tracing queries also form the basis for replaying an original dataflow.[11][18][19] However, to efficiently use lineage in a DISC system, we need to be able to capture lineage at multiple levels (or granularities) of operators and data, capture accurate lineage for DISC processing constructs and be able to trace through multiple dataflow stages efficiently. A DISC system consists of several levels of operators and data, and different use cases of lineage can dictate the level at which lineage needs to be captured. Lineage can be captured at the level of the job, using files and giving lineage tuples of the form {IF_i, MRJob, OF_i}; lineage can also be captured at the level of each task, using records and giving, for example, lineage tuples of the form {(k_rr, v_rr), map, (k_m, v_m)}. The first form of lineage is called coarse-grain lineage, while the second form is called fine-grain lineage. Integrating lineage across different granularities enables users to ask questions such as “Which file read by a MapReduce job produced this particular output record?” and can be useful in debugging across different operator and data granularities within a dataflow.[2]
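
A minimal in-memory sketch of {I, T, o} associations and the two tracing queries; the class and method names are illustrative and not taken from any particular DISC system.

class LineageStore:
    # Record {I, T, o} associations and answer tracing queries.

    def __init__(self):
        self.associations = []                        # (inputs, operator, output)

    def record(self, inputs, operator, output):
        self.associations.append((set(inputs), operator, output))

    def backward_trace(self, output):
        # Backward tracing query: which inputs were used to derive `output`?
        found = set()
        for inputs, _, out in self.associations:
            if out == output:
                found |= inputs
        return found

    def forward_trace(self, item):
        # Forward tracing query: which outputs did `item` contribute to?
        return {out for inputs, _, out in self.associations if item in inputs}

store = LineageStore()
store.record({"rec1", "rec2"}, "map", "m1")           # fine-grain: record level
store.record({"m1"}, "reduce", "out1")
print(store.backward_trace("out1"))                   # {'m1'}
print(store.forward_trace("rec1"))                    # {'m1'}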

MapReduce job showing containment relationships

To capture end-to-end lineage in a DISC system, we use the Ibis model,[20] which introduces the notion of containment hierarchies for operators and data. Specifically, Ibis proposes that an operator can be contained within another and such a relationship between two operators is called operator containment. “Operator containment implies that the contained (or child) operator performs a part of the logical operation of the containing (or parent) operator.”[2] For example, a MapReduce task is contained in a job. Similar containment relationships exist for data as well, called data containment. Data containment implies that the contained data is a subset of the containing data (superset).

Containment Hierarchy

10.4 Active vs Lazy Lineage

Lazy lineage collection typically captures only coarse-grain lineage at run time. These systems incur low capture overheads due to the small amount of lineage they capture. However, to answer fine-grain tracing queries, they must replay the data flow on all (or a large part) of its input and collect fine-grain lineage during the replay. This approach is suitable for forensic systems, where a user wants to debug an observed bad output. Active collection systems capture entire lineage of the data flow at run time. The kind of lineage they capture may be coarse-grain or fine-grain, but they do not require any further computations on the data flow after its execution. Active fine-grain lineage collection systems incur higher capture overheads than lazy collection systems. However, they enable sophisticated replay and debugging.[2]

10.5 Actors

An actor is an entity that transforms data; it may be a Dryad vertex, individual map and reduce operators, a MapReduce job, or an entire dataflow pipeline. Actors act as black boxes and the inputs and outputs of an actor are tapped to capture lineage in the form of associations, where an association is a triplet {i, T, o} that relates an input i with an output o for an actor T. The instrumentation thus captures lineage in a dataflow one actor at a time, piecing it into a set of associations for each actor. The system developer needs to capture the data an actor reads (from other actors) and the data an actor writes (to other actors). For example, a developer can treat the Hadoop Job Tracker as an actor by recording the set of files read and written by each job.[21]

10.6 Associations

An association is a combination of the inputs, outputs and the operation itself. The operation is represented in terms of a black box, also known as the actor. The associations describe the transformations that are applied to the data.

The associations are stored in association tables. Each unique actor is represented by its own association table. An association itself looks like {i, T, o}, where i is the set of inputs to the actor T and o is the set of outputs produced by the actor. Associations are the basic units of data lineage. Individual associations are later combined to construct the entire history of transformations that were applied to the data.[2]

10.7 Architecture

Big data systems scale horizontally, i.e., they increase capacity by adding new hardware or software entities into the distributed system. The distributed system acts as a single entity at the logical level even though it comprises multiple hardware and software entities. The system should continue to maintain this property after horizontal scaling. An important advantage of horizontal scalability is that it can provide the ability to increase capacity on the fly. Another major advantage is that horizontal scaling can be done using commodity hardware. The horizontal scaling feature of big data systems should be taken into account while creating the architecture of the lineage store. This is essential because the lineage store itself should also be able to scale in parallel with the big data system. The number of associations and the amount of storage required to store lineage will increase with the increase in size and capacity of the system. The architecture of big data systems makes the use of a single, centralized lineage store inappropriate and impossible to scale. The immediate solution to this problem is to distribute the lineage store itself.[2] The best-case scenario is to use a local lineage store for every machine in the distributed system network. This allows the lineage store also to scale horizontally. In this design, the lineage of data transformations applied to the data on a particular machine is stored on the local lineage store of that specific machine. The lineage store typically stores association tables. Each actor is represented by its own association table. The rows are the associations themselves and the columns represent inputs and outputs. This design solves two problems. It allows horizontal scaling of the lineage store, and it avoids the additional network latency that a single centralized lineage store would incur, since the lineage information no longer has to be carried over the network.[21]

Architecture of Lineage Systems

10.8 Data flow Reconstruction

The information stored in terms of associations needs to be combined by some means to get the data flow of a particular job. In a distributed system a job is broken down into multiple tasks. One or more instances run a particular task. The results produced on these individual machines are later combined together to finish the job. Tasks running on different machines perform multiple transformations on the data on that machine. All the transformations applied to the data on a machine are stored in the local lineage store of that machine. This information needs to be combined together to get the lineage of the entire job. The lineage of the entire job should help the data scientist understand the data flow of the job, and the data flow can then be used to debug the big data pipeline. The data flow is reconstructed in three stages.

10.8.1 Association tables

The first stage of the data flow reconstruction is the computation of the association tables. The association tables exist for each actor in each local lineage store. The entire association table for an actor can be computed by combining these individual association tables. This is generally done using a series of equality joins based on the actors themselves. In a few scenarios the tables might also be joined using inputs as the key. Indexes can also be used to improve the efficiency of a join. The joined tables need to be stored on a single instance or machine to further continue processing. There are multiple schemes that are used to pick a machine where a join would be computed. The simplest is the one with minimum CPU load. Space constraints should also be kept in mind while picking the instance where the join would happen.

10.8.2 Association Graph

The second step in data flow reconstruction is computing an association graph from the lineage information. The graph represents the steps in the data flow. The actors act as vertices and the associations act as edges. Each actor T is linked to its upstream and downstream actors in the data flow. An upstream actor of T is one that produced the input of T, while a downstream actor is one that consumes the output of T . Containment relationships are always considered while creating the links. The graph consists of three types of links or edges.

Explicitly specified links

The simplest link is an explicitly specified link between two actors. These links are explicitly specified in the code of a machine learning algorithm. When an actor is aware of its exact upstream or downstream actor, it can communicate this information to the lineage API. This information is later used to link these actors during the tracing query. For example, in the MapReduce architecture, each map instance knows the exact record reader instance whose output it consumes.[2]

Logically inferred links

Developers can attach data flow archetypes to each logical actor. A data flow archetype explains how the children types of an actor type arrange themselves in a data flow. With the help of this information, one can infer a link between each actor of a source type and a destination type. For example, in the MapReduce architecture, the map actor type is the source for reduce, and vice versa. The system infers this from the data flow archetypes and duly links map instances with reduce instances. However, there may be several MapReduce jobs in the data flow, and linking all map instances with all reduce instances can create false links. To prevent this, such links are restricted to actor instances contained within a common actor instance of a containing (or parent) actor type. Thus, map and reduce instances are only linked to each other if they belong to the same job.[2]

Implicit links through data set sharing

In distributed systems, sometimes there are implicit links, which are not specified during execution. For example, an implicit link exists between an actor that wrote to a file and another actor that read from it. Such links connect actors which use a common data set for execution. The dataset is the output of the first actor and is the input of the actor following it.[2]

10.8.3 Topological Sorting

The final step in the data flow reconstruction is the topological sorting of the association graph. The directed graph created in the previous step is topologically sorted to obtain the order in which the actors have modified the data. This inherent order of the actors defines the data flow of the big data pipeline or task.
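
A sketch of these last two stages: actors become vertices, implicit data-set-sharing links become edges, and a topological sort (Kahn's algorithm) yields the order in which the actors modified the data. The association format follows the earlier sketch and the function names are illustrative.

from collections import defaultdict, deque

def build_association_graph(associations):
    # Edges upstream -> downstream, inferred from shared data sets:
    # actor B is downstream of actor A if some output of A is an input of B.
    produced_by = defaultdict(set)                 # data item -> producing actors
    for _, actor, output in associations:
        produced_by[output].add(actor)
    edges = defaultdict(set)
    for inputs, actor, _ in associations:
        for item in inputs:
            for upstream in produced_by[item]:
                if upstream != actor:
                    edges[upstream].add(actor)
    return edges

def topological_order(edges):
    # Kahn's algorithm over the association graph.
    indegree = defaultdict(int)
    nodes = set(edges)
    for src, dsts in edges.items():
        nodes |= dsts
        for dst in dsts:
            indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dst in edges.get(node, ()):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return order

associations = [({"rec1", "rec2"}, "map", "m1"), ({"m1"}, "reduce", "out1")]
print(topological_order(build_association_graph(associations)))   # ['map', 'reduce']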

10.9 Tracing & Replay

This is the most crucial step in big data debugging. The captured lineage is combined and processed to obtain the data flow of the pipeline. The data flow helps the data scientist or a developer to look deeply into the actors and their transformations. This step allows the data scientist to figure out the part of the algorithm that is generating the unexpected output. A big data pipeline can go wrong in two broad ways. The first is the presence of a suspicious actor in the data-flow. The second is the existence of outliers in the data.

The first case can be debugged by tracing the data-flow. By using lineage and data-flow information together a data scientist can figure out how the inputs are converted into outputs. During the process, actors that behave unexpectedly can be caught. Either these actors can be removed from the data flow or they can be augmented by new actors to change the data-flow. The improved data-flow can be replayed to test its validity. Debugging faulty actors includes recursively performing coarse-grain replay on actors in the data-flow,[22] which can be expensive in resources for long dataflows. Another approach is to manually inspect lineage logs to find anomalies,[12][23] which can be tedious and time-consuming across several stages of a data-flow. Furthermore, these approaches work only when the data scientist can discover bad outputs. To debug analytics without known bad outputs, the data scientist needs to analyze the data-flow for suspicious behavior in general. However, often, a user may not know the expected normal behavior and cannot specify predicates. This section describes a debugging methodology for retrospectively analyzing lineage to identify faulty actors in a multi-stage data-flow. We believe that sudden changes in an actor’s behavior, such as its average selectivity, processing rate or output size, are characteristic of an anomaly. Lineage can reflect such changes in actor behavior over time and across different actor instances. Thus, mining lineage to identify such changes can be useful in debugging faulty actors in a data-flow.
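
A minimal sketch of that heuristic: flag actor instances whose selectivity (outputs per input, derivable from the captured associations) deviates sharply from their peers. The data layout and the threshold are illustrative assumptions.

from statistics import mean, stdev

def flag_anomalous_instances(selectivity, k=2.0):
    # `selectivity` maps an actor instance to its outputs-per-input ratio;
    # instances more than k standard deviations from the mean are flagged.
    values = list(selectivity.values())
    if len(values) < 2:
        return set()
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return set()
    return {inst for inst, s in selectivity.items() if abs(s - mu) > k * sigma}

# Eight mapper instances with similar selectivity and one that emits far more
# records per input; only the unusual instance is flagged.
readings = {"map-%d" % i: s for i, s in
            enumerate([0.97, 1.03, 0.99, 1.01, 1.02, 0.98, 1.00, 1.04, 9.50])}
print(flag_anomalous_instances(readings, k=2.0))   # {'map-8'}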

Tracing Anomalous Actors

The second problem, i.e., the existence of outliers, can also be identified by running the data flow step-wise and looking at the transformed outputs. The data scientist finds a subset of outputs that are not in accordance with the rest of the outputs. The inputs which are causing these bad outputs are the outliers in the data. This problem can be solved by removing the set of outliers from the data and replaying the entire data flow. It can also be solved by modifying the machine learning algorithm by adding, removing or moving actors in the data flow. The changes in the data flow are successful if the replayed data flow does not produce bad outputs.

Tracing Outliers in the data

10.10 Challenges

Even though using data lineage is a novel way of debugging big data pipelines, the process is not simple. The challenges include the scalability of the lineage store, fault tolerance of the lineage store, accurate capture of lineage for black-box operators, and many others. These challenges must be considered carefully, and trade-offs between them need to be evaluated to make a realistic design for data lineage capture.

10.10.1 Scalability

DISC systems are primarily batch processing systems designed for high throughput. They execute several jobs per analysis, with several tasks per job. The overall number of operators executing at any time in a cluster can range from hundreds to thousands depending on the cluster size. Lineage capture for these systems must be able to scale to both large volumes of data and numerous operators to avoid being a bottleneck for the DISC analytics.

10.10.2 Fault tolerance

Lineage capture systems must also be fault tolerant to avoid rerunning data flows to capture lineage. At the same time, they must also accommodate failures in the DISC system. To do so, they must be able to identify a failed DISC task and avoid storing duplicate copies of lineage between the partial lineage generated by the failed task and the duplicate lineage produced by the restarted task. A lineage system should also be able to gracefully handle multiple instances of local lineage systems going down. This can be achieved by storing replicas of lineage associations on multiple machines; a replica acts as a backup in the event the primary copy is lost.

10.10.3 Black-box operators

Lineage systems for DISC dataflows must be able to capture accurate lineage across black-box operators to enable fine-grain debugging. Current approaches to this include Prober, which seeks to find the minimal set of inputs that can produce a specified output for a black-box operator by replaying the data-flow several times to deduce the minimal set,[24] and dynamic slicing, as used by Zhang et al.[25] to capture lineage for NoSQL operators through binary rewriting to compute dynamic slices. Although producing highly accurate lineage, such techniques can incur significant time overheads for capture or tracing, and it may be preferable to instead trade some accuracy for better performance. Thus, there is a need for a lineage collection system for DISC dataflows that can capture lineage from arbitrary operators with reasonable accuracy, and without significant overheads in capture or tracing.

10.10.4 Efficient tracing

Tracing is essential for debugging, during which a user can issue multiple tracing queries. Thus, it is important that tracing has fast turnaround times. Ikeda et al.[19] can perform efficient backward tracing queries for MapReduce dataflows, but their approach is not generic to different DISC systems and does not perform efficient forward queries. Lipstick,[26] a lineage system for Pig,[27] while able to perform both backward and forward tracing, is specific to Pig and SQL operators and can only perform coarse-grain tracing for black-box operators. Thus, there is a need for a lineage system that enables efficient forward and backward tracing for generic DISC systems and dataflows with black-box operators.
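To make the notions of backward and forward tracing concrete, here is a minimal sketch over a hypothetical fine-grained association table; real systems store and index such associations very differently.

```python
from collections import defaultdict

# Hypothetical fine-grained associations: (input_record_id, actor_id, output_record_id).
associations = [
    ("in1", "map_1", "mid1"), ("in2", "map_1", "mid2"),
    ("mid1", "reduce_1", "out1"), ("mid2", "reduce_1", "out1"),
]

forward = defaultdict(set)   # record -> records derived from it
backward = defaultdict(set)  # record -> records it was derived from
for src, _actor, dst in associations:
    forward[src].add(dst)
    backward[dst].add(src)

def trace(start, index):
    """Transitive closure over one direction of the lineage graph."""
    seen, stack = set(), [start]
    while stack:
        rec = stack.pop()
        for nxt in index[rec]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(trace("out1", backward))  # backward trace: {'mid1', 'mid2', 'in1', 'in2'}
print(trace("in1", forward))    # forward trace:  {'mid1', 'out1'}
```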

10.10.5 Sophisticated replay

Replaying only specific inputs or portions of a data-flow is crucial for efficient debugging and simulating what-if scenarios. Ikeda et al. present a methodology for lineage-based refresh, which selectively replays updated inputs to recompute affected outputs.[28] This is useful during debugging for re-computing outputs when a bad input has been fixed. However, sometimes a user may want to remove the bad input and replay the lineage of outputs previously affected by the error to produce error-free outputs. We call this exclusive replay. Another use of replay in debugging involves replaying bad inputs for step-wise debugging (called selective replay). Current approaches to using lineage in DISC systems do not address these. Thus, there is a need for a lineage system that can perform both exclusive and selective replays to address different debugging needs.

10.10.6 Anomaly detection

One of the primary debugging concerns in DISC systems is identifying faulty operators. In long dataflows with several hundreds of operators or tasks, manual inspection can be tedious and prohibitive. Even if lineage is used to narrow the subset of operators to examine, the lineage of a single output can still span several operators. There is a need for an inexpensive automated debugging system, which can substantially narrow the set of potentially faulty operators, with reasonable accuracy, to minimize the amount of manual examination required.[2]

10.11 See also

• Provenance

• Big Data

• Topological Sorting

• Debugging

• NoSQL

• Scalability

• Directed acyclic graph

10.12 References

[1] http://www.techopedia.com/definition/28040/data-lineage

[2] De, Soumyarupa. (2012). Newt : an architecture for lineage based replay and debugging in DISC systems. UC San Diego: b7355202. Retrieved from: https://escholarship.org/uc/item/3170p7zn

[3] http://newstex.com/2014/07/12/thedataexplosionin2014minutebyminuteinfographic/

[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[5] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 59–72, New York, NY, USA, 2007. ACM.

[6] Apache Hadoop. http://hadoop.apache.org.

[7] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[8] Shimin Chen and Steven W. Schlosser. Map-reduce meets wider varieties of applications. Technical report, Intel Research, 2008.

[9] The data deluge in genomics. https://www-304.ibm.com/connections/blogs/ibmhealthcare/entry/data overload in genomics3?lang=de, 2010.

[10] Yogesh L. Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, September 2005.

[11] Ian Foster, Jens Vockler, Michael Wilde, and Yong Zhao. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In 14th International Conference on Scientific and Statistical Database Management, July 2002.

[12] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google Inc, 2010.

[13] Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Data provenance: Some basic issues. In Proceedings of the 20th Conference on Foundations of SoftwareTechnology and Theoretical Computer Science, FST TCS 2000, pages 87–93, London, UK, UK, 2000. Springer-Verlag

[14] The Wired. http://www.wired.com/2013/04/bigdata/

[15] Webopedia http://www.webopedia.com/TERM/U/unstructured_data.html

[16] SAS. http://www.sas.com/resources/asset/five-big-data-challenges-article.pdf

[17] Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford University, 2009.

[18] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB Journal, 12(1), 2003.

[19] Robert Ikeda, Hyunjung Park, and Jennifer Widom. Provenance for generalized map and reduce workflows. In Proc. of CIDR, January 2011.

[20] C. Olston and A. Das Sarma. Ibis: A provenance manager for multi-layer systems. In Proc. of CIDR, January 2011.

[21] Dionysios Logothetis, Soumyarupa De, and Kenneth Yocum. 2013. Scalable lineage capture for debugging DISC analytics. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 17, 15 pages.

[22] Wenchao Zhou, Qiong Fei, Arjun Narayan, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. Secure network provenance. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), December 2011.

[23] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-trace: A pervasive network tracing framework. In In Proceedings of NSDI’07, 2007.

[24] Anish Das Sarma, Alpa Jain, and Philip Bohannon. PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines. Technical report, Yahoo, April 2010.

[25] Mingwu Zhang, Xiangyu Zhang, Xiang Zhang, and Sunil Prabhakar. Tracing lineage beyond relational operators. In Proc. Conference on Very Large Data Bases (VLDB), September 2007.

[26] Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, and Julia Stoyanovich. Putting lipstick on a pig: En- abling database-style workflow provenance. In Proc. of VLDB, August 2011.

[27] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: A not-so-foreign language for data processing. In Proc. of ACM SIGMOD, Vancouver, Canada, June 2008.

[28] Robert Ikeda, Semih Salihoglu, and Jennifer Widom. Provenance-based refresh in data-oriented workflows. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 1659–1668, New York, NY, USA, 2011. ACM.

Chapter 11

Deadlock

This article is about the computer science concept. For other uses, see Deadlock (disambiguation). In concurrent programming, a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

Both processes need resources to continue execution: P1 requires additional resource R1 and is in possession of resource R2, while P2 requires additional resource R2 and is in possession of R1; neither process can continue.

In a transactional database, a deadlock happens when two processes, each within its own transaction, update two rows of information but in the opposite order. For example, process A updates row 1 then row 2 in the exact timeframe that process B updates row 2 then row 1. Process A cannot finish updating row 2 until process B is finished, but process B cannot finish updating row 1 until process A is finished. No matter how much time is allowed to pass, this situation will never resolve itself, and because of this, database management systems will typically kill the transaction of the process that has done the least amount of work.

In an operating system, a deadlock is a situation which occurs when a process or thread enters a waiting state because a requested resource is being held by another waiting process, which in turn is waiting for another resource held by another waiting process. If a process is unable to change its state indefinitely because the resources requested by it are being used by another waiting process, then the system is said to be in a deadlock.[1]

Deadlock is a common problem in multiprocessing systems, parallel computing and distributed systems, where software and hardware locks are used to handle shared resources and implement process synchronization.[2] In telecommunication systems, deadlocks occur mainly due to lost or corrupt signals instead of resource contention.[3]


11.1 Examples

Any deadlock situation can be compared to the classic "chicken or egg" problem.[4] It can also be considered a paradoxical "Catch-22" situation.[5] A real-world example is an illogical statute passed by the Kansas legislature in the early 20th century, which in effect required that when two trains approach each other at a crossing, both must come to a full stop and neither may start up again until the other has gone.[1][6]

A simple computer-based example is as follows. Suppose a computer has three CD drives and three processes. Each of the three processes holds one of the drives. If each process now requests another drive, the three processes will be in a deadlock. Each process will be waiting for the “CD drive released” event, which can only be caused by one of the other waiting processes. Thus, it results in a circular chain.

Moving to the source code level, a deadlock can occur even in the case of a single thread and one resource (protected by a mutex). Assume there is a function func1() which does some work on the resource, locking the mutex at the beginning and releasing it after it is done. Next, somebody creates a different function func2() following that pattern on the same resource (lock, do work, release) but decides to include a call to func1() to delegate a part of the job. What will happen is that the mutex will be locked once when entering func2() and then again at the call to func1(), resulting in a deadlock if the mutex is not reentrant (i.e. the plain “fast mutex” variety).
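The following Python sketch reproduces the self-deadlock pattern just described with a non-reentrant threading.Lock; an acquire timeout is used only so that the example terminates and reports the problem instead of hanging.

```python
import threading

resource_lock = threading.Lock()   # non-reentrant, like a plain "fast mutex"

def func1():
    # Try to take the lock; the timeout is only here so the example terminates.
    if not resource_lock.acquire(timeout=1.0):
        print("func1: could not acquire the lock -- self-deadlock detected")
        return
    try:
        print("func1: working on the shared resource")
    finally:
        resource_lock.release()

def func2():
    with resource_lock:            # lock taken here ...
        print("func2: working on the shared resource")
        func1()                    # ... and requested again here -> deadlock

func2()
# With a reentrant lock (threading.RLock) the nested acquire in func1 would succeed.
```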

11.2 Necessary conditions

A deadlock situation can arise if all of the following conditions hold simultaneously in a system:[1]

1. Mutual Exclusion: At least one resource must be held in a non-shareable mode.[1] Only one process can use the resource at any given instant of time.

2. Hold and Wait or Resource Holding: A process is currently holding at least one resource and requesting additional resources which are being held by other processes.

3. No Preemption: A resource can be released only voluntarily by the process holding it.

4. Circular Wait: A process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource. In general, there is a set of waiting processes, P = {P1, P2, ..., PN}, such that P1 is waiting for a resource held by P2, P2 is waiting for a resource held by P3, and so on until PN is waiting for a resource held by P1.[1][7]

These four conditions are known as the Coffman conditions from their first description in a 1971 article by Edward G. Coffman, Jr.[7] Unfulfillment of any of these conditions is enough to preclude a deadlock from occurring.

11.3 Avoiding database deadlocks

An effective way to avoid database deadlocks is to follow this approach from the Oracle Locking Survival Guide:

Application developers can eliminate all risk of enqueue deadlocks by ensuring that transactions requiring multiple resources always lock them in the same order.[8]

This single sentence needs some explanation:

• First, it highlights the fact that processes must be inside a transaction for deadlocks to happen. Note that some database systems can be configured to cascade deletes, which generate implicit transactions which then can cause deadlocks. Also, some DBMS vendors offer row-level locking, a type of record locking which greatly reduces the chance of deadlocks, as opposed to page-level locking, which has the potential of locking out much more processing.

• Second, the reference to “multiple resources” means “more than one row in one or more tables”. An example of locking in the same order might involve processing all INSERTS first, all UPDATES second, and all DELETES last; within the processing of each of these, handling all parent-table changes before child-table changes; and processing table changes in the same order (such as alphabetically, or ordered by an ID or account number).

• Third, eliminating all risk of deadlocks is difficult to achieve when the DBMS has automatic lock-escalation features that raise row-level locks into page locks which can escalate to table locks. Although the risk or chance of experiencing a deadlock will not go to zero, as deadlocks tend to happen more on large, high-volume, complex systems, it can be greatly reduced, and, when required, programmers can enhance the software to retry transactions when the system detects a deadlock.

• Fourth, deadlocks can result in data loss if developers do not write the software specifying the use of transactions on every interaction with a DBMS; such data loss is difficult to locate and can cause unexpected errors and problems.

Deadlocks offer a challenging problem to correct as they result in data loss, are difficult to isolate, cause unexpected problems, and are time-consuming to fix. Modifying every section of software code in a large database-oriented system in order to always lock resources in the same order when the order is inconsistent takes significant resources and testing to implement.
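A minimal sketch of the lock-ordering idea in Python (the per-row locks and worker names are illustrative; a real DBMS enforces locking internally): both workers touch the same two rows, but because each sorts its lock requests into the same global order, the circular wait of the earlier row-update example cannot arise.

```python
import threading

# One lock per row (hypothetical); deadlock is avoided by always acquiring the
# locks for the rows a transaction touches in a single global order (here: row id).
row_locks = {1: threading.Lock(), 2: threading.Lock()}

def update_rows(name, rows):
    ordered = sorted(rows)                 # the agreed global lock order
    for r in ordered:
        row_locks[r].acquire()
    try:
        print(f"{name}: updating rows {rows}")
    finally:
        for r in reversed(ordered):
            row_locks[r].release()

# Worker A touches rows (1, 2) and worker B touches (2, 1); because both sort their
# lock requests, neither can hold one row while waiting for the other.
a = threading.Thread(target=update_rows, args=("A", [1, 2]))
b = threading.Thread(target=update_rows, args=("B", [2, 1]))
a.start(); b.start(); a.join(); b.join()
```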

11.4 Deadlock handling

Most current operating systems cannot prevent deadlocks from occurring.[1] When a deadlock occurs, different operating systems respond to it in different non-standard manners. Most approaches work by preventing one of the four Coffman conditions from occurring, especially the fourth one.[9] Major approaches are as follows.

11.4.1 Ignoring deadlock

In this approach, it is assumed that a deadlock will never occur. This is also an application of the Ostrich algorithm.[9][10] This approach was initially used by MINIX and UNIX.[7] This is used when the time intervals between occurrences of deadlocks are large and the data loss incurred each time is tolerable.

11.4.2 Detection

Under deadlock detection, deadlocks are allowed to occur. Then the state of the system is examined to detect that a deadlock has occurred, and subsequently it is corrected. An algorithm is employed that tracks resource allocation and process states, and rolls back and restarts one or more of the processes in order to remove the detected deadlock. Detecting a deadlock that has already occurred is easily possible since the resources that each process has locked and/or currently requested are known to the resource scheduler of the operating system.[10] Deadlock detection techniques include, but are not limited to, model checking. This approach constructs a finite-state model on which it performs a progress analysis and finds all possible terminal sets in the model. These then each represent a deadlock. After a deadlock is detected, it can be corrected by using one of the following methods (a sketch of cycle detection on a wait-for graph follows the list):

1. Process Termination: One or more processes involved in the deadlock may be aborted. We can choose to abort all processes involved in the deadlock. This ensures that deadlock is resolved with certainty and speed, but the expense is high, as partial computations will be lost. Or, we can choose to abort one process at a time until the deadlock is resolved. This approach has high overhead because after each abort an algorithm must determine whether the system is still in deadlock. Several factors must be considered while choosing a candidate for termination, such as the priority and age of the process.

2. Resource Preemption: Resources allocated to various processes may be successively preempted and allocated to other processes until the deadlock is broken.
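As referenced above, a minimal sketch of deadlock detection as cycle detection on a wait-for graph (the process names and graph encoding are illustrative):

```python
def find_deadlock(wait_for):
    """Detect a cycle in a wait-for graph given as {process: [processes it waits on]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {p: WHITE for p in wait_for}
    path = []

    def visit(p):
        colour[p] = GREY
        path.append(p)
        for q in wait_for.get(p, []):
            if colour.get(q, WHITE) == GREY:          # back edge -> cycle found
                return path[path.index(q):] + [q]
            if colour.get(q, WHITE) == WHITE:
                cycle = visit(q)
                if cycle:
                    return cycle
        colour[p] = BLACK
        path.pop()
        return None

    for p in list(wait_for):
        if colour[p] == WHITE:
            cycle = visit(p)
            if cycle:
                return cycle
    return None

# P1 waits on P2, P2 waits on P3, P3 waits on P1 -> deadlock.
print(find_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))  # ['P1', 'P2', 'P3', 'P1']
print(find_deadlock({"P1": ["P2"], "P2": []}))                     # None
```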

11.4.3 Prevention

Main article: Deadlock prevention algorithms

Deadlock prevention works by preventing one of the four Coffman conditions from occurring.

• Removing the mutual exclusion condition means that no process will have exclusive access to a resource. This proves impossible for resources that cannot be spooled. But even with spooled resources, deadlock could still occur. Algorithms that avoid mutual exclusion are called non-blocking synchronization algorithms.

• The hold and wait or resource holding conditions may be removed by requiring processes to request all the resources they will need before starting up (or before embarking upon a particular set of operations). This advance knowledge is frequently difficult to satisfy and, in any case, is an inefficient use of resources. Another way is to require processes to request resources only when they have none; first, they must release all their currently held resources before requesting all the resources they will need from scratch. This too is often impractical, because resources may be allocated and remain unused for long periods. Also, a process requiring a popular resource may have to wait indefinitely, as such a resource may always be allocated to some process, resulting in resource starvation.[1] (These algorithms, such as serializing tokens, are known as the all-or-none algorithms.)

• The no preemption condition may also be difficult or impossible to avoid, as a process has to be able to have a resource for a certain amount of time, or the processing outcome may be inconsistent or thrashing may occur. However, the inability to enforce preemption may interfere with a priority algorithm. Preemption of a “locked out” resource generally implies a rollback, and is to be avoided, since it is very costly in overhead. Algorithms that allow preemption include lock-free and wait-free algorithms and optimistic concurrency control. This condition may also be removed as follows: if a process holding some resources requests another resource that cannot be immediately allocated to it, it must release all the resources it currently holds.

• The final condition is the circular wait condition. Approaches that avoid circular waits include disabling interrupts during critical sections and using a hierarchy to determine a partial ordering of resources. If no obvious hierarchy exists, even the memory address of resources has been used to determine ordering and resources are requested in the increasing order of the enumeration.[1] Dijkstra’s solution can also be used.

11.4.4 Avoidance

Deadlock can be avoided if certain information about processes is available to the operating system before allocation of resources, such as which resources a process will consume in its lifetime. For every resource request, the system checks whether granting the request would cause the system to enter an unsafe state, meaning a state that could result in deadlock. The system then only grants requests that will lead to safe states.[1] In order for the system to be able to determine whether the next state will be safe or unsafe, it must know in advance at any time:

• resources currently available

• resources currently allocated to each process

• resources that will be required and released by these processes in the future

It is possible for a process to be in an unsafe state but for this not to result in a deadlock. The notion of safe/unsafe states only refers to the ability of the system to enter a deadlock state or not. For example, if a process requests A, which would result in an unsafe state, but releases B, which would prevent circular wait, then the state is unsafe but the system is not in deadlock. One known algorithm that is used for deadlock avoidance is the Banker’s algorithm, which requires resource usage limits to be known in advance.[1] However, for many systems it is impossible to know in advance what every process will request. This means that deadlock avoidance is often impossible. Two other algorithms are Wait/Die and Wound/Wait, each of which uses a symmetry-breaking technique. In both of these algorithms there exists an older process (O) and a younger process (Y). Process age can be determined by a timestamp at process creation time: smaller timestamps represent older processes, while larger timestamps represent younger processes. Another way to avoid deadlock is to avoid blocking, for example by using non-blocking synchronization or read-copy-update.
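A minimal sketch of the Banker's algorithm safety check (the resource counts follow a common textbook example; a full implementation would also provisionally grant a request and re-run this check before committing to it):

```python
def is_safe(available, allocation, maximum):
    """Banker's safety check: return True if some ordering lets every process finish.
    available: one count per resource type; allocation/maximum: one row per process."""
    need = [[m - a for m, a in zip(max_row, alloc_row)]
            for max_row, alloc_row in zip(maximum, allocation)]
    work = list(available)
    finished = [False] * len(allocation)

    progressed = True
    while progressed:
        progressed = False
        for i, alloc_row in enumerate(allocation):
            if not finished[i] and all(n <= w for n, w in zip(need[i], work)):
                # Process i can run to completion and return its resources.
                work = [w + a for w, a in zip(work, alloc_row)]
                finished[i] = True
                progressed = True
    return all(finished)

# A 3-resource, 5-process state; this particular state is safe.
available  = [3, 3, 2]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]]
maximum    = [[7, 5, 3], [3, 2, 2], [9, 0, 2], [2, 2, 2], [4, 3, 3]]
print(is_safe(available, allocation, maximum))  # True
```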

11.5 Livelock

A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing. This term was defined formally at some time during the 1970s ‒ an early sighting in the published literature is in Babich’s 1979 article on program correctness.[11] Livelock is a special case of resource starvation; the general definition only states that a specific process is not progressing.[12] A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time. Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be avoided by ensuring that only one process (chosen arbitrarily or by priority) takes action.[13]

11.6 Distributed deadlock

Distributed deadlocks can occur in distributed systems when distributed transactions or concurrency control is being used. Distributed deadlocks can be detected either by constructing a global wait-for graph from local wait-for graphs at a deadlock detector or by a distributed algorithm like edge chasing. Phantom deadlocks are deadlocks that are falsely detected in a distributed system due to system internal delays but don't actually exist. For example, if a process releases a resource R1 and issues a request for R2, and the first message is lost or delayed, a coordinator (detector of deadlocks) could falsely conclude a deadlock (if the request for R2 while having R1 would cause a deadlock).
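A minimal sketch of the centralized variant: local wait-for graphs reported by each site are merged, after which an ordinary cycle check (such as the one sketched in the detection section above) reveals the global deadlock. Site and transaction names are illustrative.

```python
def merge_wait_for_graphs(local_graphs):
    """Merge per-site wait-for graphs into a single global wait-for graph."""
    merged = {}
    for graph in local_graphs:
        for proc, waits_on in graph.items():
            merged.setdefault(proc, set()).update(waits_on)
    return {p: sorted(w) for p, w in merged.items()}

site_a = {"T1": ["T2"]}          # on site A, transaction T1 waits for T2
site_b = {"T2": ["T1"]}          # on site B, T2 waits for T1 -> global cycle
print(merge_wait_for_graphs([site_a, site_b]))  # {'T1': ['T2'], 'T2': ['T1']}
```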

11.7 See also

11.8 References

[1] Silberschatz, Abraham (2006). Operating System Principles (7 ed.). Wiley-India. p. 237. ISBN 9788126509621. Re- trieved 29 January 2012.

[2] Padua, David (2011). Encyclopedia of Parallel Computing. Springer. p. 524. ISBN 9780387097657. Retrieved 28 January 2012.

[3] Schneider, G. Michael (2009). Invitation to Computer Science. Cengage Learning. p. 271. ISBN 0324788592. Retrieved 28 January 2012.

[4] Rolling, Andrew (2009). Andrew Rollings and Ernest Adams on game design. New Riders. p. 421. ISBN 9781592730018. Retrieved 28 January 2012.

[5] Oaks, Scott (2004). Java Threads. O'Reilly. p. 64. ISBN 9780596007829. Retrieved 28 January 2012.

[6] A Treasury of Railroad Folklore, B.A. Botkin & A.F. Harlow, p. 381

[7] Shibu, K. (2009). Intro To Embedded Systems (1st ed.). Tata McGraw-Hill Education. p. 446. ISBN 9780070145894. Retrieved 28 January 2012.

[8] “Oracle Locking Survival Guide”. Archived from the original on 15 January 2008.

[9] Stuart, Brian L. (2008). Principles of operating systems (1st ed.). Cengage Learning. p. 446. Retrieved 28 January 2012.

[10] Tanenbaum, Andrew S. (1995). Distributed Operating Systems (1st ed.). Pearson Education. p. 117. Retrieved 28 January 2012.

[11] Babich, A.F. (1979). “Proving Total Correctness of Parallel Programs”. IEEE Transactions on Software Engineering SE–5 (6): 558–574. doi:10.1109/tse.1979.230192.

[12] Anderson, James H.; Yong-jik Kim (2001). “Shared-memory mutual exclusion: Major research trends since 1986”.

[13] Zöbel, Dieter (October 1983). “The Deadlock problem: a classifying bibliography”. ACM SIGOPS Operating Systems Review 17 (4): 6–15. doi:10.1145/850752.850753. ISSN 0163-5980.

11.9 Further reading

• Kaveh, Nima; Emmerich, Wolfgang. “Deadlock Detection in Distributed Object Systems” (PDF). London: University College London.

• Bensalem, Saddek; Fernandez, Jean-Claude; Havelund, Klaus; Mounier, Laurent (2006). “Confirmation of deadlock potentials detected by runtime analysis”. Proceedings of the 2006 workshop on Parallel and distributed systems: Testing and debugging (ACM): 41–50. doi:10.1145/1147403.1147412. ISBN 1595934146.

• Coffman, Edward G., Jr.; Elphick, Michael J.; Shoshani, Arie (1971). “System Deadlocks” (PDF). ACM Computing Surveys 3 (2): 67–78. doi:10.1145/356586.356588.

• Mogul, Jeffrey C.; Ramakrishnan, K. K. (1997). “Eliminating receive livelock in an interrupt-driven kernel”. ACM Transactions on Computer Systems 15 (3): 217–252. doi:10.1145/263326.263335. ISSN 0734-2071.

• Havender, James W. (1968). “Avoiding deadlock in multitasking systems”. IBM Systems Journal 7 (2): 74. doi:10.1147/sj.72.0074.

• Holliday, JoAnne L.; El Abbadi, Amr. “Distributed Deadlock Detection”. Encyclopedia of Distributed Computing (Kluwer Academic Publishers).

• Knapp, Edgar (1987). “Deadlock detection in distributed databases”. ACM Computing Surveys 19 (4): 303–328. doi:10.1145/45075.46163. ISSN 0360-0300.

• Ling, Yibei; Chen, Shigang; Chiang, Jason (2006). “On Optimal Deadlock Detection Scheduling”. IEEE Transactions on Computers 55 (9): 1178–1187. doi:10.1109/tc.2006.151.

11.10 External links

• "Advanced Synchronization in Java Threads" by Scott Oaks and Henry Wong

• Deadlock Detection Agents

• DeadLock at the Portland Pattern Repository

• Etymology of “Deadlock”

Chapter 12

Distributed concurrency control

Distributed concurrency control is the concurrency control of a system distributed over a computer network (Bernstein et al. 1987, Weikum and Vossen 2001). In database systems and transaction processing (transaction management), distributed concurrency control refers primarily to the concurrency control of a distributed database. It also refers to the concurrency control in a multidatabase (and other multi-transactional object) environment (e.g., federated database, grid computing, and cloud computing environments). A major goal for distributed concurrency control is distributed serializability (or global serializability for multidatabase systems). Distributed concurrency control poses special challenges beyond centralized concurrency control, primarily due to communication and computer latency. It often requires special techniques, like a distributed lock manager over fast computer networks with low latency, like switched fabric (e.g., InfiniBand). Commitment ordering (or commit ordering) is a general serializability technique that achieves distributed serializability (and global serializability in particular) effectively on a large scale, without concurrency control information distribution (e.g., local precedence relations, locks, timestamps, or tickets), and thus without the performance penalties that are typical of other serializability techniques (Raz 1992).

The most common distributed concurrency control technique is strong strict two-phase locking (SS2PL, also named rigorousness), which is also a common centralized concurrency control technique. SS2PL provides the serializability, strictness, and commitment ordering properties. Strictness, a special case of recoverability, is utilized for effective recovery from failure, and commitment ordering allows participating in a general solution for global serializability. For large-scale distribution and complex transactions, distributed locking’s typical heavy performance penalty (due to delays, latency) can be avoided by using the atomic commitment protocol, which is needed in a distributed database for (distributed) transactions’ atomicity (e.g., two-phase commit, or a simpler one in a reliable system), together with some local commitment ordering variant (e.g., local SS2PL) instead of distributed locking, to achieve global serializability in the entire system. All the commitment ordering theoretical results are applicable whenever atomic commitment is utilized over partitioned, distributed recoverable (transactional) data, including automatic distributed deadlock resolution. Such a technique can also be utilized for a large-scale parallel database, where a single large database, residing on many nodes and using a distributed lock manager, is replaced with a (homogeneous) multidatabase, comprising many relatively small databases (loosely defined; any process that supports transactions over partitioned data and participates in atomic commitment complies), each fitting into a single node, and using commitment ordering (e.g., SS2PL, strict CO) together with some appropriate atomic commitment protocol (without using a distributed lock manager).

12.1 See also

• Global concurrency control

12.2 References

• Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman (1987): Concurrency Control and Recovery in Database Systems, Addison Wesley Publishing Company, 1987, ISBN 0-201-10715-5


• Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems, Elsevier, ISBN 1-55860-508-8

• Yoav Raz (1992): “The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource Managers Using Atomic Commitment.” Proceedings of the Eighteenth International Conference on Very Large Data Bases (VLDB), pp. 292–312, Vancouver, Canada, August 1992. (also DEC-TR 841, Digital Equipment Corporation, November 1990)

Chapter 13

Distributed graph coloring

A proper vertex coloring of the Petersen graph with 3 colors, the minimum number possible.

In graph theory, graph coloring is a special case of graph labeling; it is an assignment of labels traditionally called “colors” to elements of a graph subject to certain constraints. In its simplest form, it is a way of coloring the vertices of a graph such that no two adjacent vertices share the same color; this is called a vertex coloring. Similarly, an edge coloring assigns a color to each edge so that no two adjacent edges share the same color, and a face coloring of a planar graph assigns a color to each face or region so that no two faces that share a boundary have the same color.

Vertex coloring is the starting point of the subject, and other coloring problems can be transformed into a vertex version. For example, an edge coloring of a graph is just a vertex coloring of its line graph, and a face coloring of a plane graph is just a vertex coloring of its dual. However, non-vertex coloring problems are often stated and studied as is. That is partly for perspective, and partly because some problems are best studied in non-vertex form, as for instance is edge coloring.

The convention of using colors originates from coloring the countries of a map, where each face is literally colored. This was generalized to coloring the faces of a graph embedded in the plane. By planar duality it became coloring the vertices, and in this form it generalizes to all graphs. In mathematical and computer representations, it is typical to use the first few positive or nonnegative integers as the “colors”. In general, one can use any finite set as the “color set”. The nature of the coloring problem depends on the number of colors but not on what they are.

Graph coloring enjoys many practical applications as well as theoretical challenges. Beside the classical types of problems, different limitations can also be set on the graph, or on the way a color is assigned, or even on the color itself. It has even reached popularity with the general public in the form of the popular number puzzle Sudoku. Graph coloring is still a very active field of research.

Note: Many terms used in this article are defined in Glossary of graph theory.

13.1 History

See also: History of the four color theorem and History of graph theory

The first results about graph coloring deal almost exclusively with planar graphs in the form of the coloring of maps. While trying to color a map of the counties of England, Francis Guthrie postulated the four color conjecture, noting that four colors were sufficient to color the map so that no regions sharing a common border received the same color. Guthrie’s brother passed on the question to his mathematics teacher Augustus de Morgan at University College, who mentioned it in a letter to William Hamilton in 1852. Arthur Cayley raised the problem at a meeting of the London Mathematical Society in 1879. The same year, Alfred Kempe published a paper that claimed to establish the result, and for a decade the four color problem was considered solved. For his accomplishment Kempe was elected a Fellow of the Royal Society and later President of the London Mathematical Society.[1]

In 1890, Heawood pointed out that Kempe’s argument was wrong. However, in that paper he proved the five color theorem, saying that every planar map can be colored with no more than five colors, using ideas of Kempe. In the following century, a vast amount of work and theories were developed to reduce the number of colors to four, until the four color theorem was finally proved in 1976 by Kenneth Appel and Wolfgang Haken. The proof went back to the ideas of Heawood and Kempe and largely disregarded the intervening developments.[2] The proof of the four color theorem is also noteworthy for being the first major computer-aided proof.

In 1912, George David Birkhoff introduced the chromatic polynomial to study the coloring problems, which was generalised to the Tutte polynomial by Tutte; both are important structures in algebraic graph theory. Kempe had already drawn attention to the general, non-planar case in 1879,[3] and many results on generalisations of planar graph coloring to surfaces of higher order followed in the early 20th century.

In 1960, Claude Berge formulated another conjecture about graph coloring, the strong perfect graph conjecture, originally motivated by an information-theoretic concept called the zero-error capacity of a graph introduced by Shannon. The conjecture remained unresolved for 40 years, until it was established as the celebrated strong perfect graph theorem by Chudnovsky, Robertson, Seymour, and Thomas in 2002.

Graph coloring has been studied as an algorithmic problem since the early 1970s: the chromatic number problem is one of Karp’s 21 NP-complete problems from 1972, and at approximately the same time various exponential-time algorithms were developed based on backtracking and on the deletion–contraction recurrence of Zykov (1949). One of the major applications of graph coloring, register allocation in compilers, was introduced in 1981.

This graph can be 3-colored in 12 different ways.

13.2 Definition and terminology

13.2.1 Vertex coloring

When used without any qualification, a coloring of a graph is almost always a proper vertex coloring, namely a labelling of the graph’s vertices with colors such that no two vertices sharing the same edge have the same color. Since a vertex with a loop (i.e. a connection directly back to itself) could never be properly colored, it is understood that graphs in this context are loopless.

The terminology of using colors for vertex labels goes back to map coloring. Labels like red and blue are only used when the number of colors is small, and normally it is understood that the labels are drawn from the integers {1, 2, 3, ...}. A coloring using at most k colors is called a (proper) k-coloring. The smallest number of colors needed to color a graph G is called its chromatic number, and is often denoted χ(G). Sometimes γ(G) is used, since χ(G) is also used to denote the Euler characteristic of a graph. A graph that can be assigned a (proper) k-coloring is k-colorable, and it is k-chromatic if its chromatic number is exactly k. A subset of vertices assigned to the same color is called a color class; every such class forms an independent set. Thus, a k-coloring is the same as a partition of the vertex set into k independent sets, and the terms k-partite and k-colorable have the same meaning.

13.2.2 Chromatic polynomial

Main article: Chromatic polynomial

The chromatic polynomial counts the number of ways a graph can be colored using no more than a given number of colors. For example, using three colors, the graph in the image to the right can be colored in 12 ways. With only two colors, it cannot be colored at all. With four colors, it can be colored in 24 + 4⋅12 = 72 ways: using all four colors, there are 4! = 24 valid colorings (every assignment of four colors to any 4-vertex graph is a proper coloring); and for every choice of three of the four colors, there are 12 valid 3-colorings. So, for the graph in the example, a table of the number of valid colorings starts like this:

Available colors       1    2    3    4
Number of colorings    0    0    12   72

The chromatic polynomial is a function P(G, t) that counts the number of t-colorings of G. As the name indicates, for a given G the function is indeed a polynomial in t. For the example graph, P(G, t) = t(t − 1)²(t − 2), and indeed P(G, 4) = 72. The chromatic polynomial includes at least as much information about the colorability of G as does the chromatic number. Indeed, χ is the smallest positive integer that is not a root of the chromatic polynomial

χ(G) = min{k : P (G, k) > 0}.
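A brute-force check of these numbers: assuming the example graph is a triangle with one pendant vertex (an assumption consistent with the stated polynomial, since that graph has P(G, t) = t(t − 1)²(t − 2)), counting proper colorings directly reproduces P(G, 3) = 12 and P(G, 4) = 72.

```python
from itertools import product

def count_colorings(vertices, edges, t):
    """Count proper colorings of (vertices, edges) using t colors by brute force."""
    return sum(
        all(c[u] != c[v] for u, v in edges)
        for c in (dict(zip(vertices, assignment))
                  for assignment in product(range(t), repeat=len(vertices)))
    )

# Assumed example graph: a triangle {a, b, c} with a pendant vertex d attached to c.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
for t in range(1, 5):
    print(t, count_colorings(V, E, t), t * (t - 1) ** 2 * (t - 2))  # counts match the polynomial
```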

13.2.3 Edge coloring

Main article: Edge coloring

An edge coloring of a graph is a proper coloring of the edges, meaning an assignment of colors to edges so that no vertex is incident to two edges of the same color. An edge coloring with k colors is called a k-edge-coloring and is equivalent to the problem of partitioning the edge set into k matchings. The smallest number of colors needed for an edge coloring of a graph G is the chromatic index, or edge chromatic number, χ′(G). A Tait coloring is a 3-edge coloring of a cubic graph. The four color theorem is equivalent to the assertion that every planar cubic bridgeless graph admits a Tait coloring.

13.2.4 Total coloring

Main article: Total coloring

Total coloring is a type of coloring on the vertices and edges of a graph. When used without any qualification, a total coloring is always assumed to be proper in the sense that no adjacent vertices, no adjacent edges, and no edge and its endvertices are assigned the same color. The total chromatic number χ″(G) of a graph G is the least number of colors needed in any total coloring of G.

13.2.5 Unlabeled coloring

An unlabeled coloring of a graph is an orbit of a coloring under the action of the automorphism group of the graph. If we interpret a coloring of a graph on d vertices as a vector in Z^d, the action of an automorphism is a permutation of the coefficients of the coloring. There are analogues of the chromatic polynomial which count the number of unlabeled colorings of a graph from a given finite color set.

13.3 Properties

13.3.1 Bounds on the chromatic number

Assigning distinct colors to distinct vertices always yields a proper coloring, so

1 ≤ χ(G) ≤ n.

The only graphs that can be 1-colored are edgeless graphs. A complete graph Kn of n vertices requires χ(Kn) = n colors. In an optimal coloring there must be at least one of the graph’s m edges between every pair of color classes, so

χ(G)(χ(G) − 1) ≤ 2m.

If G contains a clique of size k, then at least k colors are needed to color that clique; in other words, the chromatic number is at least the clique number:

χ(G) ≥ ω(G).

For interval graphs this bound is tight. The 2-colorable graphs are exactly the bipartite graphs, including trees and forests. By the four color theorem, every planar graph can be 4-colored. A greedy coloring shows that every graph can be colored with one more color than the maximum vertex degree,

χ(G) ≤ ∆(G) + 1.

Complete graphs have χ(G) = n and ∆(G) = n − 1 , and odd cycles have χ(G) = 3 and ∆(G) = 2 , so for these graphs this bound is best possible. In all other cases, the bound can be slightly improved; Brooks’ theorem[4] states that

Brooks’ theorem: χ(G) ≤ ∆(G) for a connected, simple graph G, unless G is a complete graph or an odd cycle.

13.3.2 Lower bounds on the chromatic number

Several lower bounds for the chromatic number have been discovered over the years:

Hoffman’s Bound: Let W be a real symmetric matrix such that Wi,j = 0 whenever (i, j) is not an edge in G . Define χW (G) = 1 − λmax(W )/λmin(W ) , where λmax(W ), λmin(W ) are the largest and smallest eigenvalues of W . Define χH (G) = maxW χW (G) , with W as above. Then:

χH (G) ≤ χ(G)

Vector Chromatic number: Let W be a positive semi-definite matrix such that Wi,j ≤ −1/(k −1) whenever (i, j) is an edge in G . Define χV (G) to be the least k for which such a matrix W exists. Then

χV (G) ≤ χ(G)

Lovász number: The Lovász number of the complementary graph is also a lower bound on the chromatic number:

ϑ(Ḡ) ≤ χ(G)

Fractional chromatic number: The fractional chromatic number of a graph is a lower bound on the chromatic number as well:

χf (G) ≤ χ(G)

These bounds are ordered as follows:

χH (G) ≤ χV (G) ≤ ϑ(Ḡ) ≤ χf (G) ≤ χ(G)

13.3.3 Graphs with high chromatic number

Graphs with large cliques have a high chromatic number, but the opposite is not true. The Grötzsch graph is an example of a 4-chromatic graph without a triangle, and the example can be generalised to the Mycielskians.

Mycielski’s Theorem (Alexander Zykov 1949, Jan Mycielski 1955): There exist triangle-free graphs with arbitrarily high chromatic number.

From Brooks’s theorem, graphs with high chromatic number must have high maximum degree. Another local prop- erty that leads to high chromatic number is the presence of a large clique. But colorability is not an entirely local phenomenon: A graph with high girth looks locally like a tree, because all cycles are long, but its chromatic number need not be 2:

Theorem (Erdős): There exist graphs of arbitrarily high girth and chromatic number.

13.3.4 Bounds on the chromatic index

An edge coloring of G is a vertex coloring of its line graph L(G) , and vice versa. Thus,

χ′(G) = χ(L(G)).

There is a strong relationship between edge colorability and the graph’s maximum degree ∆(G) . Since all edges incident to the same vertex need their own color, we have

χ′(G) ≥ ∆(G).

Moreover,

König’s theorem: χ′(G) = ∆(G) if G is bipartite.

In general, the relationship is even stronger than what Brooks’s theorem gives for vertex coloring:

Vizing’s Theorem: A graph of maximal degree ∆ has edge-chromatic number ∆ or ∆ + 1 .

13.3.5 Other properties

A graph has a k-coloring if and only if it has an acyclic orientation for which the longest path has length at most k; this is the Gallai–Hasse–Roy–Vitaver theorem (Nešetřil & Ossona de Mendez 2012). For planar graphs, vertex colorings are essentially dual to nowhere-zero flows. About infinite graphs, much less is known. The following are two of the few results about infinite graph coloring:

• If all finite subgraphs of an infinite graph G are k-colorable, then so is G, under the assumption of the axiom of choice. This is the de Bruijn–Erdős theorem of de Bruijn & Erdős (1951).

• If a graph admits a full n-coloring for every n ≥ n0, it admits an infinite full coloring (Fawcett 1978).

13.3.6 Open problems

The chromatic number of the plane, where two points are adjacent if they have unit distance, is unknown, although it is one of 4, 5, 6, or 7. Other open problems concerning the chromatic number of graphs include the Hadwiger conjecture stating that every graph with chromatic number k has a complete graph on k vertices as a minor, the Erdős–Faber–Lovász conjecture bounding the chromatic number of unions of complete graphs that have exactly

one vertex in common to each pair, and the Albertson conjecture that among k-chromatic graphs the complete graphs are the ones with smallest crossing number. When Birkhoff and Lewis introduced the chromatic polynomial in their attack on the four-color theorem, they con- jectured that for planar graphs G, the polynomial P (G, t) has no zeros in the region [4, ∞) . Although it is known that such a chromatic polynomial has no zeros in the region [5, ∞) and that P (G, 4) ≠ 0 , their conjecture is still unresolved. It also remains an unsolved problem to characterize graphs which have the same chromatic polynomial and to determine which polynomials are chromatic.

13.4 Algorithms

13.4.1 Polynomial time

Determining if a graph can be colored with 2 colors is equivalent to determining whether or not the graph is bipartite, and thus computable in linear time using breadth-first search. More generally, the chromatic number and a corresponding coloring of perfect graphs can be computed in polynomial time using semidefinite programming. Closed formulas for the chromatic polynomial are known for many classes of graphs, such as forests, chordal graphs, cycles, wheels, and ladders, so these can be evaluated in polynomial time. If the graph is planar and has low branchwidth (or is nonplanar but with a known branch decomposition), then it can be solved in polynomial time using dynamic programming. In general, the time required is polynomial in the graph size, but exponential in the branchwidth.

13.4.2 Exact algorithms

Brute-force search for a k-coloring considers each of the k^n assignments of k colors to n vertices and checks for each if it is legal. To compute the chromatic number and the chromatic polynomial, this procedure is used for every k = 1, ..., n − 1, impractical for all but the smallest input graphs. Using dynamic programming and a bound on the number of maximal independent sets, k-colorability can be decided in time and space O(2.445^n).[6] Using the principle of inclusion–exclusion and Yates’s algorithm for the fast zeta transform, k-colorability can be decided in time O(2^n n)[5] for any k. Faster algorithms are known for 3- and 4-colorability, which can be decided in time O(1.3289^n)[7] and O(1.7272^n),[8] respectively.
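A direct transcription of the brute-force approach (exponential time, only usable on tiny graphs):

```python
from itertools import product

def is_k_colorable(n, edges, k):
    """Decide k-colorability by brute force over all k**n assignments."""
    return any(
        all(colors[u] != colors[v] for u, v in edges)
        for colors in product(range(k), repeat=n)
    )

# A 5-cycle is not 2-colorable (odd cycle) but is 3-colorable.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(is_k_colorable(5, cycle5, 2), is_k_colorable(5, cycle5, 3))  # False True
```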

13.4.3 Contraction

The contraction G/uv of graph G is the graph obtained by identifying the vertices u and v, removing any edges between them, and replacing them with a single vertex w where any edges that were incident on u or v are redirected to w. This operation plays a major role in the analysis of graph coloring. The chromatic number satisfies the recurrence relation:

χ(G) = min{χ(G + uv), χ(G/uv)}

due to Zykov (1949), where u and v are nonadjacent vertices, G + uv is the graph with the edge uv added. Several algorithms are based on evaluating this recurrence, the resulting computation tree is sometimes called a Zykov tree. The running time is based on the heuristic for choosing the vertices u and v. The chromatic polynomial satisfies following recurrence relation

P (G − uv, k) = P (G/uv, k) + P (G, k)

where u and v are adjacent vertices and G − uv is the graph with the edge uv removed. P(G − uv, k) represents the number of possible proper colorings of the graph when u and v may have the same or different colors. The number of proper colorings therefore comes from the sum over two graphs: if the vertices u and v have different colors, we can as well consider the graph where u and v are adjacent; if u and v have the same color, we may as well consider the graph where u and v are contracted.

Tutte’s curiosity about which other graph properties satisfied this recurrence led him to discover a bivariate generalization of the chromatic polynomial, the Tutte polynomial. The expressions give rise to a recursive procedure, called the deletion–contraction algorithm, which forms the basis of many algorithms for graph coloring. The running time satisfies the same recurrence relation as the Fibonacci numbers, so in the worst case the algorithm runs in time within a polynomial factor of ((1 + √5)/2)^(n+m) = O(1.6180^(n+m)) for n vertices and m edges.[9] The analysis can be improved to within a polynomial factor of the number t(G) of spanning trees of the input graph.[10] In practice, branch and bound strategies and graph isomorphism rejection are employed to avoid some recursive calls; the running time depends on the heuristic used to pick the vertex pair.
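A minimal sketch of the deletion–contraction recursion, evaluating P(G, k) at integer points for a simple, loopless input graph:

```python
def chromatic_count(edges, n, k):
    """P(G, k) via deletion-contraction (simple, loopless input graph assumed):
    P(G, k) = P(G - uv, k) - P(G/uv, k) for any edge uv."""
    if not edges:
        return k ** n                       # edgeless graph on n vertices
    u, v = edges[0]
    deleted = edges[1:]                     # G - uv
    # G/uv: relabel v as u, then drop the self-loop from uv and any duplicate edges.
    contracted = set()
    for a, b in deleted:
        a = u if a == v else a
        b = u if b == v else b
        if a != b:
            contracted.add((min(a, b), max(a, b)))
    return chromatic_count(deleted, n, k) - chromatic_count(sorted(contracted), n - 1, k)

# Triangle with a pendant vertex (the running example): P(G,3)=12, P(G,4)=72.
E = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(chromatic_count(E, 4, 3), chromatic_count(E, 4, 4))  # 12 72
```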

13.4.4 Greedy coloring

Main article: Greedy coloring

The greedy algorithm considers the vertices in a specific order v1, …, vn and assigns to vi the smallest available color not used by vi’s neighbours among v1, …, vi−1, adding a fresh color if needed. The quality of the resulting coloring depends on the chosen ordering. There exists an ordering that leads to a greedy coloring with the optimal number of χ(G) colors. On the other hand, greedy colorings can be arbitrarily bad; for example, the crown graph on n vertices can be 2-colored, but has an ordering that leads to a greedy coloring with n/2 colors. For chordal graphs, and for special cases of chordal graphs such as interval graphs and indifference graphs, the greedy coloring algorithm can be used to find optimal colorings in polynomial time, by choosing the vertex ordering to be the reverse of a perfect elimination ordering for the graph. The perfectly orderable graphs generalize this property, but it is NP-hard to find a perfect ordering of these graphs.

If the vertices are ordered according to their degrees, the resulting greedy coloring uses at most max_i min{d(x_i) + 1, i} colors, at most one more than the graph’s maximum degree. This heuristic is sometimes called the Welsh–Powell algorithm.[11] Another heuristic due to Brélaz establishes the ordering dynamically while the algorithm proceeds, choosing next the vertex adjacent to the largest number of different colors.[12] Many other graph coloring heuristics are similarly based on greedy coloring for a specific static or dynamic strategy of ordering the vertices; these algorithms are sometimes called sequential coloring algorithms.
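A minimal sketch of greedy coloring with the degree-based (Welsh–Powell style) ordering; the adjacency-list encoding is illustrative:

```python
def greedy_coloring(adjacency, order=None):
    """Assign to each vertex the smallest color not used by its already-colored neighbours."""
    if order is None:
        # Welsh-Powell style ordering: highest degree first.
        order = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    color = {}
    for v in order:
        used = {color[u] for u in adjacency[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# 5-cycle: maximum degree 2, and the bound chi <= Delta + 1 = 3 is met (an odd cycle needs 3).
cycle5 = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = greedy_coloring(cycle5)
print(coloring, "colors used:", max(coloring.values()) + 1)
```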

13.4.5 Parallel and distributed algorithms

In the field of distributed algorithms, graph coloring is closely related to the problem of symmetry breaking. The current state-of-the-art randomized algorithms are faster for sufficiently large maximum degree Δ than deterministic algorithms. The fastest randomized algorithms employ the multi-trials technique by Schneider et al.[13]

In a symmetric graph, a deterministic distributed algorithm cannot find a proper vertex coloring. Some auxiliary information is needed in order to break symmetry. A standard assumption is that initially each node has a unique identifier, for example, from the set {1, 2, ..., n}. Put otherwise, we assume that we are given an n-coloring. The challenge is to reduce the number of colors from n to, e.g., Δ + 1. The more colors are employed, e.g. O(Δ) instead of Δ + 1, the fewer communication rounds are required.[13]

A straightforward distributed version of the greedy algorithm for (Δ + 1)-coloring requires Θ(n) communication rounds in the worst case − information may need to be propagated from one side of the network to another side. The simplest interesting case is an n-cycle. Richard Cole and Uzi Vishkin[14] show that there is a distributed algorithm that reduces the number of colors from n to O(log n) in one synchronous communication step. By iterating the same procedure, it is possible to obtain a 3-coloring of an n-cycle in O(log* n) communication steps (assuming that we have unique node identifiers). The function log*, the iterated logarithm, is an extremely slowly growing function, “almost constant”. Hence the result by Cole and Vishkin raised the question of whether there is a constant-time distributed algorithm for 3-coloring an n-cycle. Linial (1992) showed that this is not possible: any deterministic distributed algorithm requires Ω(log* n) communication steps to reduce an n-coloring to a 3-coloring in an n-cycle.

The technique by Cole and Vishkin can be applied in arbitrary bounded-degree graphs as well; the running time is poly(Δ) + O(log* n).[15] The technique was extended to unit disk graphs by Schneider et al.[16] The fastest deterministic algorithms for (Δ + 1)-coloring for small Δ are due to Leonid Barenboim, Michael Elkin and Fabian Kuhn.[17] The algorithm by Barenboim et al. runs in time O(Δ) + log*(n)/2, which is optimal in terms of n since the constant factor 1/2 cannot be improved due to Linial’s lower bound. Panconesi et al.[18] use network decompositions to compute a (Δ + 1)-coloring in time 2^(O(√(log n))).

The problem of edge coloring has also been studied in the distributed model. Panconesi & Rizzi (2001) achieve a (2Δ − 1)-coloring in O(Δ + log* n) time in this model. The lower bound for distributed vertex coloring due to Linial (1992) applies to the distributed edge coloring problem as well.
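A minimal sketch of one Cole–Vishkin color-reduction round on a directed cycle, using unique node identifiers as the initial proper coloring (reducing all the way to 3 colors requires an additional final phase not shown here):

```python
def cole_vishkin_step(colors, successor):
    """One synchronous Cole-Vishkin round on a directed cycle: each node finds the
    position i of the lowest bit where its color differs from its successor's color,
    and recolors itself to 2*i + (its own bit at position i)."""
    new = {}
    for v, c in colors.items():
        s = colors[successor[v]]
        diff = c ^ s                              # a proper coloring guarantees diff != 0
        i = (diff & -diff).bit_length() - 1       # index of the lowest differing bit
        new[v] = 2 * i + ((c >> i) & 1)
    return new

# A directed 8-cycle; unique node identifiers serve as the initial (proper) n-coloring.
n = 8
succ = {v: (v + 1) % n for v in range(n)}
colors = {v: v for v in range(n)}
while len(set(colors.values())) > 6:              # iterating reaches O(1) colors in O(log* n) rounds
    colors = cole_vishkin_step(colors, succ)
    print("colors now:", sorted(set(colors.values())))
```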

13.4.6 Decentralized algorithms

Decentralized algorithms are ones where no message passing is allowed (in contrast to distributed algorithms where local message passing takes place), and efficient decentralized algorithms exist that will color a graph if a proper coloring exists. These assume that a vertex is able to sense whether any of its neighbors are using the same color as the vertex itself, i.e., whether a local conflict exists. This is a mild assumption in many applications, e.g. in wireless channel allocation it is usually reasonable to assume that a station will be able to detect whether other interfering transmitters are using the same channel (e.g. by measuring the SINR). This sensing information is sufficient to allow algorithms based on learning automata to find a proper graph coloring with probability one; see, e.g., Leith (2006) and Duffy (2008).

13.4.7 Computational complexity

Graph coloring is computationally hard. It is NP-complete to decide if a given graph admits a k-coloring for a given k except for the cases k = 1 and k = 2. In particular, it is NP-hard to compute the chromatic number.[19] The 3-coloring problem remains NP-complete even on planar graphs of degree 4.[20] However, k-coloring of a planar graph is in P for every k > 3, since every planar graph has a 4-coloring (and thus also a k-coloring for every k ≥ 4) by the four color theorem.

The best known approximation algorithm computes a coloring of size at most within a factor O(n (log log n)^2 / (log n)^3) of the chromatic number.[21] For all ε > 0, approximating the chromatic number within n^(1−ε) is NP-hard.[22] It is also NP-hard to color a 3-colorable graph with 4 colors[23] and a k-colorable graph with k^((log k)/25) colors for sufficiently large constant k.[24]

Computing the coefficients of the chromatic polynomial is #P-hard. In fact, even computing the value of χ(G, k) is #P-hard at any rational point k except for k = 1 and k = 2.[25] There is no FPRAS for evaluating the chromatic polynomial at any rational point k ≥ 1.5 except for k = 2 unless NP = RP.[26]

For edge coloring, the proof of Vizing’s result gives an algorithm that uses at most Δ + 1 colors. However, deciding between the two candidate values for the edge chromatic number is NP-complete.[27] In terms of approximation algorithms, Vizing’s algorithm shows that the edge chromatic number can be approximated to within 4/3, and the hardness result shows that no (4/3 − ε)-algorithm exists for any ε > 0 unless P = NP. These are among the oldest results in the literature of approximation algorithms, even though neither paper makes explicit use of that notion.[28]

13.5 Applications

13.5.1 Scheduling

Vertex coloring models a number of scheduling problems.[29] In the cleanest form, a given set of jobs needs to be assigned to time slots, each job requiring one such slot. Jobs can be scheduled in any order, but pairs of jobs may be in conflict in the sense that they may not be assigned to the same time slot, for example because they both rely on a shared resource. The corresponding graph contains a vertex for every job and an edge for every conflicting pair of jobs. The chromatic number of the graph is exactly the minimum makespan, the optimal time to finish all jobs without conflicts.

Details of the scheduling problem define the structure of the graph. For example, when assigning aircraft to flights, the resulting conflict graph is an interval graph, so the coloring problem can be solved efficiently. In bandwidth allocation to radio stations, the resulting conflict graph is a unit disk graph, so the coloring problem is 3-approximable.
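As a small illustration (hypothetical jobs and names; a greedy coloring only gives an upper bound on the chromatic number, i.e. on the minimum number of time slots), the sketch below builds the conflict graph as an adjacency map and assigns each job the smallest slot not already taken by a conflicting job.

```python
def schedule_jobs(conflicts):
    """Greedy assignment of jobs to time slots: two conflicting jobs never
    share a slot.  `conflicts` maps each job to the set of jobs it conflicts
    with.  The number of distinct slots used is an upper bound on the
    chromatic number of the conflict graph, i.e. on the minimum makespan."""
    slot = {}
    for job in conflicts:                               # fixed but arbitrary order
        taken = {slot[other] for other in conflicts[job] if other in slot}
        slot[job] = next(s for s in range(len(conflicts)) if s not in taken)
    return slot

# Hypothetical jobs A-E; an entry means "cannot run in the same time slot".
conflicts = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
print(schedule_jobs(conflicts))   # {'A': 0, 'B': 1, 'C': 2, 'D': 0, 'E': 1}
```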

13.5.2 Register allocation

Main article: Register allocation

A compiler is a computer program that translates one computer language into another. To improve the execution time of the resulting code, one of the techniques of compiler optimization is register allocation, where the most frequently used values of the compiled program are kept in the fast processor registers. Ideally, values are assigned to registers so that they can all reside in the registers when they are used. The textbook approach to this problem is to model it as a graph coloring problem.[30] The compiler constructs an interference graph, where vertices are variables and an edge connects two vertices if they are needed at the same time. If the graph can be colored with k colors then any set of variables needed at the same time can be stored in at most k registers.
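As a toy sketch of this modelling step (the live ranges, the greedy most-constrained-first heuristic and all names are illustrative simplifications, not Chaitin's actual allocator): variables whose live ranges overlap interfere, and any variable that cannot be given one of the k register "colors" is marked for spilling to memory.

```python
def allocate_registers(live_ranges, k):
    """Toy graph-coloring register allocator.  `live_ranges` maps each
    variable to the (start, end) instruction interval during which it is
    live; two variables interfere when their intervals overlap."""
    names = list(live_ranges)
    interferes = {
        v: {u for u in names
            if u != v
            and live_ranges[u][0] < live_ranges[v][1]
            and live_ranges[v][0] < live_ranges[u][1]}
        for v in names
    }
    assignment = {}
    for v in sorted(names, key=lambda x: -len(interferes[x])):  # most constrained first
        used = {assignment[u] for u in interferes[v] if u in assignment}
        free = [r for r in range(k) if r not in used]
        assignment[v] = free[0] if free else "spill"
    return assignment

# Hypothetical live ranges (in instruction numbers) and two registers:
# a, b and c mutually interfere, so with k = 2 one of them is spilled.
print(allocate_registers({"a": (0, 4), "b": (1, 3), "c": (2, 6), "d": (5, 8)}, k=2))
```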

13.5.3 Other applications

The problem of coloring a graph has found a number of applications, including pattern matching. The recreational puzzle Sudoku can be seen as completing a 9-coloring on a specific graph with 81 vertices.

13.6 Other colorings

13.6.1 Ramsey theory

Main article: Ramsey theory

An important class of improper coloring problems is studied in Ramsey theory, where the graph’s edges are assigned colors, and there is no restriction on the colors of incident edges. A simple example is the theorem on friends and strangers, which states that in any coloring of the edges of K6, the complete graph on six vertices, there will be a monochromatic triangle; this is often illustrated by saying that any group of six people either has three mutual strangers or three mutual acquaintances. Ramsey theory is concerned with generalisations of this idea to seek regularity amid disorder, finding general conditions for the existence of monochromatic subgraphs with given structure.

13.6.2 Other colorings

Coloring can also be considered for signed graphs and gain graphs.

13.7 See also

• Edge coloring

• Circular coloring

• Critical graph

• Graph homomorphism

• Hajós construction

• Mathematics of Sudoku

• Multipartite graph

• Uniquely colorable graph

• Graph coloring game

13.8 Notes

[1] M. Kubale, History of graph coloring, in Kubale (2004)

[2] van Lint & Wilson (2001, Chap. 33)

[3] Jensen & Toft (1995), p. 2

[4] Brooks (1941)

[5] Björklund, Husfeldt & Koivisto (2009)

[6] Lawler (1976)

[7] Beigel & Eppstein (2005)

[8] Fomin, Gaspers & Saurabh (2007)

[9] Wilf (1986)

[10] Sekine, Imai & Tani (1995)

[11] Welsh & Powell (1967)

[12] Brélaz (1979)

[13] Schneider (2010)

[14] Cole & Vishkin (1986), see also Cormen, Leiserson & Rivest (1990, Section 30.5)

[15] Goldberg, Plotkin & Shannon (1988)

[16] Schneider (2008)

[17] Barenboim & Elkin (2009); Kuhn (2009)

[18] Panconesi (1995)

[19] Garey, Johnson & Stockmeyer (1974); Garey & Johnson (1979).

[20] Dailey (1980)

[21] Halldórsson (1993)

[22] Zuckerman (2007)

[23] Guruswami & Khanna (2000)

[24] Khot (2001)

[25] Jaeger, Vertigan & Welsh (1990)

[26] Goldberg & Jerrum (2008)

[27] Holyer (1981)

[28] Crescenzi & Kann (1998)

[29] Marx (2004)

[30] Chaitin (1982)

13.9 References

• Barenboim, L.; Elkin, M. (2009), “Distributed (Δ + 1)-coloring in linear (in Δ) time”, Proceedings of the 41st Symposium on Theory of Computing, pp. 111–120, doi:10.1145/1536414.1536432, ISBN 978-1-60558-506-2

• Panconesi, A.; Srinivasan, A. (1996), “On the complexity of distributed network decomposition”, Journal of Algorithms 20

• Schneider, J. (2010), “A new technique for distributed symmetry breaking” (PDF), Proceedings of the Symposium on Principles of Distributed Computing

• Schneider, J. (2008), “A log-star distributed maximal independent set algorithm for growth-bounded graphs” (PDF), Proceedings of the Symposium on Principles of Distributed Computing

• Beigel, R.; Eppstein, D. (2005), “3-coloring in time O(1.3289^n)”, Journal of Algorithms 54 (2): 168–204, doi:10.1016/j.jalgor.2004.06.008

• Björklund, A.; Husfeldt, T.; Koivisto, M. (2009), “Set partitioning via inclusion–exclusion”, SIAM Journal on Computing 39 (2): 546–563, doi:10.1137/070683933

• Brélaz, D. (1979), “New methods to color the vertices of a graph”, Communications of the ACM 22 (4): 251– 256, doi:10.1145/359094.359101

• Brooks, R. L.; Tutte, W. T. (1941), “On colouring the nodes of a network”, Proceedings of the Cambridge Philosophical Society 37 (2): 194–197, doi:10.1017/S030500410002168X

• de Bruijn, N. G.; Erdős, P. (1951), “A colour problem for infinite graphs and a problem in the theory of relations” (PDF), Nederl. Akad. Wetensch. Proc. Ser. A 54: 371–373 (= Indag. Math. 13)

• Byskov, J.M. (2004), “Enumerating maximal independent sets with applications to graph colouring”, Operations Research Letters 32 (6): 547–556, doi:10.1016/j.orl.2004.03.002

• Chaitin, G. J. (1982), “Register allocation & spilling via graph colouring”, Proc. 1982 SIGPLAN Symposium on Compiler Construction, pp. 98–105, doi:10.1145/800230.806984, ISBN 0-89791-074-5

• Cole, R.; Vishkin, U. (1986), “Deterministic coin tossing with applications to optimal parallel list ranking”, Information and Control 70 (1): 32–53, doi:10.1016/S0019-9958(86)80023-7

• Cormen, T. H.; Leiserson, C. E.; Rivest, R. L. (1990), Introduction to Algorithms (1st ed.), The MIT Press

• Dailey, D. P. (1980), “Uniqueness of colorability and colorability of planar 4-regular graphs are NP-complete”, Discrete Mathematics 30 (3): 289–293, doi:10.1016/0012-365X(80)90236-8

• Duffy, K.; O'Connell, N.; Sapozhnikov, A. (2008), “Complexity analysis of a decentralised graph colouring algorithm” (PDF), Information Processing Letters 107 (2): 60–63, doi:10.1016/j.ipl.2008.01.002

• Fawcett, B. W. (1978), “On infinite full colourings of graphs”, Can. J. Math. XXX: 455–457, doi:10.4153/cjm-1978-039-8

• Fomin, F.V.; Gaspers, S.; Saurabh, S. (2007), “Improved Exact Algorithms for Counting 3- and 4-Colorings”, Proc. 13th Annual International Conference, COCOON 2007, Lecture Notes in Computer Science 4598, Springer, pp. 65–74, doi:10.1007/978-3-540-73545-8_9, ISBN 978-3-540-73544-1

• Garey, M. R.; Johnson, D. S. (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, ISBN 0-7167-1045-5

• Garey, M. R.; Johnson, D. S.; Stockmeyer, L. (1974), “Some simplified NP-complete problems”, Proceedings of the Sixth Annual ACM Symposium on Theory of Computing, pp. 47–63, doi:10.1145/800119.803884

• Goldberg, L. A.; Jerrum, M. (July 2008), “Inapproximability of the Tutte polynomial”, Information and Computation 206 (7): 908–929, doi:10.1016/j.ic.2008.04.003

• Goldberg, A. V.; Plotkin, S. A.; Shannon, G. E. (1988), “Parallel symmetry-breaking in sparse graphs”, SIAM Journal on Discrete Mathematics 1 (4): 434–446, doi:10.1137/0401044

• Guruswami, V.; Khanna, S. (2000), “On the hardness of 4-coloring a 3-colorable graph”, Proceedings of the 15th Annual IEEE Conference on Computational Complexity, pp. 188–197, doi:10.1109/CCC.2000.856749, ISBN 0-7695-0674-7

• Halldórsson, M. M. (1993), “A still better performance guarantee for approximate graph coloring”, Information Processing Letters 45: 19–23, doi:10.1016/0020-0190(93)90246-6

• Holyer, I. (1981), “The NP-completeness of edge-coloring”, SIAM Journal on Computing 10 (4): 718–720, doi:10.1137/0210055

• Crescenzi, P.; Kann, V. (December 1998), “How to find the best approximation results — a follow-up to Garey and Johnson”, ACM SIGACT News 29 (4): 90, doi:10.1145/306198.306210

• Jaeger, F.; Vertigan, D. L.; Welsh, D. J. A. (1990), “On the computational complexity of the Jones and Tutte polynomials”, Mathematical Proceedings of the Cambridge Philosophical Society 108: 35–53, doi:10.1017/S0305004100068936

• Jensen, T. R.; Toft, B. (1995), Graph Coloring Problems, Wiley-Interscience, New York, ISBN 0-471-02865-7

• Khot, S. (2001), “Improved inapproximability results for MaxClique, chromatic number and approximate graph coloring”, Proc. 42nd Annual Symposium on Foundations of Computer Science, pp. 600–609, doi:10.1109/SFCS.2001.959936, ISBN 0-7695-1116-3

• Kubale, M. (2004), Graph Colorings, American Mathematical Society, ISBN 0-8218-3458-4

• Kuhn, F. (2009), “Weak graph colorings: distributed algorithms and applications”, Proceedings of the 21st Symposium on Parallelism in Algorithms and Architectures, pp. 138–144, doi:10.1145/1583991.1584032, ISBN 978-1-60558-606-9

• Lawler, E.L. (1976), “A note on the complexity of the chromatic number problem”, Information Processing Letters 5 (3): 66–67, doi:10.1016/0020-0190(76)90065-X

• Leith, D.J.; Clifford, P. (2006), “A Self-Managed Distributed Channel Selection Algorithm for WLAN”, Proc. RAWNET 2006, Boston, MA (PDF)

• Linial, N. (1992), “Locality in distributed graph algorithms”, SIAM Journal on Computing 21 (1): 193–201, doi:10.1137/0221015

• van Lint, J. H.; Wilson, R. M. (2001), A Course in Combinatorics (2nd ed.), Cambridge University Press, ISBN 0-521-80340-3

• Marx, Dániel (2004), “Graph colouring problems and their applications in scheduling”, Periodica Polytechnica, Electrical Engineering 48 (1–2), pp. 11–16, CiteSeerX: 10.1.1.95.4268

• Mycielski, J. (1955), “Sur le coloriage des graphes” (PDF), Colloq. Math. 3: 161–162.

• Nešetřil, Jaroslav; Ossona de Mendez, Patrice (2012), “Theorem 3.13”, Sparsity: Graphs, Structures, and Algorithms, Algorithms and Combinatorics 28, Heidelberg: Springer, p. 42, doi:10.1007/978-3-642-27875-4, ISBN 978-3-642-27874-7, MR 2920058.

• Panconesi, Alessandro; Rizzi, Romeo (2001), “Some simple distributed algorithms for sparse networks”, Distributed Computing (Berlin, New York: Springer-Verlag) 14 (2): 97–100, doi:10.1007/PL00008932, ISSN 0178-2770

• Sekine, K.; Imai, H.; Tani, S. (1995), “Computing the Tutte polynomial of a graph of moderate size”, Proc. 6th International Symposium on Algorithms and Computation (ISAAC 1995), Lecture Notes in Computer Science 1004, Springer, pp. 224–233, doi:10.1007/BFb0015427, ISBN 3-540-60573-8

• Welsh, D. J. A.; Powell, M. B. (1967), “An upper bound for the chromatic number of a graph and its application to timetabling problems”, The Computer Journal 10 (1): 85–86, doi:10.1093/comjnl/10.1.85

• West, D. B. (1996), Introduction to Graph Theory, Prentice-Hall, ISBN 0-13-227828-6

• Wilf, H. S. (1986), Algorithms and Complexity, Prentice–Hall

• Zuckerman, D. (2007), “Linear degree extractors and the inapproximability of Max Clique and Chromatic Number”, Theory of Computing 3: 103–128, doi:10.4086/toc.2007.v003a006

• Zykov, A. A. (1949), "О некоторых свойствах линейных комплексов (On some properties of linear complexes)", Math. Sbornik. (in Russian), 24(66) (2): 163–188

• Jensen, Tommy R.; Toft, Bjarne (1995), Graph Coloring Problems, John Wiley & Sons, ISBN 9780471028659

• Normann, Per (2014), Parallel Graph Coloring, DIVA, ISSN 1401-5757

13.10 External links

• Graph Coloring Page by Joseph Culberson (graph coloring programs)

• CoLoRaTiOn by Jim Andrews and Mike Fellows is a graph coloring puzzle

• Links to Graph Coloring source codes

• Code for efficiently computing Tutte, Chromatic and Flow Polynomials by Gary Haggard, David J. Pearce and Gordon Royle

All nonisomorphic graphs on 3 vertices and their chromatic polynomials. The empty graph E3 (red) admits a 1-coloring, the others admit no such colorings. The green graph admits 12 colorings with 3 colors.


Two greedy colorings of the same graph using different vertex orders. The right example generalises to 2-colorable graphs with n vertices, where the greedy algorithm expends n/2 colors.

Chapter 14

Embarrassingly parallel

In parallel computing, an embarrassingly parallel workload, or embarrassingly parallel problem, is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.[1] Embarrassingly parallel problems (also called “perfectly parallel” or “pleasingly parallel”) tend to require little or no communication of results between tasks, and are thus different from distributed computing problems that require communication between tasks, especially communication of intermediate results. They are easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster. They are thus well suited to large, Internet-based distributed platforms such as BOINC, and do not suffer from parallel slowdown. The diametric opposite of embarrassingly parallel problems are inherently serial problems, which cannot be parallelized at all. A common example of an embarrassingly parallel problem lies within graphics processing units (GPUs) for the task of 3D projection, where each pixel on the screen may be rendered independently.
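As a minimal illustration (not taken from any of the cited sources; the function and parameter names are ours), the following sketch distributes the per-point work of a small Mandelbrot computation over a process pool. Because each point is rendered independently, the tasks need no communication at all, which is what makes the workload embarrassingly parallel.

```python
from multiprocessing import Pool

def escape_time(c, max_iter=100):
    """Mandelbrot escape time for one point -- depends only on its own input."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter

if __name__ == "__main__":
    points = [complex(re / 50 - 2, im / 50 - 1) for re in range(100) for im in range(100)]
    with Pool() as pool:                        # one worker per CPU core by default
        values = pool.map(escape_time, points)  # no communication between the tasks
    print(len(values), max(values))
```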

14.1 Etymology of the term

The genesis of the phrase “embarrassingly parallel” is not known; it is a comment on the ease of parallelizing such applications, and on the fact that it would be embarrassing for the programmer or compiler not to take advantage of such an obvious opportunity to improve performance. “Because so many important problems remain unsolved mainly due to their intrinsic computational complexity, it would be embarrassing not to develop parallel implementations of polynomial homotopy continuation methods.”[2] Contrastingly, the term may refer to parallelizing that is “embarrassingly easy”.[3] The phrase is first found in the literature in a 1986 book on multiprocessors by MATLAB's co-founder Cleve Moler,[4] who claims to have invented the term.[5]

An alternative term, “pleasingly parallel,” has gained some use, perhaps to avoid the negative connotations of embarrassment in favor of a positive reflection on the parallelizability of the problems. “Of course, there is nothing embarrassing about these programs at all.”[6]

14.2 Examples

Some examples of embarrassingly parallel problems include:

• Distributed relational database queries using distributed set processing

• Serving static files on a webserver to multiple users at once.

• The Mandelbrot set, Perlin noise and similar images, where each point can be calculated independently.

• Rendering of computer graphics. In computer animation, each frame may be rendered independently (see parallel rendering).

81 82 CHAPTER 14. EMBARRASSINGLY PARALLEL

• Brute-force searches in cryptography. Notable real-world examples include distributed.net and proof-of-work systems used in cryptocurrencies.

• BLAST searches in bioinformatics for multiple queries (but not for individual large queries) [7]

• Large scale face recognition that involves comparing thousands of arbitrary acquired faces (e.g. a security or surveillance video via closed-circuit television) with similarly large number of previously stored faces (e.g., a "rogues gallery" or similar watch list).[8]

• Computer simulations comparing many independent scenarios, such as climate models.

• Genetic algorithms and other evolutionary computation metaheuristics.

• Ensemble calculations of numerical weather prediction.

• Event simulation and reconstruction in particle physics.

• The Marching squares algorithm

• Sieving step of the quadratic sieve and the number field sieve.

• Tree growth step of the random forest machine learning technique.

• Discrete Fourier Transform where each harmonic is independently calculated.

14.3 Implementations

• In R (programming language) – The snow (Simple Network of Workstations) package implements a simple mechanism for using a collection of workstations or a Beowulf cluster for embarrassingly parallel computations.

14.4 See also

• Amdahl’s law defines the value P, which would be almost or exactly equal to 1 for an embarrassingly parallel problem.

• Map (parallel pattern)

14.5 References

[1] Section 1.4.4 of: Foster, Ian (1995). “Designing and Building Parallel Programs”. Addison–Wesley (ISBN 9780201575941). Archived from the original on 2011-02-21.

[2] Leykin, Anton; Verschelde, Jan; Zhuang, Yan (2006). “Parallel Homotopy Algorithms to Solve Polynomial Systems”. Proceedings of ICMS 2006.

[3] Matloff, Norman (2011). The Art of R Programming: A Tour of Statistical Software Design, p.347. No Starch. ISBN 9781593274108.

[4] Moler, Cleve (1986). Heath, Michael T., ed. “Matrix Computation on Distributed Memory Multiprocessors”. Hypercube Multiprocessors (Society for Industrial and Applied Mathematics, Philadelphia). ISBN 0898712092.

[5] The Intel hypercube part 2 reposted on Cleve’s Corner blog on The MathWorks website

[6] Kepner, Jeremy (2009). Parallel MATLAB for Multicore and Multinode Computers, p.12. SIAM. ISBN 9780898716733.

[7] SeqAnswers forum

[8] How we made our face recognizer 25 times faster (developer blog post)

14.6 External links

• Embarrassingly Parallel Computations, Engineering a Beowulf-style Compute Cluster

• "Star-P: High Productivity Parallel Computing" Chapter 15

Failure semantics

In distributed computing, failure semantics is used to describe and classify errors that distributed systems can experience.[1][2]

15.1 Types of errors

A list of types of errors that can occur:

• An omission error is when one or more responses fail.

• A crash error is when nothing happens. A crash is a special case of omission in which all responses fail.

• A timing error is when one or more responses arrive outside the specified time interval. Timing errors can be early or late; an omission error is a timing error in which a response has an infinite timing error.

• An arbitrary error is any error (i.e., a wrong value or a timing error).

When a client uses a server, it can cope with different types of errors from the server:

• If it can manage a crash at the server, it is said to assume the server to have crash failure semantics.

• If it can manage a service omission, it is said to assume the server to have omission failure semantics.

The failure semantics are the types of errors that are expected to appear. Should another type of error appear, it will lead to a service failure, because it cannot be managed.

15.2 References

[1] Flaviu Cristian, Understanding Fault-Tolerant Distributed Systems

[2] Arno Puder; Kay Romer; Frank Pilhofer (2005). Distributed Systems Architecture. Morgan Kaufmann. ISBN 1558606483., pp 14–16.

Chapter 16

Fallacies of distributed computing

The fallacies of distributed computing are a set of assumptions that L. Peter Deutsch and others at Sun Microsystems originally asserted programmers new to distributed applications invariably make. These assumptions ultimately prove false, resulting either in the failure of the system, a substantial reduction in system scope, or in large, unplanned expenses required to redesign the system to meet its original goals.

16.1 The fallacies

The fallacies are summarized below:[1]

1. The network is reliable.

2. Latency is zero.

3. Bandwidth is infinite.

4. The network is secure.

5. Topology doesn't change.

6. There is one administrator.

7. Transport cost is zero.

8. The network is homogeneous.

16.2 Effects of the fallacies

1. Ignorance of network latency, and of the packet loss it can cause, induces application- and transport-layer developers to allow unbounded traffic, greatly increasing dropped packets and wasting bandwidth.

2. Complacency regarding network security results in being blindsided by malicious users and programs that continually adapt to security measures.[2]

3. Multiple administrators, as with subnets for rival companies, may institute conflicting policies of which senders of network traffic must be aware in order to complete their desired paths.

4. The “hidden” costs of building and maintaining a network or subnet are non-negligible and must consequently be noted in budgets to avoid vast shortfalls.

5. Ignorance of bandwidth limits on the part of traffic senders can result in bottlenecks over frequency-multiplexed media.


16.3 History

The list of fallacies generally came about at Sun Microsystems. L. Peter Deutsch, one of the original Sun "Fellows", is credited with penning the first seven fallacies in 1994; however, Bill Joy and Tom Lyon had already identified the first four as “The Fallacies of Networked Computing”[3] (the article claims “Dave Lyon”, but this is considered a mistake). Around 1997, James Gosling, another Sun Fellow and the inventor of Java, added the eighth fallacy.[3]

16.4 See also

• Distributed computing

• Infinite bandwidth zero latency

• RISC vs CISC debate

• fine vs coarse grained SOA

16.5 References

[1] “The Eight Fallacies of Distributed Computing”.

[2] “Malware Defensive Techniques Will Evolve as Security Arms Race Continues”.

[3] “Deutsch’s Fallacies, 10 Years After”.

16.6 External links

• The Eight Fallacies of Distributed Computing

• Fallacies of Distributed Computing Explained by Arnon Rotem-Gal-Oz

Chapter 17

Global concurrency control

Global concurrency control typically pertains to the concurrency control of a system comprising several components, each with its own concurrency control. The overall concurrency control of the whole system, the global concurrency control, is determined by the concurrency control of its components (modules); in this case the term modular concurrency control is also used.

In many cases a system may be distributed over a communication network. In this case we deal with distributed concurrency control of the system, and the two terms sometimes overlap. However, distributed concurrency control typically relates to a case where the distributed system’s components do not each have concurrency control of their own, but rather are involved with a concurrency control mechanism that spans several components in order to operate, as is typical in a distributed database.

In database systems and transaction processing (transaction management), global concurrency control relates to the concurrency control of a multidatabase system (for example, a federated database; other examples are grid computing and cloud computing environments). It deals with the properties of the global schedule, which is the unified schedule of the multidatabase system, comprising all the individual schedules of the database systems and possibly other transactional objects in the system. A major goal for global concurrency control is global serializability (or modular serializability). The problem of achieving global serializability in a heterogeneous environment had been open for many years, until an effective solution based on commitment ordering (CO) was proposed (see Global serializability).

Global concurrency control also deals with relaxed forms of global serializability, which compromise global serializability (and in many applications also correctness, and thus are avoided there). While local (to a database system) relaxed serializability methods compromise serializability for performance gain (utilized when the application allows), it is unclear that the various proposed relaxed global serializability methods provide any performance gain over CO, which guarantees global serializability.

17.1 See also

• Concurrency control

• Global serializability

• Commitment ordering

• Distributed concurrency control

Chapter 18

Happened-before

In computer science, the happened-before relation (denoted: → ) is a relation between the result of two events, such that if one event should happen before another event, the result must reflect that, even if those events are in reality executed out of order (usually to optimize program flow). This involves ordering events based on the potential causal relationship of pairs of events in a concurrent system, especially asynchronous distributed systems. It was formulated by Leslie Lamport.[1]

In Java specifically, a happens-before relationship is a guarantee that memory written to by statement A is visible to statement B, that is, that statement A completes its write before statement B starts its read.

The happened-before relation is formally defined as the least strict partial order on events such that:

• If events a and b occur on the same process, a → b if the occurrence of event a preceded the occurrence of event b.

• If event a is the sending of a message and event b is the reception of the message sent in event a, then a → b.

If there are other causal relationships between events in a given system, such as between the creation of a process and its first event, these relationships are also added to the definition. Like all strict partial orders, the happened-before relation is transitive, irreflexive and antisymmetric, i.e.:

• ∀a, b, c: if a → b and b → c, then a → c (transitivity);

• ∀a: a ↛ a (irreflexivity);

• ∀a, b: if a → b ∧ b → a, then a = b (antisymmetry).

Because the happened-before relation is both irreflexive and antisymmetric, it follows that: if a → b then b ↛ a . The processes that make up a distributed system have no knowledge of the happened-before relation unless they use a logical clock, like a Lamport clock or a vector clock. This allows one to design algorithms for mutual exclusion, and tasks like debugging or optimising distributed systems.
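As an illustration of how such a logical clock can be made consistent with happened-before, here is a minimal Lamport-clock sketch (class and method names are illustrative): a process increments its counter on every local and send event, and on receipt it jumps above the timestamp carried by the message, so a → b implies C(a) < C(b) (the converse does not hold).

```python
class LamportClock:
    """Minimal Lamport logical clock: if event a happened-before event b,
    then timestamp(a) < timestamp(b)."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time                   # this timestamp travels with the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

# Two processes; a message from P1 to P2 orders the send before the receive.
p1, p2 = LamportClock(), LamportClock()
a = p1.local_event()       # a = 1
ts = p1.send()             # send event, timestamp 2
b = p2.receive(ts)         # receive event, timestamp 3 > 2, reflecting send → receive
print(a, ts, b)
```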

18.1 See also

• Java Memory Model

• Lamport timestamps

18.2 References

[1] Lamport, Leslie (1978). “Time, Clocks and the Ordering of Events in a Distributed System”, Communications of the ACM, 21(7), 558-565.

Chapter 19

Leader election

Not to be confused with Leadership election.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task is begun, all network nodes are either unaware which node will serve as the “leader” (or coordinator) of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader. The network nodes communicate among themselves in order to decide which of them will get into the “leader” state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the leader. The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token ring network in which the token has been lost. Leader election algorithms are designed to be economical in terms of total bytes transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira [1] for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the Dijkstra Prize for an influential paper in distributed computing. Many other algorithms were suggested for different kind of network graphs, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the leader election algorithm was suggested by Korach, Kutten, and Moran.[2]

19.1 Definition

The problem of leader election is for each processor eventually to decide whether it is a leader or not, subject to the constraint that exactly one processor decides that it is the leader.[3] An algorithm solves the leader election problem if:

1. States of processors are divided into elected and not-elected states. Once elected, a processor remains elected (and similarly if not elected).

2. In every execution, exactly one processor becomes elected and the rest determine that they are not elected.

A valid leader election algorithm must meet the following conditions:[4]

1. Termination: the algorithm should finish within a finite time once the leader is selected. In randomized approaches this condition is sometimes weakened (for example, requiring termination with probability 1).

2. Uniqueness: there is exactly one processor that considers itself as leader.

3. Agreement: all other processors know who the leader is.

An algorithm for leader election may vary in following aspects:[5]


• Communication mechanism: the processors are either synchronous, in which processes are synchronized by a clock signal, or asynchronous, where processes run at arbitrary speeds.

• Process names: whether processes have a unique identity or are indistinguishable (anonymous).

• Network topology: for instance, ring, acyclic graph or complete graph.

• Size of the network: the algorithm may or may not use knowledge of the number of processes in the system.

19.2 Algorithms

19.2.1 Leader election in rings

Ring network topology

A ring network is a connected-graph topology in which each node is connected to exactly two other nodes, i.e., for a graph with n nodes, there are exactly n edges connecting the nodes. A ring can be unidirectional, which means processors only communicate in one direction, or bidirectional, meaning processors may transmit and receive messages in both directions.

Anonymous rings

A ring is said to be anonymous if every processor is identical. More formally, the system has the same state machine for every processor.[6] There is no deterministic algorithm to elect a leader in anonymous rings, even when the size of the network is known to the processes.[7][8] This is due to the fact that there is no possibility of breaking symmetry in an anonymous ring if all processes run at the same speed. The state of processors after some steps only depends on the initial state of neighbouring nodes. So, because their states are identical and they execute the same procedures, in every round the same messages are sent by each processor. Therefore, each processor’s state also changes identically, and as a result, if one processor is elected as a leader, so are all the others.

Randomized (probabilistic) leader election

A common approach to solve the problem of leader election in anonymous rings is the use of probabilistic algorithms. In such approaches, generally processors assume some identities based on a probabilistic function and communicate it to the rest of the network. At the end, through the application of an algorithm, a leader is selected (with high probability).

Synchronous ring Itai and Rodeh[9] introduced an algorithm for a unidirectional ring with synchronized processes. They assume the size of the ring (number of nodes) is known to the processes. For a ring of size n, a ≤ n processors are active. Each active processor decides with probability 1/a whether to become a candidate. At the end of each phase, each processor calculates the number of candidates c, and if it is equal to 1, it becomes the leader. To determine the value of c, each candidate sends a token (pebble) at the start of the phase which is passed around the ring, returning after exactly n time units to its sender. Every processor determines c by counting the number of pebbles which passed through. This algorithm achieves leader election with an expected message complexity of O(n log n). A similar approach is also used in [10], in which a time-out mechanism is employed to detect deadlocks in the system. There are also algorithms for rings of special sizes such as prime size[11][12] and odd size.[13]
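The following is a centralised simulation of the phase structure just described (one plausible reading of it, with illustrative names: when several candidates remain, only they stay active for the next phase; the pebble circulation that lets every processor count the candidates is abstracted into a single counter).

```python
import random

def randomized_ring_election(n, seed=0):
    """Centralised simulation of the randomised ring election sketched above:
    in each phase every active processor becomes a candidate with probability
    1/a (a = number of active processors); after the pebbles have circulated
    for n steps everyone knows the candidate count c, and the election ends
    exactly when c == 1."""
    rng = random.Random(seed)
    active = list(range(n))
    phase = 0
    while True:
        phase += 1
        a = len(active)
        candidates = [p for p in active if rng.random() < 1.0 / a]
        c = len(candidates)           # every processor learns c by counting pebbles
        if c == 1:
            return candidates[0], phase
        if c > 1:
            active = candidates       # only candidates stay active for the next phase
        # if c == 0, the phase is repeated with the same active set

print(randomized_ring_election(8))    # (leader id, number of phases)
```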

Asynchronous ring In the case of asynchronous systems, the problem of leader election becomes more difficult. This is because of the uncertainty introduced by the arbitrary response time of processors due to the lack of a global clock. To tackle this problem, various approaches have been introduced. For instance, Itai and Rodeh extended their algorithm by adding a message buffer and wake-up messages to trigger the computation in processes.

Uniform algorithm In typical approaches to leader election, the size of the ring is assumed to be known to the processes. In the case of anonymous rings, without using an external entity, it is not possible to elect a leader. Even assuming an algorithm exists, the leader could not estimate the size of the ring; i.e., in any anonymous ring, there is a positive probability that an algorithm computes a wrong ring size.[14] To overcome this problem, Fischer and Jiang used a so-called leader oracle Ω? that each processor can ask whether there is a unique leader. They show that from some point upward, it is guaranteed to return the same answer to all processes.[15]

Rings with unique IDs

In one of the early works, Chang and Roberts[16] proposed a uniform algorithm in which the processor with the highest ID is selected as the leader. Each processor sends its ID in a clockwise direction. A process receiving a message compares it with its own ID; if the received ID is bigger, it passes the message on, otherwise it discards the message. They show that this algorithm uses O(n^2) messages in the worst case and O(n log n) in the average case. Hirschberg and Sinclair[17] improved this algorithm to O(n log n) message complexity by introducing a bidirectional message-passing scheme, allowing the processors to send messages in both directions.
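A centralised simulation of the Chang–Roberts rule sketched above is given below (function and variable names are illustrative; the detection rule, that a processor receiving its own identifier back knows it holds the highest ID and is the leader, is the standard completion of the description). It assumes unique identifiers, as in the text.

```python
def chang_roberts(ids):
    """Simulate the Chang-Roberts ring algorithm: each processor forwards
    clockwise the largest identifier it has seen, discards smaller ones, and
    the processor that receives its own identifier back becomes the leader."""
    n = len(ids)
    messages = 0
    leader = None
    in_transit = list(ids)                # the message each position is about to send
    while leader is None:
        next_transit = [None] * n
        for i, msg in enumerate(in_transit):
            if msg is None:
                continue
            j = (i + 1) % n               # clockwise neighbour
            messages += 1
            if msg == ids[j]:
                leader = ids[j]           # its own id came back: j is the leader
            elif msg > ids[j]:
                next_transit[j] = msg     # pass the larger id along
            # else: the message is discarded
        in_transit = next_transit
    return leader, messages

print(chang_roberts([3, 7, 2, 9, 5, 1]))  # leader 9; the message count depends on the order
```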

19.2.2 Leader Election in Mesh

The mesh is another popular form of network topology, especially in parallel systems, redundant memory systems and interconnection networks.[18] In a mesh structure, nodes are either corner (only two neighbours), border (only three neighbours) or interior (with four neighbours). The number of edges in a mesh of size a × b is m = 2ab − a − b.

Mesh network topology. Red nodes denote corners, blue border and gray interior.

Unoriented mesh

A typical algorithm to solve the leader election in an unoriented mesh is to only elect one of the four corner nodes as the leader. Since the corner nodes might not be aware of the state of other processes, the algorithm should first wake up the corner nodes. A leader can be elected as follows.[19]

1. Wake-up process: in which k nodes initiate the election process. Each initiator sends a wake-up message to all its neighbouring nodes. If a node is not an initiator, it simply forwards the messages to the other nodes. In this stage, at most 3n + k messages are sent.

2. Election process: the election in outer ring takes two stages at most with 6(a+b)−16 messages.

3. Termination: leader sends a terminating message to all nodes. This requires at most 2n messages.

The message complexity is at most 6(a+b) − 16, and if the mesh is square-shaped, O(√n).

Oriented Mesh

An oriented mesh is a special case where port numbers are compass labels, i.e., north, south, east and west. Leader election in an oriented mesh is trivial: we only need to nominate a corner, e.g. the corner whose only ports are “north” and “east”, and make sure that node knows it is the leader.

Torus

Torus network structure.

A special case of mesh architecture is a torus, which is a mesh with “wrap-around”. In this structure, every node has exactly 4 connecting edges. One approach to elect a leader in such a structure is known as electoral stages. Similar to procedures in ring structures, this method eliminates potential candidates in each stage until eventually one candidate node is left. This node becomes the leader and then notifies all other processes of termination.[20] This approach can be used to achieve a complexity of O(n). There are also more practical approaches for dealing with the presence of faulty links in the network.[21][22]

4x4 Hypercube network topology.

19.2.3 Election in Hypercubes

A hypercube H_k is a network consisting of n = 2^k nodes, each with degree k, and with O(n log n) edges. A similar electoral-stages approach as before can be used to solve the problem of leader election. In each stage two nodes (called duelists) compete and the winner is promoted to the next stage. This means that in each stage only half of the duelists enter the next stage. This procedure continues until only one duelist is left, and it becomes the leader. Once selected, it notifies all other processes. This algorithm requires O(n) messages. In the case of unoriented hypercubes, a similar approach can be used but with a higher message complexity of O(n log log n).[23]

19.2.4 Election in complete networks

Complete networks are structures in which all processes are connected to one another, i.e., the degree of each node is n-1, n being the size of the network. An optimal solution with O(n) message and space complexity is known.[24] In this algorithm, processes have the following states:

1. Dummy: nodes that do not participate in the leader election algorithm.

2. Passive: the initial state of processes before start.

3. Candidate: the status of nodes after waking up. The candidate nodes will be considered to become the leader.

To elect a leader, a virtual ring is considered in the network. All processors initially start in a passive state until they are woken up. Once the nodes are awake, they are candidates to become the leader. Based on a priority scheme, candidate nodes collaborate in the virtual ring. At some point, candidates become aware of the identity of candidates that precede them in the ring. The higher priority candidates ask the lower ones about their predecessors. The candidates with lower priority become dummies after replying to the candidates with higher priority. Based on this scheme, the highest priority candidate eventually knows that all nodes in the system are dummies except itself, at which point it knows it is the leader.

19.2.5 Universal leader election techniques

As the name implies, these algorithms are designed to be used in every form of process network, without any prior knowledge of the topology of the network or its properties, such as its size.[25]

Complete network structure.

Mega-Merger

This technique is in essence similar to finding a Minimum Spanning Tree (MST), in which the root of the tree becomes the leader. The basic idea in this method is that individual nodes merge with each other to form bigger structures. The result of this algorithm is a tree (a graph with no cycle) whose root is the leader of the entire system. The cost of the mega-merger method is O(m + n log n), where m is the number of edges and n is the number of nodes.

YO-YO

YO-YO is a minimum-finding algorithm consisting of two parts: a preprocessing phase and a series of iterations.[26] In the first phase, or setup, each node exchanges its id with all its neighbours and, based on the values, orients its incident edges. For instance, if node x has a smaller id than y, x orients towards y. If a node has a smaller id than all its neighbours, it becomes a source. In contrast, a node with all inward edges (i.e., with an id larger than all of its neighbours) is a sink. All other nodes are internal nodes. Once all the edges are oriented, the iteration phase starts. Each iteration is an electoral stage in which some candidates will be removed. Each iteration has two phases: YO- and -YO. In the YO- phase, sources start the process of propagating to each sink the smallest value among the sources connected to that sink.

YO-

1. A source (a local minimum) transmits its value to all its out-neighbours.

An example of YO-YO procedure. a) The network, b) Oriented network after setup phase, c) YO- phase in which source values are passed, d)-YO phase sending responses from sinks, e) updated structure after -YO phase.

2. An internal node waits to receive a value from all its in-neighbours. It calculates the minimum and sends it to its out-neighbours.

3. A sink (a node with no outgoing edges) receives all the values and computes their minimum.

-YO

1. A sink sends YES to the in-neighbours from which it saw the smallest value and NO to the others.

2. An internal node sends YES to all in-neighbours from which it received the smallest value and NO to the others. If it receives at least one NO, it sends NO to all.

3. A source waits until it receives all votes. If all are YES, it survives; if not, it is no longer a candidate.

4. When a node x sends NO to an in-neighbour y, the logical direction of that edge is reversed.

5. When a node y receives NO from an out-neighbour, it flips the direction of that link.

After the final stage, any source that receives a NO is no longer a source and becomes a sink. An additional stage, pruning, is also introduced to remove the nodes that are useless, i.e., whose existence has no impact on the next iterations.

1. If a sink is a leaf, then it is useless and is therefore removed.

2. If, in the YO- phase, the same value is received by a node from more than one in-neighbour, it will ask all but one of them to remove the link connecting them.

This method has a total cost of O(m log n) messages. Its real message complexity including pruning is an open research problem and is unknown.

19.3 Applications

19.3.1 Radio networks

In radio network protocols, leader election is often used as a first step to approach more advanced communication primitives, such as message gathering or broadcasts.[27] The very nature of wireless networks induces collisions when adjacent nodes transmit at the same time; electing a leader allows one to better coordinate this process. While the diameter D of a network is a natural lower bound for the time needed to elect a leader, upper and lower bounds for the leader election problem depend on the specific radio model studied.

Models and runtime

In radio networks, the n nodes may in every round choose to either transmit or receive a message. If no collision detection is available, then a node cannot distinguish between silence and receiving more than one message at a time. Should collision detection be available, then a node may detect more than one incoming message at the same time, even though the messages themselves cannot be decoded in that case. In the beeping model, nodes can only distinguish between silence and at least one message via carrier sensing.

Known runtimes for single-hop networks range from a constant (expected with collision detection) to O(n log n) rounds (deterministic and no collision detection). In multi-hop networks, known runtimes differ from roughly O((D + log n)(log log n)²) rounds (with high probability in the beeping model), O(D log n) (deterministic in the beeping model), O(n) (deterministic with collision detection) to O(n log^(3/2) n (log log n)^(1/2)) rounds (deterministic and no collision detection).

19.4 See also

• Distributed Systems#Coordinator election

• Bully algorithm

• Chang and Roberts algorithm

• HS algorithm

• Voting system

19.5 References

[1] R. G. Gallager, P. A. Humblet, and P. M. Spira (January 1983). “A Distributed Algorithm for Minimum-Weight Spanning Trees” (PDF). ACM Transactions on Programming Languages and Systems 5 (1): 66–77. doi:10.1145/357195.357200.

[2] Ephraim Korach, Shay Kutten, Shlomo Moran (1990). “A Modular Technique for the Design of Efficient Distributed Leader Finding Algorithms”. ACM Transactions on Programming Languages and Systems 12 (1): 84–101. doi:10.1145/77606.77610.

[3] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations and Advance Topics, John Wiley & Sons inc., 2004, chap. 3

[4] I. Gupta, R. van Renesse, and K. P. Birman,2000, A Probabilistically Correct Leader Election Protocol for Large Groups, Technical Report , Cornell University

[5] R. Bakhshi, W. Fokkink, J. Pang, and J. van de Pol, 2008, “Leader Election in Anonymous Rings: Franklin Goes Probabilistic”, TCS, Vol. 273, pp. 57-72.

[6] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations and Advance Topics, John Wiley & Sons inc., 2004, chap. 3

[7] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations and Advance Topics, John Wiley & Sons inc., 2004, chap. 3

[8] H. Attiya and M. Snir, 1988,"Computing on an anonymous ring”,JACM,Vol. 35, issue. 4, pp. 845-875

[9] A. Itai and M. Rodeh, 1990,"Symmetry breaking in distributed networks”, Vol. 88, issue 1, pp. 60-87.

[10] L. Higham and S. Myers, 1998, “Self-Stabilizing Token Circulation on Anonymous Message Passing Rings”, Second International Conference on Principles of Distributed Systems.

[11] G. Itkis, C. Lin, and J. Simon, 1995, "Deterministic, constant space, self-stabilizing leader election on uniform rings.”, In Proc. 9th Workshop on Distributed Algorithms, Vol. 972, pp. 288-302.

[12] J. Burns and J. Pachl,1989,"Uniform self-stabilizing rings”,ACM Trans. Program. Lang. Systems, Vol. 11, issue. 2, pp.330-344

[13] T. Herman, 1990, “Probabilistic self-stabilization”, Inf. Process. Lett., Vol. 35, issue 2, pp.63-67.

[14] G. Tel,Introduction to Distributed Algorithms. Cambridge University Press, 2000.2nd edition

[15] M. Fischer and H. Jiang, 2006, "Self-stabilizing leader election in networks of finite-state anonymous agents”, In Proc. 10th Conf. on Principles of Distributed Systems, Vol. 4305, pp. 395-409.

[16] E. Chang and R. Roberts, 1979, “An improved algorithm for decentralized extrema-finding in circular configurations of processes”, ACM, Vol. 22, issue 5, pp. 281-283.

[17] D. S. Hirschberg and J. B. Sinclair, 1980, “Decentralized extrema-finding in circular configurations of processors”, ACM, Vol. 23, issue 11, pp. 627-628.

[18] N. Santoro, Design and Analysis of Distributed Algorithms, Wiley, 2006.

[19] H. Kallasjoki, 2007, “Election in Mesh, Cube and Complete Networks”, Seminar on Theoretical Computer Science.

[20] N. Santoro, Design and Analysis of Distributed Algorithms, Wiley, 2006.

[21] M. Refai, A. Sharieh and . Alsmmari, 2010, “Leader Election Algorithm in 2D Torus Network with the Presence of One Link Failure”, The International Arab Journal of Information Technology, Vol. 7, No. 2.

[22] M Al Refai,2014, “Dynamic Leader Election Algorithm in 2D Torus Network with Multi Links Failure”, IJCST, Vol. 2, issue 5.

[23] N. Santoro, Design and Analysis of Distributed Algorithms, Wiley, 2006.

[24] J. Villadangos, A. Cordoba, F. Farina, and M. Prieto, 2005, “Efficient leader election in complete networks”, PDP, pp. 136-143.

[25] N. Santoro, Design and Analysis of Distributed Algorithms, Wiley, 2006.

[26] N. Santoro, Design and Analysis of Distributed Algorithms, Wiley, 2006.

[27] Haeupler, Bernhard; Ghaffari, Mohsen (2013). “Near Optimal Leader Election in Multi-Hop Radio Networks”. Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. doi:10.1137/1.9781611973105.54.

Chapter 20

Quantum Byzantine agreement

Byzantine fault tolerant protocols are algorithms that are robust to arbitrary types of failures in distributed algorithms. With the advent and popularity of the Internet, there is a need to develop algorithms that do not require any centralized control and that have some guarantee of always working correctly. The Byzantine agreement protocol is an essential part of this task. In this article the quantum version of the Byzantine protocol,[1] which works in constant time, is described.

20.1 Introduction

The Byzantine Agreement protocol is a protocol in distributed computing. It takes its name from a problem formulated by Lamport, Shostak and Pease in 1982,[2] which itself is a reference to a historical problem. The Byzantine army was divided into divisions with each division being led by a General with the following properties:

• Each General is either loyal or a traitor to the Byzantine state.

• All Generals communicate by sending and receiving messages.

• There are only two commands: attack and retreat.

• All loyal Generals should agree on the same plan of action: attack or retreat.

• A small linear fraction of bad Generals should not cause the protocol to fail (less than a 1/3 fraction).

(See [3] for the proof of the impossibility result). The problem usually is equivalently restated in the form of a commanding General and loyal Lieutenants with the General being either loyal or a traitor and the same for the Lieutenants with the following properties.

• All loyal Lieutenants carry out the same order.

• If the commanding General is loyal, all loyal Lieutenants obey the order that he sends.

• A strictly less than 1/3 fraction, including the commanding General, are traitors.

20.2 Byzantine Failure and Resilience

Failures in an algorithm or protocol can be categorized into three main types:

1. A failure to take another execution step in the algorithm: This is usually referred to as a “fail stop” fault.

2. A random failure to execute correctly: This is called a “random fault” or “random Byzantine” fault.

3. An arbitrary failure where the algorithm fails to execute the steps correctly (usually in a clever way by some adversary to make the whole algorithm fail), which also encompasses the previous two types of faults; this is called a “Byzantine fault”.


A Byzantine resilient or Byzantine fault tolerant protocol or algorithm is an algorithm that is robust to all the kinds of failures mentioned above. For example, given a space shuttle with multiple redundant processors and some of the processors give incorrect data, which processors or sets of processors should be believed? The solution can be formulated as a Byzantine fault tolerant protocol.

20.3 Sketch of the Algorithm

We will sketch here the asynchronous algorithm.[1] The algorithm works in two phases:

• Phase 1 (Communication phase):

All messages are sent and received in this round.

A coin flipping protocol is a procedure that allows two parties A and B that do not trust each other to toss a coin to win a particular object.

There are two types of coin flipping protocols:

• Weak coin flipping protocols:[4] The two players A and B initially start with no inputs and they are to compute some value cA, cB ∈ {0, 1} and be able to accuse anyone of cheating. The protocol is successful if A and B agree on the outcome. The outcome 0 is defined as A winning and 1 as B winning. The protocol has the following properties:

• If both players are honest (they follow the protocol), then they agree on the outcome of the protocol, cA = cB, with Pr(cA = cB = b) = 1/2 for b ∈ {0, 1}.

• If one of the players is honest (i.e., the other player may deviate arbitrarily from the protocol in his or her local computation), then the other party wins with probability at most 1/2 + ϵ. In other words, if B is dishonest, then Pr(cA = cB = 1) ≤ 1/2 + ϵ, and if A is dishonest, then Pr(cA = cB = 0) ≤ 1/2 + ϵ.

• A strong coin flipping protocol: In a strong coin flipping protocol, the goal is instead to produce a random bit which is biased away from any particular value 0 or 1. Clearly, any strong coin flipping protocol with bias ϵ leads to weak coin flipping with the same bias.

20.3.1 Verifiable secret sharing.

• A verifiable secret sharing (VSS) protocol:[5] An (n, k) secret sharing protocol allows a set of n players to share a secret s such that only a quorum of k or more players can discover the secret. The player sharing the secret (distributing the secret pieces) is usually referred to as the dealer. A verifiable secret sharing protocol differs from a basic secret sharing protocol in that players can verify that their shares are consistent even in the presence of a malicious dealer.

20.3.2 The Fail-stop protocol.

Protocol QuantumCoinFlip for player Pi

1. Round 1: generate the state |Coin_i⟩ = (1/√2)|0, 0, ..., 0⟩ + (1/√2)|1, 1, ..., 1⟩ on n qubits, and send the kth qubit to the kth player, keeping one part.

2. Generate the state |Leader_i⟩ = (1/n^(3/2)) Σ_{a=1}^{n³} |a, a, ..., a⟩ on n qubits, an equal superposition of the numbers between 1 and n³. Distribute the n qubits between all the players.

3. Receive the quantum messages from all players and wait for the next communication round, thus forcing the adversary to choose which messages were passed.

4. Round 2: Measure (in the standard basis) all Leader_j qubits received in round 1. Select the player with the highest leader value (ties broken arbitrarily) as the “leader” of the round. Measure the leader’s coin in the standard basis.

5. Set the output of the QuantumCoinFlip protocol: vi = measurement outcome of the leader’s coin.

20.3.3 The Byzantine protocol.

To generate a random coin, assign an integer in the range [0, n−1] to each player. Each player is not allowed to choose its own random ID, so each player P_k selects a random number s_k^i for every other player P_i and distributes this using a verifiable secret sharing scheme. At the end of this phase the players agree on which secrets were properly shared; the secrets are then opened and each player P_i is assigned the value s_i = Σ s_k^i mod n, where the sum is over all properly shared secrets. This requires private information channels, so we replace the random secrets by the superposition |ϕ⟩ = (1/√n) Σ_{a=0}^{n−1} |a⟩, in which the state is encoded using a quantum verifiable secret sharing protocol (QVSS).[6] We cannot distribute the state |ϕ, ϕ, ..., ϕ⟩ since the bad players can collapse the state. To prevent bad players from doing so, we encode the state using the quantum verifiable secret sharing (QVSS) and send each player its share of the secret. Here again the verification requires Byzantine Agreement, but replacing the agreement by the grade-cast protocol is enough.[7][8]

Grade-cast protocol

A grade-cast protocol has the following properties, using the definitions in [7]. Informally, a graded broadcast protocol is a protocol with a designated player called the “dealer” (the one who broadcasts) such that:

1. If the dealer is good, all the players get the same message.

2. Even if the dealer is bad, if some good player accepts the message, all the good players get the same message (but they may or may not accept it).

A protocol P is said to achieve graded broadcast if, at the beginning of the protocol, a designated player D (called the dealer) holds a value v, and at the end of the protocol, every player Pi outputs a pair (valuei, confidencei) such that the following properties hold (∀i, confidencei ∈ {0, 1, 2}):

1. If D is honest, then valuei = v and confidencei = 2 for every honest player Pi .

2. For any two honest players Pi and Pj, |confidencei − confidencej| ≤ 1 .

3. (Consistency) For any two honest players Pi and Pj , if confidencei > 0 and confidencej > 0 ,then valuei = valuej .

For t < n/4 the verification stage of the QVSS protocol guarantees that for a good dealer the correct state will be encoded, and that for any, possibly faulty, dealer some particular state will be recovered during the recovery stage. We note that for the purpose of our Byzantine quantum coin flip protocol the recovery stage is much simpler. Each player measures his share of the QVSS and sends the classical value to all other players. The verification stage guarantees, with high probability, that in the presence of up to t < n/4 faulty players all the good players will recover the same classical value (which is the same value that would result from a direct measurement of the encoded state).

20.4 Remarks

In 2007, a quantum protocol for Byzantine Agreement was demonstrated experimentally [9] using a four-photon polarization-entangled state. This shows that the quantum implementation of classical Byzantine Agreement protocols is indeed feasible.

20.5 References

[1] Michael Ben-Or and Avinatan Hassidim, Fast quantum byzantine agreement, STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pg 481-485 [2005]

[2] L. Lamport and R. Shostak and M. Pease, The Byzantine Generals Problem, ACM Trans. Program. Lang. Syst., volume 4, number 3, pg 382-401 [1982]

[3] Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, volume 32, issue 2, pg 374-382 [1985]

[4] I. Kerenidis, A. Nayak, Weak coin flipping with small bias, arXiv

[5] Verifiable secret sharing

[6] Claude Crépeau, Daniel Gottesman and Adam Smith, Secure Multi-party Quantum Computation, In 34th ACM Symposium on the Theory of Computing, STOC, pg. 643–652, [2002]

[7] Michael Ben-Or, Elan Pavlov, Vinod Vaikuntanathan, Byzantine Agreement in the Full-Information Model in O(log n) Rounds, STOC '06: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pg 179-186 [2006]

[8] Pesech Feldman and Silvio Micali. An optimal probabilistic protocol for synchronous byzantine agreement. SIAM J. Comput., pg 873–933, [1997]

[9] Sascha Gaertner, Mohamed Bourennane, Christian Kurtsiefer, Adán Cabello, Harald Weinfurter, Experimental Demonstration of a Quantum Protocol for Byzantine Agreement and Liar Detection, arXiv:0710.0290v2, [2007], Phys. Rev. Lett. 100 (2008) 070504.

Chapter 21

Race condition

A race condition or race hazard is the behavior of an electronic, software or other system where the output is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when events do not happen in the order the programmer intended. The term originates with the idea of two signals racing each other to influence the output first. Race conditions can occur in electronics systems, especially logic circuits, and in computer software, especially multithreaded or distributed programs.

21.1 Electronics

A typical example of a race condition may occur in a system of logic gates, where inputs vary. If a particular output depends on the state of the inputs, it may only be defined for steady-state signals. As the inputs change state, a small delay will occur before the output changes, due to the physical nature of the electronic system. For a brief period, the output may change to an unwanted state before settling back to the designed state. Certain systems can tolerate such glitches, but if, for example, this output functions as a clock signal for further systems that contain memory, the system can rapidly depart from its designed behaviour (in effect, the temporary glitch becomes a permanent glitch).

For example, consider a two input AND gate fed with a logic signal A on one input and its negation, NOT A, on another input. In theory, the output (A AND NOT A) should never be true. However, if changes in the value of A take longer to propagate to the second input than the first when A changes from false to true, a brief period will ensue during which both inputs are true, and so the gate’s output will also be true.[1]

Design techniques such as Karnaugh maps encourage designers to recognize and eliminate race conditions before they cause problems. Often logic redundancy can be added to eliminate some kinds of races. As well as these problems, some logic elements can enter metastable states, which create further problems for circuit designers.
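A minimal discrete-time sketch of the glitch just described: only the inverter's propagation delay is modelled (the AND gate's own delay would shift the pulse in time but not change its width), and the numbers are illustrative.

def simulate_glitch(a_signal, inverter_delay):
    # The inverter's output lags A by `inverter_delay` ticks, so both AND inputs
    # are briefly 1 after a 0 -> 1 transition of A, producing a spike of that width.
    output = []
    for t, a in enumerate(a_signal):
        past_a = a_signal[t - inverter_delay] if t >= inverter_delay else 0
        not_a = 1 - past_a                 # delayed NOT A (A assumed 0 before t=0)
        output.append(a & not_a)           # ideal AND of the two gate inputs
    return output

a = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]         # A switches from 0 to 1 at t = 3
print(simulate_glitch(a, inverter_delay=2))
# -> [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]: a glitch of width 2, even though A AND NOT A
#    should always be 0 in steady state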

21.1.1 Critical and non-critical race conditions

A critical race occurs when the order in which internal variables are changed determines the eventual state that the state machine will end up in. A non-critical race occurs when the order in which internal variables are changed does not alter the eventual state. In other words, a non-critical race occurs when moving to a desired state means that more than one internal state variable must be changed at once, but no matter in what order these internal state variables change, the resultant state will be the same.

21.1.2 Static, dynamic, and essential race conditions

Static race conditions: These are caused when a signal and its complement are combined together.


Race condition in a logic circuit. Here, ∆t1 and ∆t2 represent the propagation delays of the logic elements. When the input value (A) changes from low to high, the circuit outputs a short spike of duration (∆t1+∆t2) - ∆t2 = ∆t1.

Dynamic race conditions: These result in multiple transitions when only one is intended. They are due to interaction between gates (dynamic race conditions can be eliminated by using no more than two levels of gating).

Essential race conditions: These are caused when an input has two transitions in less than the total feedback propagation time. Sometimes they are cured using inductive delay-line elements to effectively increase the time duration of an input signal.

21.2 Software

Race conditions arise in software when an application depends on the sequence or timing of processes or threads for it to operate properly. As with electronics, there are critical race conditions that result in invalid execution and bugs, as well as non-critical race conditions that result in unanticipated behavior. Critical race conditions often happen when the processes or threads depend on some shared state. Operations upon shared states are critical sections that must be mutually exclusive. Failure to obey this rule opens up the possibility of corrupting the shared state.

Race conditions have a reputation of being difficult to reproduce and debug, since the end result is nondeterministic and depends on the relative timing between interfering threads. Problems occurring in production systems can therefore disappear when running in debug mode, when additional logging is added, or when attaching a debugger; such an elusive defect is often referred to as a “Heisenbug”. It is therefore better to avoid race conditions by careful software design rather than attempting to fix them afterwards.

21.2.1 Example

As a simple example, let us assume that two threads each want to increment the value of a global integer variable by one. Ideally, the following sequence of operations would take place: the first thread reads the value (0), increments it and writes back 1; the second thread then reads the value (1), increments it and writes back 2. In that case the final value is 2, as expected. However, if the two threads run simultaneously without locking or synchronization, the outcome of the operation could be wrong: both threads may read the value 0 before either has written, each increments its private copy, and each writes back 1. The final value is 1 instead of the expected result of 2. This occurs because the increment operations of the second case are not mutually exclusive. Mutually exclusive operations are those that cannot be interrupted while accessing some resource such as a memory location.
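A hedged Python illustration of the same lost update with real threads: the unsynchronized version performs the non-atomic read-modify-write concurrently, while the locked version makes the critical section mutually exclusive. Whether the loss is visible on a given run depends on the interpreter's thread scheduling, so the unsafe total is frequently, but not always, below the expected value.

import threading

ITERS = 1_000_000
counter = 0
lock = threading.Lock()

def unsafe_increment():
    global counter
    for _ in range(ITERS):
        counter += 1              # non-atomic read-modify-write on shared state

def safe_increment():
    global counter
    for _ in range(ITERS):
        with lock:                # the critical section is now mutually exclusive
            counter += 1

for worker in (unsafe_increment, safe_increment):
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # safe_increment always yields 2_000_000; unsafe_increment is often lower
    print(worker.__name__, counter)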

21.2.2 File systems

In file systems, two or more programs may “collide” in their attempts to modify or access a file, which could result in data corruption. File locking provides a commonly used solution. A more cumbersome remedy involves organizing the system in such a way that one unique process (running a daemon or the like) has exclusive access to the file, and all other processes that need to access the data in that file do so only via interprocess communication with that one process (which of course requires synchronization at the process level).

A different form of race condition exists in file systems where unrelated programs may affect each other by suddenly using up available resources such as disk space (or memory, or processor cycles). Software not carefully designed to anticipate and handle this race situation may then become quite fragile and unpredictable. Such a risk may be overlooked for a long time in a system that seems very reliable. But eventually enough data may accumulate or enough other software may be added to critically destabilize many parts of a system. Probably the best known example of this occurred with the near loss of the Mars Rover “Spirit” not long after landing, but this is a commonly overlooked hazard in many computer systems.

A solution is for software to request and reserve all the resources it will need before beginning a task; if this request fails then the task is postponed, avoiding the many points where failure could have occurred. (Alternatively, each of those points can be equipped with error handling, or the success of the entire task can be verified afterwards, before continuing.) A more common but incorrect approach is to simply verify that enough disk space (for example) is available before starting a task; this is not adequate because in complex systems the actions of other running programs can be unpredictable.
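A small sketch of the file-locking remedy mentioned above on Unix-like systems, using Python's fcntl module (the file name is invented for the example; Windows would need a different mechanism, such as msvcrt.locking):

import fcntl

def append_record(path, line):
    # Append a line while holding an exclusive advisory lock on the file,
    # so concurrent writers cannot interleave or clobber each other's data.
    with open(path, "a") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # blocks until no other holder
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

append_record("shared.log", "event from process A")   # illustrative file name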

21.2.3 Networking

In networking, consider a distributed chat network like IRC, where a user who starts a channel automatically acquires channel-operator privileges. If two users on different servers, on different ends of the same network, try to start the same-named channel at the same time, each user’s respective server will grant channel-operator privileges to each user, since neither server will yet have received the other server’s signal that it has allocated that channel. (Note that this problem has been largely solved by various IRC server implementations.)

In this case of a race condition, the concept of the “shared resource” covers the state of the network (what channels exist, as well as what users started them and therefore have what privileges), which each server can freely change as long as it signals the other servers on the network about the changes so that they can update their conception of the state of the network. However, the latency across the network makes possible the kind of race condition described. In this case, heading off race conditions by imposing a form of control over access to the shared resource—say, appointing one server to control who holds what privileges—would mean turning the distributed network into a centralized one (at least for that one part of the network operation).

Race conditions can also exist when a computer program is written with non-blocking sockets, in which case the performance of the program can be dependent on the speed of the network link.

21.2.4 Life-critical systems

Software flaws in life-critical systems can be disastrous. Race conditions were among the flaws in the Therac-25 radiation therapy machine, which led to the death of at least three patients and injuries to several more.[2] Another example is the Energy Management System provided by GE Energy and used by Ohio-based FirstEnergy Corp (among other power facilities). A race condition existed in the alarm subsystem; when three sagging power lines were tripped simultaneously, the condition prevented alerts from being raised to the monitoring technicians, delaying their awareness of the problem. This software flaw eventually led to the North American Blackout of 2003.[3] GE Energy later developed a software patch to correct the previously undiscovered error.

21.2.5 Computer security

A specific kind of race condition involves checking for a predicate (e.g. for authentication), then acting on the predicate, while the state can change between the time of check and the time of use. When this kind of bug exists in security-conscious code, a security vulnerability called a time-of-check-to-time-of-use (TOCTTOU) bug is created.
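The classic shape of such a bug, sketched in Python with an invented path: the access check and the open are separate system calls, and the file can be replaced (for example with a symbolic link to a sensitive file) in the window between them. Attempting the operation directly and handling the failure removes that particular window, although privileged programs need further care.

import os

path = "/tmp/user_supplied_file"      # illustrative, attacker-influenced path

# Vulnerable pattern: time of check ...
if os.access(path, os.R_OK):
    # ... time of use: `path` may point somewhere else by now.
    with open(path) as f:
        data = f.read()

# Less fragile pattern: just attempt the operation and handle the error,
# so there is no separate check whose result can go stale.
try:
    with open(path) as f:
        data = f.read()
except OSError:
    data = None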

21.3 Examples outside of Computing

21.3.1 Biology

Neuroscience is demonstrating that race conditions can occur in mammal (rat) brains as well.[4][5]

21.4 See also

• Concurrency control

• Deadlock

• Synchronization (computer science)

• Linearizability

• Racetrack problem

• Call collision

21.5 References

[1] Unger, S.H. (June 1995). “Hazards, Critical Races, and Metastability”. IEEE Transactions on Computers 44 (6): 754–768. doi:10.1109/12.391185.

[2] “An Investigation of Therac-25 Accidents — I”. Courses.cs.vt.edu. Retrieved 2011-09-19.

[3] Kevin Poulsen (2004-04-07). “Tracking the blackout bug”. Securityfocus.com. Retrieved 2011-09-19.

[4] “How Brains Race to Cancel Errant Movements”. Discover Magazine blogs. 2013-08-03.

[5] Schmidt, Robert; Leventhal, Daniel K; Mallet, Nicolas; Chen, Fujun; Berke, Joshua D (2013). “Canceling actions involves a race between basal ganglia pathways”. Nature Neuroscience 16 (8): 1118–24. doi:10.1038/nn.3456. PMC 3733500. PMID 23852117.

21.6 External links

• Karam, G.M.; Buhr, R.J.A. (August 1990). “Starvation and Critical Race Analyzers for Ada”. IEEE Transactions on Software Engineering 16 (8): 829–843. doi:10.1109/32.57622.

• Fuhrer, R.M.; Lin, B.; Nowick, S.M. (27–29 Mar 1995). “Algorithms for the optimal state assignment of asynchronous state machines”. Advanced Research in VLSI, 1995. Proceedings., 16th Conference on. pp. 59–75. doi:10.1109/ARVLSI.1995.515611. ISBN 0-8186-7047-9. as PDF

• Paper "A Novel Framework for Solving the State Assignment Problem for Event-Based Specifications" by Luciano Lavagno, Cho W. Moon, Robert K. Brayton and Alberto Sangiovanni-Vincentelli

• Wheeler, David A. (7 October 2004). “Secure programmer: Prevent race conditions—Resource contention can be used against you”. IBM developerWorks.

• Chapter "Avoid Race Conditions" (Secure Programming for Linux and Unix HOWTO)

• Race conditions, security, and immutability in Java, with sample source code and comparison to C code, by Chiral Software

• Karpov, Andrey (11 April 2009). “Interview with Dmitriy Vyukov — the author of Relacy Race Detector (RRD)". Intel Software Library Articles.

• Microsoft Support description

Chapter 22

Self-stabilization

Self-stabilization is a concept of fault-tolerance in distributed computing. A distributed system that is self-stabilizing will end up in a correct state no matter what state it is initialized with. That correct state is reached after a finite number of execution steps.

At first glance, the guarantee of self-stabilization may seem less promising than that of the more traditional fault tolerance of algorithms, which aim to guarantee that the system always remains in a correct state under certain kinds of state transitions. However, that traditional fault tolerance cannot always be achieved. For example, it cannot be achieved when the system is started in an incorrect state or is corrupted by an intruder. Moreover, because of their complexity, it is very hard to debug and to analyze distributed systems. Hence, it is very hard to prevent a distributed system from reaching an incorrect state. Indeed, some forms of self-stabilization are incorporated into many modern computer and telecommunications networks, since they give these systems the ability to cope with faults that were not foreseen in the design of the algorithm.

Many years after the seminal paper of Edsger Dijkstra in 1974, this concept remains important as it presents an important foundation for self-managing computer systems and fault-tolerant systems. As a result, Dijkstra’s paper received the 2002 ACM PODC Influential-Paper Award, one of the highest recognitions in the distributed computing community.[1] Moreover, after Dijkstra’s death, the award was renamed and is now called the Dijkstra Award.

22.1 History

E.W. Dijkstra in 1974 presented the concept of self-stabilization, prompting further research in this area.[2] He also presented the first self-stabilizing algorithms that did not rely on strong assumptions on the system. Some previous protocols used in practice did actually stabilize, but only assuming the existence of a clock that was global to the system, and assuming a known upper bound on the duration of each system transition. It was only ten years later, when Leslie Lamport pointed out the importance of Dijkstra’s work,[3] that researchers directed their attention to this elegant fault-tolerance concept.

22.2 Overview

A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and remain in a legitimate set of states thereafter. A state is legitimate if starting from this state the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized as it eventually starts to behave correctly regardless of its initial state. Dijkstra’s paper, which introduces the concept of self-stabilization, presents an example in the context of a "token ring" — a network of computers ordered in a circle, such that exactly one of them is supposed to “hold a token” at any given time.

• Not holding a token is a correct state for each computer in this network, since the token can be held by another computer. However, if every computer is in the state of “not holding a token” then the network altogether is not in a correct state.

• Similarly, if more than one computer “holds a token” then this is not a correct state for the network, although it cannot be observed to be incorrect by viewing any computer individually. Since every computer can “observe” only the states of its two neighbors, it is hard for the computers to decide whether the network altogether is in a correct state.
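A minimal simulation of Dijkstra's K-state token ring, written from the standard textbook presentation of the 1974 construction rather than reproduced from this article: machine 0 is privileged ("holds the token") when its counter equals its predecessor's and then advances it modulo K; every other machine is privileged when its counter differs from its predecessor's and then copies that value. Starting from an arbitrary assignment of counters, the ring converges to a configuration with exactly one privileged machine.

import random

def privileged(x, i):
    # Machine 0 holds the token when its counter equals its predecessor's;
    # every other machine holds it when its counter differs from its predecessor's.
    return (x[0] == x[-1]) if i == 0 else (x[i] != x[i - 1])

def step(x, K):
    # The scheduler lets one arbitrary token-holding machine make its move.
    i = random.choice([j for j in range(len(x)) if privileged(x, j)])
    if i == 0:
        x[0] = (x[0] + 1) % K          # machine 0 advances its counter
    else:
        x[i] = x[i - 1]                # other machines copy their predecessor

n, K = 5, 6                            # K > n is needed for convergence
x = [random.randrange(K) for _ in range(n)]   # arbitrary, possibly illegal, start
for _ in range(200):                   # far more moves than needed for n = 5
    step(x, K)
print(sum(privileged(x, i) for i in range(n)))   # 1: exactly one token remains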

The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state. Since traditional methods for detecting an error[4] were often very difficult and time-consuming, such a behavior was considered desirable. (The method described in the paper cited above collects a huge amount of information from the whole network to one place; after that, it attempts to determine whether the collected global state is correct; even that determination alone can be a hard task).

22.2.1 Efficiency improvements

More recently, researchers have presented newer methods for light-weight error detection for self-stabilizing systems using local checking.[5][6] The term local refers to a part of a computer network. When local detection is used, a computer in a network is not required to communicate with the entire network in order to detect an error — the error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably. This is because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods also turned out to be much more efficient. Moreover, these papers suggested rather efficient general transformers to transform non self stabilizing algorithms to become self stabilizing. The idea is to:

1. Run the non self stabilizing protocol, at the same time,

2. detect faults (during the execution of the given protocol) using the above mentioned detection methods,

3. then, apply a (self stabilizing) “reset” protocol to return the system to some predetermined initial state, and, finally,

4. restart the given (non-self stabilizing) protocol.

The combination of these 4 parts is self stabilizing. Initial self stabilizing protocols were also presented in the above papers. More efficient reset protocols were presented later, e.g.[7]

Additional efficiency was introduced with the notion of time-adaptive protocols.[8] The idea behind these is that when only a small number of errors occurs, the recovery time can (and should) be made short. Dijkstra’s original self-stabilization algorithms do not have this property.

A useful property of self-stabilizing algorithms is that they can be composed of layers if the layers do not exhibit any circular dependencies. The stabilization time of the composition is then bounded by the sum of the individual stabilization times of each layer.
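A schematic sketch of that generic transformer; the protocol, the fault detector and the reset routine here are invented placeholders standing in for the real constructions (e.g., local checking[5][6]):

import random

class ToyProtocol:
    # Placeholder protocol: keeps a counter that a transient fault may corrupt.
    def step(self, state):
        state = state + 1
        if random.random() < 0.05:            # simulated transient fault
            state = -random.randrange(100)    # arbitrary corrupted state
        return state

def detect_fault(state):
    return state < 0                          # the local legitimacy check

def reset(state):
    return 0                                  # predetermined initial state

def run_self_stabilizing(protocol, state, steps):
    # Generic transformer: run, detect, reset, restart (steps 1-4 above).
    for _ in range(steps):
        state = protocol.step(state)          # 1. run the given protocol
        if detect_fault(state):               # 2. detect faults during execution
            state = reset(state)              # 3. apply the "reset" protocol
            # 4. the next loop iteration restarts the given protocol
    return state

print(run_self_stabilizing(ToyProtocol(), state=0, steps=1000) >= 0)   # True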

22.2.2 Time complexity

The time complexity of a self-stabilizing algorithm is measured in (asynchronous) rounds or cycles.

• A round is the shortest execution trace in which each processor executes at least one step.

• Similarly, a cycle is the shortest execution trace in which each processor executes at least one complete iteration of its repeatedly executed list of commands.

It is also interesting to measure the output stabilization time. For that, a subset of the state variables is defined to be externally visible (the output). Certain states of outputs are defined to be correct (legitimate). The set of the outputs of all the components of the system is said to have stabilized at the time that it starts to be correct, provided it stays correct indefinitely, unless additional faults occur. The output stabilization time is the time (the number of (asynchronous) rounds) until the output stabilizes.[5]

22.3 Definition

A system is self-stabilizing if and only if:

1. Starting from any state, it is guaranteed that the system will eventually reach a correct state (convergence).

2. Given that the system is in a correct state, it is guaranteed to stay in a correct state, provided that no fault happens (closure).

A system is said to be randomized self-stabilizing if and only if it is self-stabilizing and the expected number of rounds needed to reach a correct state is bounded by some constant k.[9]

Design of self-stabilization in the above mentioned sense is well known to be a difficult job. In fact, a class of distributed algorithms do not have the property of local checking: the legitimacy of the network state cannot be evaluated by a single process. The most obvious case is Dijkstra’s token-ring defined above: no process can detect whether the network state is legitimate or not in the case where more than one token is present in non-neighboring processes. This suggests that self-stabilization of a distributed system is a sort of group intelligence where each component is taking local actions, based on its local knowledge but eventually this guarantees global convergence at the end.

To help overcome the difficulty of designing self-stabilization as defined above, other types of stabilization were devised. For instance, weak stabilization is the property that a distributed system has a possibility to reach its legitimate behavior from every possible state.[10] Weak stabilization is easier to design as it just guarantees a possibility of convergence for some runs of the distributed system rather than convergence for every run.

A self-stabilizing algorithm is silent if and only if it converges to a global state where the values of communication registers used by the algorithm remain fixed.[11]

22.4 Related work

An extension of the concept of self-stabilization is that of superstabilization.[12] The intent here is to cope with dynamic distributed systems that undergo topological changes. In classical self-stabilization theory, arbitrary changes are viewed as errors where no guarantees are given until the system has stabilized again. With superstabilizing systems, there is a passage predicate that is always satisfied while the system’s topology is reconfigured.

22.5 References

[1] “PODC Influential Paper Award: 2002”, ACM Symposium on Principles of Distributed Computing, retrieved 2009-09-01

[2] Dijkstra, Edsger W. (1974), “Self-stabilizing systems in spite of distributed control” (PDF), Communications of the ACM 17 (11): 643–644, doi:10.1145/361179.361202.

[3] Lamport, Leslie (1985), “Solved problems, unsolved problems, and non-problems in concurrency” (PDF), ACM Special Interest Group Operating Systems Review 19 (4): 34–44, doi:10.1145/858336.858339.

[4] Katz, Shmuel; Perry, Kenneth J. (1993), “Self-stabilizing extensions for message-passing systems”, Distributed Computing 7 (1): 17–26, doi:10.1007/BF02278852.

[5] Awerbuch, Baruch; Patt-Shamir, Boaz; Varghese, George (1991), “Self-stabilization by local checking and correction”, Proc. 32nd Symposium on Foundations of Computer Science (FOCS), pp. 268–277, doi:10.1109/SFCS.1991.185378.

[6] Afek, Yehuda; Kutten, Shay; Yung, Moti (1997), “The local detection paradigm and its applications to self-stabilization”, Theoretical Computer Science 186 (1-2): 199–229, doi:10.1016/S0304-3975(96)00286-1, MR 1478668.

[7] Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, George Varghese. Time optimal self-stabilizing synchronization. ACM STOC 1993: 652-661.

[8] Shay Kutten, Boaz Patt-Shamir: Stabilizing Time-Adaptive Protocols. Theor. Comput. Sci. 220(1): 93-111 (1999).

[9] Dolev, Shlomi (2000), Self-Stabilization, MIT Press, ISBN 0-262-04178-2.

[10] Gouda, Mohamed (1995), The Triumph and Tribulation of System Stabilization (PDF), Proceedings of the 9th international workshop on distributed algorithms, http://www.cs.utexas.edu/users/gouda/papers/book%20chapters/11-whole.pdf.

[11] Shlomi Dolev, Mohamed G. Gouda, and Marco Schneider. Memory requirements for silent stabilization. In PODC '96: Proceedings of the fifteenth annual ACM Symposium on Principles of Distributed Computing, pages 27–34, New York, NY, USA, 1996. ACM Press. Online extended abstract.

[12] Dolev, Shlomi; Herman, Ted (1997), “Superstabilizing protocols for dynamic distributed systems”, Chicago Journal of Theoretical Computer Science, article 4.

22.6 External links

• libcircle - An implementation of self-stabilization using token passing for termination.

Chapter 23

Serializability

In concurrency control of databases,[1][2] transaction processing (transaction management), and various transactional applications (e.g., transactional memory[3] and software transactional memory), both centralized and distributed, a transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time. Transactions are normally executed concurrently (they overlap), since this is the most efficient way. Serializability is the major correctness criterion for concurrent transactions’ executions. It is considered the highest level of isolation between transactions, and plays an essential role in concurrency control. As such it is supported in all general purpose database systems. Strong strict two-phase locking (SS2PL) is a popular serializability mechanism utilized in most of the database systems (in various variants) since their early days in the 1970s.

Serializability theory provides the formal framework to reason about and analyze serializability and its techniques. Though it is mathematical in nature, its fundamentals are informally (without mathematics notation) introduced below.

23.1 Database transaction

Main article: Database transaction

In this context, a database transaction is a specific intended run (with specific parameters, e.g., with transaction identification, at least) of a computer program (or programs) that accesses a database (or databases). Such a program is written with the assumption that it is running in isolation from other executing programs, i.e., when running, its accessed data (after the access) are not changed by other running programs. Without this assumption the transaction’s results are unpredictable and can be wrong. The same transaction can be executed in different situations, e.g., in different times and locations, in parallel with different programs. A live transaction (i.e., one that exists in a computing environment with already allocated computing resources; to distinguish from a transaction request, waiting to get execution resources) can be in one of three states, or phases:

1. Running - Its program(s) is (are) executing.

2. Ready - Its program’s execution has ended, and it is waiting to be Ended (Completed).

3. Ended (or Completed) - It is either Committed or Aborted (Rolled-back), depending on whether the execution is considered a success or not, respectively. When committed, all its recoverable (i.e., with states that can be controlled for this purpose), durable resources (typically database data) are put in their final states, states after running. When aborted, all its recoverable resources are put back in their initial states, as before running.

A failure in the transaction’s computing environment before ending typically results in its abort. However, a transaction may be aborted for other reasons as well (e.g., see below).

Upon being ended (completed), the transaction’s allocated computing resources are released and the transaction disappears from the computing environment. However, the effects of a committed transaction remain in the database, while the effects of an aborted (rolled-back) transaction disappear from the database. The concept of an atomic transaction (“all or nothing” semantics) was designed to exactly achieve this behavior, in order to control correctness in complex faulty systems.

23.2 Correctness

23.2.1 Correctness - serializability

Serializability is a property of a transaction schedule (history). It relates to the isolation property of a database transaction.

Serializability of a schedule means equivalence (in the outcome, the database state, data values) to a serial schedule (i.e., sequential with no transaction overlap in time) with the same transactions. It is the major criterion for the correctness of concurrent transactions’ schedule, and thus supported in all general purpose database systems.

The rationale behind serializability is the following: If each transaction is correct by itself, i.e., meets certain integrity conditions, then a schedule that comprises any serial execution of these transactions is correct (its transactions still meet their conditions): “Serial” means that transactions do not overlap in time and cannot interfere with each other, i.e., complete isolation between each other exists. Any order of the transactions is legitimate, if no dependencies among them exist, which is assumed (see comment below). As a result, a schedule that comprises any execution (not necessarily serial) that is equivalent (in its outcome) to any serial execution of these transactions, is correct.

Schedules that are not serializable are likely to generate erroneous outcomes. Well known examples are with transactions that debit and credit accounts with money: If the related schedules are not serializable, then the total sum of money may not be preserved. Money could disappear, or be generated from nowhere. This and possible violations of other needed invariants are caused by one transaction writing, and “stepping on” and erasing what has been written by another transaction before it has become permanent in the database. It does not happen if serializability is maintained.

If any specific order between some transactions is requested by an application, then it is enforced independently of the underlying serializability mechanisms. These mechanisms are typically indifferent to any specific order, and generate some unpredictable partial order that is typically compatible with multiple serial orders of these transactions. This partial order results from the scheduling orders of concurrent transactions’ data access operations, which depend on many factors.

A major characteristic of a database transaction is atomicity, which means that it either commits, i.e., all its operations’ results take effect in the database, or aborts (is rolled back), and all its operations’ results do not have any effect on the database (“all or nothing” semantics of a transaction). In all real systems transactions can abort for many reasons, and serializability by itself is not sufficient for correctness. Schedules also need to possess the recoverability (from abort) property. Recoverability means that committed transactions have not read data written by aborted transactions (whose effects do not exist in the resulting database states). While serializability is currently compromised on purpose in many applications for better performance (only in cases when the application’s correctness is not harmed), compromising recoverability would quickly violate the database’s integrity, as well as that of transactions’ results external to the database. A schedule with the recoverability property (a recoverable schedule) “recovers” from aborts by itself, i.e., aborts do not harm the integrity of its committed transactions and resulting database. This is false without recoverability, where the likely integrity violations (resulting in incorrect database data) need special, typically manual, corrective actions in the database.

Implementing recoverability in its general form may result in cascading aborts: Aborting one transaction may result in a need to abort a second transaction, and then a third, and so on. This results in a waste of already partially executed transactions, and may result also in a performance penalty. Avoiding cascading aborts (ACA, or Cascadelessness) is a special case of recoverability that exactly prevents such phenomena. Often in practice a special case of ACA is utilized: Strictness. Strictness allows an efficient database recovery from failure.

Note that the recoverability property is needed even if no database failure occurs and no database recovery from failure is needed. It is rather needed to correctly automatically handle aborts, which may be unrelated to database failure and recovery from failure.
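A toy illustration of the money-preservation point above, with invented transactions and schedules: two deposit transactions interleave their read and write steps on the same account, the second write "steps on" the first, and money disappears; the serial schedule keeps the invariant.

def run(schedule, db):
    # Execute a schedule of (txn, op, account, delta) steps. Each transaction
    # bases its writes on the value it previously read (its private workspace).
    workspace = {}
    for txn, op, acct, delta in schedule:
        if op == "read":
            workspace[(txn, acct)] = db[acct]
        else:  # "write"
            db[acct] = workspace[(txn, acct)] + delta
    return db

# T1 deposits 10 into account A, T2 deposits 20. Serially the balance ends at 130.
serial = [("T1", "read", "A", 0), ("T1", "write", "A", 10),
          ("T2", "read", "A", 0), ("T2", "write", "A", 20)]

# Non-serializable interleaving: T2 reads before T1 writes, then overwrites
# ("steps on") T1's update, so 10 units of money simply disappear.
interleaved = [("T1", "read", "A", 0), ("T2", "read", "A", 0),
               ("T1", "write", "A", 10), ("T2", "write", "A", 20)]

print(run(serial, {"A": 100}))        # {'A': 130}
print(run(interleaved, {"A": 100}))   # {'A': 120}: not equivalent to any serial order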

23.2.2 Relaxing serializability

In many applications, unlike with finances, absolute correctness is not needed. For example, when retrieving a list of products according to specification, in most cases it does not matter much if a product, whose data was updated a short time ago, does not appear in the list, even if it meets the specification. It will typically appear in such a list when tried again a short time later. Commercial databases provide concurrency control with a whole range of isolation levels which are in fact (controlled) serializability violations in order to achieve higher performance. Higher performance means better transaction execution rate and shorter average transaction response time (transaction duration). Snapshot isolation is an example of a popular, widely utilized efficient relaxed serializability method with many characteristics of full serializability, but still short of some, and unfit in many situations.

Another common reason nowadays for distributed serializability relaxation (see below) is the requirement of availability of internet products and services. This requirement is typically answered by large-scale data replication. The straightforward solution for synchronizing replicas’ updates of a same database object is including all these updates in a single atomic distributed transaction. However, with many replicas such a transaction is very large, and may span several computers and networks, some of which are likely to be unavailable. Thus such a transaction is likely to end with abort and miss its purpose.[4] Consequently, Optimistic replication (Lazy replication) is often utilized (e.g., in many products and services by Google, Amazon, Yahoo, and alike), while serializability is relaxed and compromised for eventual consistency. Again in this case, relaxation is done only for applications that are not expected to be harmed by this technique.

Classes of schedules defined by relaxed serializability properties either contain the serializability class, or are incomparable with it.

23.3 View and conflict serializability

Mechanisms that enforce serializability need to execute in real time, or almost in real time, while transactions are running at high rates. In order to meet this requirement special cases of serializability, sufficient conditions for serializability which can be enforced effectively, are utilized.

Two major types of serializability exist: view-serializability, and conflict-serializability. View-serializability matches the general definition of serializability given above. Conflict-serializability is a broad special case, i.e., any schedule that is conflict-serializable is also view-serializable, but not necessarily the opposite. Conflict-serializability is widely utilized because it is easier to determine and covers a substantial portion of the view-serializable schedules. Determining view-serializability of a schedule is an NP-complete problem (a class of problems with only difficult-to-compute, excessively time-consuming known solutions).

View-serializability of a schedule is defined by equivalence to a serial schedule (no overlapping trans- actions) with the same transactions, such that respective transactions in the two schedules read and write the same data values (“view” the same data values).

Conflict-serializability is defined by equivalence to a serial schedule (no overlapping transactions) with the same transactions, such that both schedules have the same sets of respective chronologically ordered pairs of conflicting operations (same precedence relations of respective conflicting operations).

Operations upon data are read or write (a write: either insert or modify or delete). Two operations are conflicting if they are of different transactions, upon the same datum (data item), and at least one of them is write. Each such pair of conflicting operations has a conflict type: It is either a read-write, or write-read, or a write-write conflict. The transaction of the second operation in the pair is said to be in conflict with the transaction of the first operation.

A more general definition of conflicting operations (also for complex operations, which may each consist of several “simple” read/write operations) requires that they are noncommutative (changing their order also changes their combined result). Each such operation needs to be atomic by itself (by proper system support) in order to be considered an operation for a commutativity check. For example, read-read operations are commutative (unlike read-write and the other possibilities) and thus read-read is not a conflict. Another more complex example: the operations increment and decrement of a counter are both write operations (both modify the counter), but do not need to be considered conflicting (write-write conflict type) since they are commutative (thus increment-decrement is not a conflict; e.g., this has already been supported in the old IBM IMS “fast path”). Only precedence (time order) in pairs of conflicting (non-commutative) operations is important when checking equivalence to a serial schedule, since different schedules consisting of the same transactions can be transformed from one to another by changing orders between different transactions’ operations (different transactions’ interleaving), and since changing orders of commutative operations (non-conflicting) does not change an overall operation sequence result, i.e., a schedule outcome (the outcome is preserved through order change between non-conflicting operations, but typically not when conflicting operations change order). This means that if a schedule can be transformed to any serial schedule without changing orders of conflicting operations (but changing orders of non-conflicting ones, while preserving operation order inside each transaction), then the outcome of both schedules is the same, and the schedule is conflict-serializable by definition.

Conflicts are the reason for blocking transactions and delays (non-materialized conflicts), or for aborting transactions due to serializability violations prevention. Both possibilities reduce performance. Thus reducing the number of conflicts, e.g., by commutativity (when possible), is a way to increase performance. A transaction can issue/request a conflicting operation and be in conflict with another transaction while its conflicting operation is delayed and not executed (e.g., blocked by a lock). Only executed (materialized) conflicting operations are relevant to conflict serializability (see more below).

23.4 Enforcing conflict serializability

23.4.1 Testing conflict serializability

Schedule compliance with conflict serializability can be tested with the precedence graph (serializability graph, serialization graph, conflict graph) for committed transactions of the schedule. It is the directed graph representing precedence of transactions in the schedule, as reflected by precedence of conflicting operations in the transactions.

In the precedence graph transactions are nodes and precedence relations are directed edges. There exists an edge from a first transaction to a second transaction, if the second transaction is in conflict with the first (see Conflict serializability above), and the conflict is materialized (i.e., if the requested conflicting operation is actually executed: in many cases a requested/issued conflicting operation by a transaction is delayed and even never executed, typically by a lock on the operation’s object, held by another transaction, or when writing to a transaction’s temporary private workspace and materializing, copying to the database itself, upon commit; as long as a requested/issued conflicting operation is not executed upon the database itself, the conflict is non-materialized; non-materialized conflicts are not represented by an edge in the precedence graph).

Comment: In many text books only committed transactions are included in the precedence graph. Here all transactions are included for convenience in later discussions.

The following observation is a key characterization of conflict serializability:

A schedule is conflict-serializable if and only if its precedence graph of committed transactions (when only committed transactions are considered) is acyclic. This means that a cycle consisting of committed transactions only is generated in the (general) precedence graph, if and only if conflict-serializability is violated.

Cycles of committed transactions can be prevented by aborting an undecided (neither committed, nor aborted) transaction on each cycle in the precedence graph of all the transactions, which can otherwise turn into a cycle of committed transactions (and a committed transaction cannot be aborted). One transaction aborted per cycle is both required and sufficient in number to break and eliminate the cycle (more aborts are possible, and can happen in some mechanisms, but are unnecessary for serializability). The probability of cycle generation is typically low, but nevertheless, such a situation is carefully handled, typically with a considerable overhead, since correctness is involved. Transactions aborted due to serializability violation prevention are restarted and executed again immediately.

Serializability enforcing mechanisms typically do not maintain a precedence graph as a data structure, but rather prevent or break cycles implicitly (e.g., SS2PL below).
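A minimal sketch of this test: build the precedence graph from a chronological list of read/write steps (materialized operations only) and check it for a cycle. Transaction and data-item names are invented for the example.

def precedence_graph(schedule):
    # schedule: list of (txn, op, item) with op in {'r', 'w'}, in chronological order.
    # Edge Ti -> Tj whenever a later operation of Tj conflicts with an earlier one of Ti.
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visited, on_stack = set(), set()
    def dfs(v):
        visited.add(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w in on_stack or (w not in visited and dfs(w)):
                return True
        on_stack.discard(v)
        return False
    return any(dfs(v) for v in graph if v not in visited)

s = [("T1", "r", "x"), ("T2", "w", "x"), ("T2", "r", "y"), ("T1", "w", "y")]
print(precedence_graph(s))              # {('T1', 'T2'), ('T2', 'T1')}
print(has_cycle(precedence_graph(s)))   # True: the schedule is not conflict-serializable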

23.4.2 Common mechanism - SS2PL

Main article: Two-phase locking

Strong strict two phase locking (SS2PL) is a common mechanism utilized in database systems since their early days in the 1970s (the “SS” in the name SS2PL is newer though) to enforce both conflict serializability and strictness (a special case of recoverability which allows effective database recovery from failure) of a schedule. In this mechanism each datum is locked by a transaction before accessing it (any read or write operation): The item is marked by, and associated with, a lock of a certain type, depending on the operation (and the specific implementation; various models with different lock types exist; in some models locks may change type during the transaction’s life). As a result, access by another transaction may be blocked, typically upon a conflict (the lock delays or completely prevents the conflict from being materialized and reflected in the precedence graph by blocking the conflicting operation), depending on lock type and the other transaction’s access operation type. Employing an SS2PL mechanism means that all locks on data on behalf of a transaction are released only after the transaction has ended (either committed or aborted). SS2PL is the name of the resulting schedule property as well, which is also called rigorousness. SS2PL is a special case (proper subset) of Two-phase locking (2PL).

Mutual blocking between transactions results in a deadlock, where execution of these transactions is stalled, and no completion can be reached. Thus deadlocks need to be resolved to complete these transactions’ execution and release related computing resources. A deadlock is a reflection of a potential cycle in the precedence graph that would occur without the blocking when conflicts are materialized. A deadlock is resolved by aborting a transaction involved with such a potential cycle, and breaking the cycle. It is often detected using a wait-for graph (a graph of conflicts blocked by locks from being materialized; it can be also defined as the graph of non-materialized conflicts; conflicts not materialized are not reflected in the precedence graph and do not affect serializability), which indicates which transaction is “waiting for” lock release by which transaction, and a cycle means a deadlock. Aborting one transaction per cycle is sufficient to break the cycle. Transactions aborted due to deadlock resolution are restarted and executed again immediately.
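A toy sketch of the SS2PL discipline with exclusive locks only (no shared/exclusive modes, blocking queues, or deadlock handling, all of which a real lock manager needs): locks are taken before each access and released only when the transaction ends.

class SS2PLManager:
    # Strong strict 2PL: a transaction locks each item before its first access
    # and releases all of its locks only when it ends (commit or abort).
    def __init__(self):
        self.lock_owner = {}          # item -> transaction currently holding it

    def access(self, txn, item):
        owner = self.lock_owner.get(item)
        if owner is not None and owner != txn:
            return False              # conflict: the caller must block (or abort)
        self.lock_owner[item] = txn   # acquire (or re-use) the lock
        return True

    def end(self, txn):
        # Commit or abort: release every lock held by `txn` at once.
        for item in [i for i, t in self.lock_owner.items() if t == txn]:
            del self.lock_owner[item]

mgr = SS2PLManager()
print(mgr.access("T1", "x"))   # True  - T1 locks x
print(mgr.access("T2", "x"))   # False - T2 is blocked until T1 ends
mgr.end("T1")                  # T1 commits, releasing all of its locks
print(mgr.access("T2", "x"))   # True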

23.4.3 Other enforcing techniques

Other known mechanisms include:

• Precedence graph (or Serializability graph, Conflict graph) cycle elimination

• Two-phase locking (2PL)

• Timestamp ordering (TO)

• Serializable snapshot isolation[5] (SerializableSI)

The above (conflict) serializability techniques in their general form do not provide recoverability. Special enhancements are needed for adding recoverability.

Optimistic versus pessimistic techniques

Concurrency control techniques are of three major types:

1. Pessimistic: In Pessimistic concurrency control a transaction blocks data access operations of other transactions upon conflicts, and conflicts are non-materialized until blocking is removed. This is done to ensure that operations that may violate serializability (and in practice also recoverability) do not occur.

2. Optimistic: In Optimistic concurrency control data access operations of other transactions are not blocked upon conflicts, and conflicts are immediately materialized. When the transaction reaches the ready state, i.e., its running state has been completed, possible serializability (and in practice also recoverability) violation by the transaction’s operations (relatively to other running transactions) is checked: If violation has occurred, the transaction is typically aborted (sometimes aborting another transaction to handle serializability violation is preferred). Otherwise it is committed. 23.5. DISTRIBUTED SERIALIZABILITY 117

3. Semi-optimistic: Mechanisms that mix blocking in certain situations with not blocking in other situations and employ both materialized and non-materialized conflicts

The main difference between the technique types is the conflict types that are generated by them. A pessimistic method blocks a transaction operation upon conflict and generates a non-materialized conflict, while an optimistic method does not block and generates a materialized conflict. A semi-optimistic method generates both conflict types. Both conflict types are generated by the chronological orders in which transaction operations are invoked, independently of the type of conflict. A cycle of committed transactions (with materialized conflicts) in the precedence graph (conflict graph) represents a serializability violation, and should be avoided for maintaining serializability. A cycle of (non-materialized) conflicts in the wait-for graph represents a deadlock situation, which should be resolved by breaking the cycle. Both cycle types result from conflicts, and should be broken.

At any technique type conflicts should be detected and considered, with similar overhead for both materialized and non-materialized conflicts (typically by using mechanisms like locking, while either blocking for locks, or not blocking but recording conflict for materialized conflicts). In a blocking method typically a context switching occurs upon conflict, with (additional) incurred overhead. Otherwise blocked transactions’ related computing resources remain idle, unutilized, which may be a worse alternative. When conflicts do not occur frequently, optimistic methods typically have an advantage. With different transaction loads (mixes of transaction types) one technique type (i.e., either optimistic or pessimistic) may provide better performance than the other.

Unless schedule classes are inherently blocking (i.e., they cannot be implemented without data-access operations blocking; e.g., 2PL, SS2PL and SCO above; see chart), they can be implemented also using optimistic techniques (e.g., Serializability, Recoverability).

Serializable multi-version concurrency control

See also Multiversion concurrency control (partial coverage) and Serializable_Snapshot_Isolation in Snapshot isolation

Multi-version concurrency control (MVCC) is a common way today to increase concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions’ read operations of several last relevant versions (of each object), depending on scheduling method. MVCC can be combined with all the serializability techniques listed above (except SerializableSI, which is originally MVCC based). It is utilized in most general-purpose DBMS products.

MVCC is especially popular nowadays through the relaxed serializability (see above) method Snapshot isolation (SI), which provides better performance than most known serializability mechanisms (at the cost of possible serializability violation in certain cases). SerializableSI, which is an efficient enhancement of SI to make it serializable, is intended to provide an efficient serializable solution. SerializableSI has been analyzed[5][6] via a general theory of MVCC.
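A toy sketch of the multi-versioning idea itself (not of SI or SerializableSI): each write appends a new version stamped with its commit timestamp, and a reader sees, per object, the latest version committed no later than its snapshot timestamp. Timestamps and values are invented, and writes are assumed to arrive in commit order.

class MVCCStore:
    # Toy multi-version store: every write appends a (commit_ts, value) version.
    def __init__(self):
        self.versions = {}            # key -> list of (commit_ts, value), ts increasing

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # Latest value committed at or before the reader's snapshot timestamp.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

db = MVCCStore()
db.write("x", "v1", commit_ts=10)
db.write("x", "v2", commit_ts=20)
print(db.read("x", snapshot_ts=15))   # 'v1': a snapshot taken before commit ts 20
print(db.read("x", snapshot_ts=25))   # 'v2'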

23.5 Distributed serializability

23.5.1 Overview

Distributed serializability is the serializability of a schedule of a transactional distributed system (e.g., a distributed database system). Such a system is characterized by distributed transactions (also called global transactions), i.e., transactions that span computer processes (a process abstraction in a general sense, depending on computing environment; e.g., an operating system's thread) and possibly network nodes. A distributed transaction comprises more than one local sub-transaction, each of which has states as described above for a database transaction. A local sub-transaction comprises a single process, or more processes that typically fail together (e.g., in a single processor core). Distributed transactions imply a need for an Atomic commit protocol to reach consensus among its local sub-transactions on whether to commit or abort. Such protocols can vary from a simple (one-phase) hand-shake among processes that fail together, to more sophisticated protocols, like Two-phase commit, to handle more complicated cases of failure (e.g., process, node, communication, etc. failure). Distributed serializability is a major goal of distributed concurrency control for correctness. With the proliferation of the Internet, Cloud computing, Grid computing, and small, portable, powerful computing devices (e.g., smartphones) the need for effective distributed serializability techniques to ensure correctness in and among distributed applications seems to increase.

Distributed serializability is achieved by implementing distributed versions of the known centralized techniques.[1][2] Typically all such distributed versions require utilizing conflict information (either of materialized or non-materialized conflicts, or equivalently, transaction precedence or blocking information; conflict serializability is usually utilized) that is not generated locally, but rather in different processes, and remote locations. Thus information distribution is needed (e.g., precedence relations, lock information, timestamps, or tickets). When the distributed system is of a relatively small scale, and message delays across the system are small, the centralized concurrency control methods can be used unchanged, while certain processes or nodes in the system manage the related algorithms. However, in a large-scale system (e.g., Grid and Cloud), due to the distribution of such information, substantial performance penalty is typically incurred, even when distributed versions of the methods (Vs. centralized) are used, primarily due to computer and communication latency. Also, when such information is distributed, related techniques typically do not scale well. A well-known example with scalability problems is a distributed lock manager, which distributes lock (non-materialized conflict) information across the distributed system to implement locking techniques.

23.6 See also

• Strong strict two-phase locking (SS2PL or Rigorousness).

• Making snapshot isolation serializable[5] in Snapshot isolation.

• Global serializability, where the Global serializability problem and its proposed solutions are described.

• Linearizability, a more general concept in

23.7 Notes

[1] Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman (1987): Concurrency Control and Recovery in Database Systems (free PDF download), Addison Wesley Publishing Company, ISBN 0-201-10715-5

[2] Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems, Elsevier, ISBN 1-55860-508-8

[3] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. Proceedings of the 20th annual international symposium on Computer architecture (ISCA '93). Volume 21, Issue 2, May 1993.

[4] Gray, J.; Helland, P.; O’Neil, P.; Shasha, D. (1996). The dangers of replication and a solution (PDF). Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. pp. 173–182. doi:10.1145/233269.233330.

[5] Michael J. Cahill, Uwe Röhm, Alan D. Fekete (2008): “Serializable isolation for snapshot databases”, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 729-738, Vancouver, Canada, June 2008, ISBN 978-1-60558-102-6 (SIGMOD 2008 best paper award)

[6] Alan Fekete (2009), “Snapshot Isolation and Serializable Execution”, Presentation, Page 4, 2009, The university of Sydney (Australia). Retrieved 16 September 2009

23.8 References

• Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman (1987): Concurrency Control and Recovery in Database Systems, Addison Wesley Publishing Company, ISBN 0-201-10715-5

• Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems, Elsevier, ISBN 1-55860-508-8

Chapter 24

Shared register

In distributed computing, shared-memory systems and message-passing systems are two means of interprocess com- munication which have been heavily studied. In shared-memory systems, processes communicate by accessing shared data structures. A shared (read-write) register, sometimes just called a register, is a fundamental type of shared data structure which stores a value and has two operations: Read, which returns the value stored in the register, and Write, which updates the value stored. Other types of shared data structures include read-modify-write, test-and-set, compare-and-swap etc. The memory location which is concurrently accessed is sometimes called a register.

24.1 Classification

Registers can be classified according to the consistency condition they satisfy when accessed concurrently, the domain of possible values that can be stored, and how many processes can access them with the Read or Write operation, which leads to a total of 24 register types.[1] When a Read and a Write happen concurrently, the value returned by the Read may not be uniquely determined. Lamport defined three types of registers: safe registers, regular registers and atomic registers.[1] A Read operation of a safe register can return any value if it is concurrent with a Write operation, and returns the value written by the most recent Write operation if the Read does not overlap with any Write. A regular register differs from a safe register in that the Read operation can return the value written by either the most recent completed Write operation or a Write operation it overlaps with. An atomic register satisfies the stronger condition of being linearizable. Registers can also be characterized by how many processes can access them with a Read or Write operation. A single-writer (SW) register can be written by only one process, and a multiple-writer (MW) register can be written by multiple processes. Similarly, a single-reader (SR) register can be read by only one process, and a multiple-reader (MR) register can be read by multiple processes. For a SWSR register, it is not necessary that the writer process and the reader process are the same.

24.2 Constructions

The figure below illustrates the constructions stage by stage, from the implementation of a SWSR register in an asynchronous message-passing system to the implementation of a MWMR register using a SW snapshot object. This kind of construction is sometimes called simulation or emulation.[2] In each stage (except Stage 3), the object type on the right can be implemented by the simpler object type on the left. The constructions of each stage (except Stage 3) are briefly presented below. There is a separate article which discusses the details of constructing snapshot objects. An implementation is linearizable if, for every execution, there is a linearization ordering that satisfies the following two properties: (1) if operations were done sequentially in the order of their linearization, they would return the same results as in the concurrent execution; (2) if operation op1 ends before operation op2 begins, then op1 comes before op2 in the linearization.
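To make these two conditions concrete, the following brute-force sketch searches for a linearization of a small read/write register history. It is exponential and only meant for tiny examples; the operation encoding and the initial value 0 are assumptions made here.

from itertools import permutations

# An operation is (kind, value, start, end): kind "W" wrote value, kind "R" returned value.
def is_linearizable(history, initial=0):
    for order in permutations(history):
        pos = {op: k for k, op in enumerate(order)}
        # Property 2: respect the real-time order of non-overlapping operations.
        if any(pos[a] > pos[b] for a in history for b in history if a[3] < b[2]):
            continue
        # Property 1: executed sequentially in this order, results must match.
        value, ok = initial, True
        for kind, v, _, _ in order:
            if kind == "W":
                value = v
            elif v != value:
                ok = False
                break
        if ok:
            return True
    return False

# A Write of 1 overlaps a Read that returns 1: linearizable.
print(is_linearizable([("W", 1, 0, 4), ("R", 1, 2, 6)]))    # True
# A Read that ends before the Write even begins cannot return 1.
print(is_linearizable([("W", 1, 5, 6), ("R", 1, 0, 3)]))    # False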


Shared Register Stages of Constructions

24.2.1 Implementing an atomic SWSR register in a message passing system

A SWSR atomic (linearizable) register can be implemented in an asynchronous message-passing system, even if processes may crash. There is no time limit for processes to deliver messages to receivers or to execute local instructions. In other words, processes cannot distinguish between a process which responds slowly and one which has simply crashed.

Implementation of Atomic SWSR Register in MP System

The implementation given by Attiya, Bar-Noy and Dolev[3] requires n > 2f, where n is the total number of processes in the system and f is the maximum number of processes that can crash during the execution. The algorithm is shown in the figure above. The linearization order of operations is: linearize the WRITEs in the order in which they occur, and insert each READ after the WRITE whose value it returns. To check that the implementation is linearizable, the interesting case of property 2 is when op1 is a WRITE, op2 is a READ, and the READ starts immediately after the WRITE ends. We show by contradiction that the READ must see the WRITE: assume it does not; then, according to the implementation, there must be two disjoint sets of size (n−f) among the n processes, so 2(n−f) ≤ n, leading to n ≤ 2f, which contradicts n > 2f. So the READ must read at least one value written by that WRITE.
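The quorum logic behind this construction can be sketched in Python as below. Real message passing, crashes and asynchrony are abstracted away, and the replica representation, the last_read bookkeeping (standing in for the mechanism that keeps successive reads from returning older values) and the helper names are assumptions made for the sketch.

import random

class ABDRegisterSWSR:
    """Quorum logic of an atomic SWSR register over n replicas, tolerating f < n/2 crashes."""
    def __init__(self, n, f):
        assert n > 2 * f                                   # required by the construction
        self.n, self.f = n, f
        self.replicas = [{"ts": 0, "val": None} for _ in range(n)]
        self.ts = 0                                        # the single writer's timestamp
        self.last_read = {"ts": 0, "val": None}            # freshest pair the reader has returned

    def _quorum(self):
        # Any n - f replicas may answer; which ones respond is not under our control.
        return random.sample(range(self.n), self.n - self.f)

    def write(self, value):
        self.ts += 1
        for i in self._quorum():                           # "wait" for n - f acknowledgements
            self.replicas[i] = {"ts": self.ts, "val": value}

    def read(self):
        replies = [self.replicas[i] for i in self._quorum()]   # "wait" for n - f replies
        best = max(replies, key=lambda r: r["ts"])             # highest timestamp wins
        if best["ts"] > self.last_read["ts"]:
            self.last_read = best
        return self.last_read["val"]

reg = ABDRegisterSWSR(n=5, f=2)
reg.write("x")
print(reg.read())   # "x": any two sets of n - f replicas intersect, since 2(n - f) > n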

24.2.2 Implementing a SWMR register from SWSR registers

A SWMR register can be written by only one process but can be read by multiple processes. Let n be the number of processes which can read the SWMR register, and let Rᵢ, 0 ≤ i ≤ n−1, be the registers used in the construction illustrated below.

Implementation of SWMR register using SWSR registers

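A standard way to build an atomic SWMR register out of SWSR registers, in the style of the Attiya–Welch textbook construction, is sketched below. The writer stamps each value with a sequence number and writes it into one SWSR register per reader (Val[i]); each reader additionally reports the pair it returned to every other reader (Report[i][j]), so no reader ever returns a value older than one already returned. The names Val and Report and the single-process framing are assumptions made for this sketch and are not taken from the figure above.

class SWMRFromSWSR:
    """Sketch of an atomic SWMR register built from SWSR registers."""
    def __init__(self, n_readers, initial=None):
        self.n = n_readers
        # Val[i]: written only by the writer, read only by reader i.
        self.val = [(0, initial) for _ in range(self.n)]
        # Report[i][j]: written only by reader i, read only by reader j.
        self.report = [[(0, initial) for _ in range(self.n)] for _ in range(self.n)]
        self.seq = 0                                       # the single writer's sequence number

    def write(self, value):
        self.seq += 1
        for i in range(self.n):
            self.val[i] = (self.seq, value)                # one SWSR write per reader

    def read(self, i):
        candidates = [self.val[i]] + [self.report[j][i] for j in range(self.n)]
        seq, value = max(candidates, key=lambda pair: pair[0])   # freshest pair seen
        for j in range(self.n):
            self.report[i][j] = (seq, value)               # tell the other readers what was returned
        return value

reg = SWMRFromSWSR(n_readers=3, initial=0)
reg.write(42)
print(reg.read(0), reg.read(2))   # 42 42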

24.2.3 Implementing a MWMR register from a SW Snapshot object

We can use a SW snapshot object of size n to construct a MWMR register. The linearization order is as follows: order the WRITE operations by their t-values; if several WRITEs have the same t-value, order the one with the smaller process ID first. Insert each READ right after the WRITE whose value it returns, breaking ties by process ID and, if still tied, by start time.
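The sketch below illustrates this construction. The snapshot object is replaced by a trivial lock-based stand-in, which is an assumption made for the example; the point is only how WRITE and READ are derived from update and scan.

import threading

class TrivialSnapshot:
    """Stand-in for a single-writer snapshot object of size n (lock-based, not wait-free)."""
    def __init__(self, n):
        self._parts = [(0, None)] * n            # component i holds a (t-value, value) pair
        self._lock = threading.Lock()

    def update(self, i, entry):
        with self._lock:
            self._parts[i] = entry

    def scan(self):
        with self._lock:
            return list(self._parts)

class MWMRRegister:
    """MWMR register built on top of a snapshot object, as described above."""
    def __init__(self, n_procs):
        self._snap = TrivialSnapshot(n_procs)

    def write(self, pid, value):
        view = self._snap.scan()                 # scan first ...
        t = 1 + max(t for t, _ in view)          # ... pick a t-value above everything seen ...
        self._snap.update(pid, (t, value))       # ... then update this process's component

    def read(self):
        view = self._snap.scan()
        t, pid, value = max((t, i, v) for i, (t, v) in enumerate(view))
        return value                             # largest (t-value, process ID) pair wins

reg = MWMRRegister(n_procs=3)
reg.write(1, "a")
reg.write(2, "b")
print(reg.read())   # "b": the later WRITE chose a larger t-value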

24.3 See also

• Hardware Register

• Distributed shared memory

• Shared snapshot objects

24.4 References

[1] Kshemkalyani, Ajay D.; Singhal, Mukesh (2008). Distributed computing : principles, algorithms, and systems. Cambridge: Cambridge University Press. pp. 435–437. ISBN 9780521876346.

[2] Attiya, Hagit; Welch, Jennifer (Mar 25, 2004). Distributed computing: fundamentals, simulations, and advanced topics. John Wiley & Sons, Inc. ISBN 978-0-471-45324-6.

[3] Attiya, Hagit; Bar-Noy, Amotz; Dolev, Danny (1990). “Sharing Memory Robustly in Message-passing Systems”. Proceedings of the Ninth Annual ACM Symposium on Principles of Distributed Computing. PODC '90: 363–375. doi:10.1145/93385.93441.

Chapter 25

Shared snapshot objects

In distributed computing, a shared snapshot object is a type of data structure which is shared between several threads or processes. For many tasks, it is important to have a data structure that can provide a consistent view of the state of the memory. In practice, it turns out that it is not possible to get such a consistent state of the memory by just accessing one shared register after another, since the values stored in individual registers can be changed at any time during this process. To solve this problem, snapshot objects store a vector of n components and provide the following two atomic operations: update(i,v) changes the value in the ith component to v, and scan() returns the values stored in all n components.[1][2] Snapshot objects can be constructed using atomic single-writer multi-reader shared registers. In general, one distinguishes between single-writer multi-reader (swmr) snapshot objects and multi-writer multi-reader (mwmr) snapshot objects. In a swmr snapshot object, the number of components matches the number of processes, only one process Pᵢ is allowed to write to memory position i, and all the other processes are allowed to read the memory. In contrast, in a mwmr snapshot object all processes are allowed to write to all positions of the memory and are allowed to read the memory as well.

25.1 General

A shared memory is partitioned into multiple parts. Each of these parts holds a single data value. In the single-writer multi-reader case, each process Pᵢ has a memory position i assigned and only this process is allowed to write to that memory position. However, every process is allowed to read any position in the memory. In the multi-writer multi-reader case, the restriction changes and any process is allowed to change any position of the memory. Any process Pᵢ, i ∈ {1,...,n}, in an n-process system is able to perform two operations on the snapshot object: scan() and update(i,v). The scan operation has no arguments and returns a consistent view of the memory. The update(i,v) operation updates the memory at position i with the value v. Both types of operations are considered to occur atomically between the call by the process and the return by the memory. More generally speaking, in the data vector d each entry dₖ corresponds to the argument of the last linearized update operation that updates part k of the memory.[1] In order to get the full benefit of shared snapshot objects, in terms of simplifications for validations and constructions, two other restrictions are added to the construction of snapshot objects.[1] The first restriction is an architectural one: any snapshot object is constructed only with single-writer multi-reader registers as the basic element. This is possible for single-writer multi-reader snapshots. For multi-writer multi-reader snapshot objects it is possible to use multi-writer multi-reader registers, which can in turn be constructed from single-writer multi-reader registers.[1][3][4] In distributed computing, the construction of a system is driven by the goal that the whole system makes progress during the execution. Thus, the behaviour of one process should not bring the whole system to a halt (lock-freedom). The stronger version of this is the property of wait-freedom, meaning that no process can prevent another process from terminating its operation. More generally, this means that every operation has to terminate after a finite number of steps regardless of the behaviour of other processes. A very basic snapshot algorithm guarantees system-wide progress, but is only lock-free. It is easy to extend this algorithm so that it is wait-free. The algorithm by Afek et al.,[1] which is presented in the section Implementation, has this property.


25.2 Implementation

Several methods exist to implement shared snapshot objects. The first algorithm presented below provides a principal implementation of a snapshot object. However, this implementation only provides the property of lock-freedom. The second presented implementation, proposed by Afek et al.,[1] has the stronger property of wait-freedom. An overview of other implementations is given by Fich.[2]

25.2.1 Basic swmr snapshot algorithm

The basic idea of this algorithm is that every process executing the scan() operation reads all the memory values twice. If both reads (“collects”) return exactly the same memory content, no other process changed a value in between, and the result can be returned. A process which executes an update(i,v) operation simply updates its value in the memory.

function scan()
    while true do
        a[1..n] := collect
        b[1..n] := collect
        if (∀ i ∈ {1,...,n}) location i was not changed between the reads of it during the two collects then
            return b    // double collect successful
    end while
end function

function update(i,v)
    M[i] := v
end function

This algorithm provides a very basic implementation of snapshot objects.

Fig.1: One process always updates the memory during the double collect of the other process. Thus, the scanning process is never able to terminate.

It guarantees that the system as a whole makes progress, while individual processes may starve due to the behaviour of other processes. A process Pᵢ can prevent another process Pⱼ from terminating a scan() operation by always changing its value between the two memory collects. Thus, the algorithm is lock-free, but not wait-free. To satisfy the stronger property, no process may starve due to the behaviour of other processes. Figure 1 illustrates the problem: while P1 tries to execute the scan() operation, a second process P2 repeatedly disturbs the “double collect”. Thus, the scanning process always has to restart the operation, never terminates, and starves.

25.2.2 Single-Writer Multi-Reader implementation by Afek et al.

The basic idea of the swmr snapshot algorithm by Afek et al. is that a process can detect whether another process changed its memory location, and that processes help each other. In order to detect whether another process changed its value, a counter is attached to each register and a process increases the counter on every update. The second idea is that every process which updates its memory position also performs a scan() operation and provides its “view of the memory” in its register to other processes. A scanning process can then borrow this scan result and return it.

Based on unbounded memory

Using this idea one can construct a wait-free algorithm that uses registers of unbounded size. A process performing an update operation can help another process to complete its scan. The basic idea is that if a process sees another process

updating a memory location twice, that process must have executed a complete, linearized update operation in between. To implement this, every update operation first performs a scan of the memory and then atomically writes the new value v together with a sequence number and the scanned view. If a process performing a scan of the memory detects that another process updated its part of the memory twice, it can “borrow” the “embedded” scan of that update to complete its own scan operation.[1]

function scan()   // returns a consistent view of the memory
    for j = 1 to n do moved[j] := 0 end
    while true do
        a[1..n] := collect    // collects (data, sequence, view) triples
        b[1..n] := collect    // collects (data, sequence, view) triples
        if (∀ j ∈ {1,...,n}) a[j].seq = b[j].seq then
            return (b[1].data, ..., b[n].data)    // no process changed its register
        else
            for j = 1 to n do
                if a[j].seq ≠ b[j].seq then       // process j moved
                    if moved[j] = 1 then          // process j already moved before
                        return b[j].view
                    else
                        moved[j] := moved[j] + 1
    end while
end function

procedure update(i,v)   // writes the data value, the incremented sequence number and the embedded scan
    s[1..n] := scan      // embedded scan
    rᵢ := (v, rᵢ.seq + 1, s[1..n])
end procedure

Fig.2: Example linearization order for a single-writer multi-reader snapshot object. The first scan() can successfully perform a double-collect, while the “double-collect” of the second scan is interrupted twice by the second process. Thus, the process borrows an embedded scan.

Every register consists of a field for the data value, the sequence number, and a field for the result of the last embedded scan, collected before the last update. In each scan operation, the process Pᵢ can use the sequence numbers to decide whether another process changed its register. If there is no change to the memory during the double collect, Pᵢ can return the result of the second collect. Once the process observes that another process updated the memory in between, it saves this information in the moved field. If a process Pⱼ changed its register twice during the execution of the scan(), the scanning process Pᵢ can return the embedded scan of the updating process, which Pⱼ saved in its own register during its update operation. These operations can be linearized by linearizing each update() operation at its write to the register. The scan operation is more complicated to linearize. If the double collect of the scan operation is successful, the scan operation can be linearized at the end of its second collect. In the other case (some process updated its register two times), the operation can be linearized at the time the updating process collected its embedded scan, before writing its value to the register.[1]

Based on bounded memory

One of the limitations of the presented algorithm is that it is based on unbounded memory, since the sequence numbers grow without bound. To overcome this limitation, it is necessary to introduce a different way to detect whether a process has changed its memory position twice in between. Every pair of processes ⟨Pᵢ,Pⱼ⟩ communicates using two single-writer single-reader (swsr) registers, which together contain two atomic handshake bits. Before a process starts to perform a “double collect”, it copies the handshake bit of its partner process into its own register. If the scanning process Pᵢ observes, after executing the “double collect”, that the handshake bit of a partner process Pⱼ has changed in between, this indicates that Pⱼ has performed an update operation on the memory.[1]

function scan()   // returns a consistent view of the memory
    for j = 1 to n do moved[j] := 0 end
    while true do
        for j = 1 to n do qᵢ,ⱼ := rⱼ.pⱼ,ᵢ end    // collect the handshake bits
        a[1..n] := collect    // collects (data, bit-vector, toggle, view) tuples
        b[1..n] := collect    // collects (data, bit-vector, toggle, view) tuples
        if (∀ j ∈ {1,...,n}) (a[j].pⱼ,ᵢ = b[j].pⱼ,ᵢ = qᵢ,ⱼ) and (a[j].toggle = b[j].toggle) then
            return (b[1].data, ..., b[n].data)    // no process changed its register
        else
            for j = 1 to n do
                if (a[j].pⱼ,ᵢ ≠ qᵢ,ⱼ) or (b[j].pⱼ,ᵢ ≠ qᵢ,ⱼ) or (a[j].toggle ≠ b[j].toggle) then    // process j performed an update
                    if moved[j] = 1 then    // process j already moved before
                        return b[j].view
                    else
                        moved[j] := moved[j] + 1
    end while
end function

procedure update(i,v)   // writes the data value, the complemented handshake bits, the inverted toggle bit and the embedded scan
    for j = 1 to n do f[j] := ¬qⱼ,ᵢ end
    s[1..n] := scan      // embedded scan
    rᵢ := (v, f[1..n], ¬rᵢ.toggle, s[1..n])
end procedure

The unbounded sequence number is replaced by two handshake bits for every pair of processes. These handshake bits are based on swsr registers and can be viewed as an n × n matrix M, where process Pᵢ is allowed to write to row i and to read the handshake bits in column i. Before the scanning process performs the double collect, it collects the handshake bits of all other processes by reading its column. Afterwards, it can decide whether a process changed its value during the double collect: it simply compares the column again with the initially read handshake bits. If a single process Pⱼ has written twice during the collection by Pᵢ, it is possible that the handshake bits have not changed. Thus, it is necessary to introduce another bit, called the “toggle bit”, which is changed on every write. This makes it possible to distinguish two consecutive writes even if no other process updated its register. This approach allows the unbounded sequence numbers to be replaced by the handshake bits without changing anything else in the scan procedure. While the scanning process Pᵢ uses the handshake bits to decide whether it can use its double collect or not, other processes may also perform update operations. As a first step, an updating process reads the handshake bits provided by the other processes and generates their complement. Afterwards it generates the embedded scan and saves the updated data value, the collected (complemented) handshake bits, the complemented toggle bit and the embedded scan to its register. Since the handshake bits equivalently replace the sequence numbers, the linearization is the same as in the unbounded memory case.

25.2.3 Multi-Writer Multi-Reader implementation by Afek et al.

The construction of a multi-writer multi-reader snapshot object assumes that n processes are allowed to write to any location of the memory, which consists of m registers. So there is no longer any correlation between process id and memory location. Therefore it is no longer possible to couple the handshake bits and the embedded scan with the data field: the handshake bits, the data memory and the embedded scan cannot be stored in the same register, and the write to the memory is no longer an atomic operation.

Fig.3: An example linearization for a multi-writer multi-reader snapshot object

Therefore, the update() operation has to update three different registers independently: it first saves the handshake bits it read, then performs the embedded scan, and finally saves its value to the designated memory position. Each individual write appears to be done atomically, but together they are not. The new update() procedure leads to some changes in the scan() function.

It is no longer sufficient to read the handshake bits and collect the memory contents twice. In order to detect an update that has just begun, a process has to collect the handshake bits a second time after collecting the memory contents. If a double collect fails, it is now necessary that a process sees another process move three times before borrowing its embedded scan. Figure 3 illustrates the problem. The first double collect fails because an update that started before the scan operation performs its memory write during the first double collect. However, the embedded scan of this write was performed and saved before P1 started scanning the memory, and is therefore not a valid linearization point. The second double collect fails because process P2 starts a second write and updates its handshake bits. In the swmr scenario, we would now borrow the embedded scan and return it. In the mwmr scenario, this is not possible, because the embedded scan of the second write may still not be linearized within the scan interval (the time between the beginning and the end of the scan operation). Thus, the process has to see a third change by the other process to be entirely sure that at least one embedded scan has been linearized within the scan interval. After the third change by one process, the scanning process can borrow the embedded scan without violating the linearization criterion.

25.3 Complexity

The basic implementation of shared snapshot objects by Afek et al. presented above needs O(n²) memory operations.[1] Another implementation by Anderson, which was developed independently, needs an exponential number of operations, O(2ⁿ).[5] There are also randomized implementations of snapshot objects based on swmr registers using O(n log² n) operations.[6] Another implementation by Israeli and Shirazi, using unbounded memory, requires O(n^(3/2) log² n) operations on the memory.[7][8] Israeli et al. show in a different work a lower bound on the number of low-level operations needed for any update operation. This lower bound is Ω(min{w, r}), where w is the number of updaters and r is the number of scanners. Attiya and Rachman present a deterministic snapshot algorithm based on swmr registers which uses O(n log n) operations per update and scan.[8] Applying a general method by Israeli, Shaham, and Shirazi,[9] this can be improved to an unbounded snapshot algorithm which only needs O(n log n) operations per scan and O(n) operations per update. There are further improvements introduced by Inoue et al.,[10] using only a linear number of read and write operations. In contrast to the other presented methods, this approach uses mwmr registers rather than swmr registers.

25.4 Applications

There are several algorithms in distributed computing which can be simplified in design and/or verification using shared snapshot objects.[1] Examples of this are exclusion problems,[11][12][13] concurrent time-stamp systems,[14] approximate agreement,[15] randomized consensus[16][17] and wait-free implementations of other data structures.[18] With mwmr snapshot objects it is also possible to create atomic multi-writer multi-reader registers.

25.5 See also

• Shared register

• Shared memory

• Distributed shared memory

• Linearizability

25.6 References

[1] Afek, Yehuda; Attiya, Hagit; Dolev, Danny; Gafni, Eli; Merritt, Michael; Shavit, Nir (Sep 1993). “Atomic Snapshots of Shared Memory”. J. ACM 40 (4): 873–890. doi:10.1145/153724.153741. Retrieved 14 November 2014.

[2] Fich, Faith Ellen (2005). How hard is it to take a snapshot? (SOFSEM 2005: Theory and Practice of Computer Science ed.). Springer Berlin Heidelberg. pp. 28–37. ISBN 978-3-540-24302-1. Retrieved 14 November 2014.

[3] Li, Ming; Tromp, John; Vitanyi, Paul M. B. (July 1996). “How to Share Concurrent Wait-free Variables”. J. ACM 43 (4): 723–746. doi:10.1145/234533.234556. Retrieved 23 November 2014.

[4] Peterson, Gary L; Burns, James E. (1987). “Concurrent reading while writing ii: the multi-writer case”. Foundations of Computer Science, 1987., 28th Annual Symposium on. pp. 383–392.

[5] Anderson, James H (1993). “Composite registers”. Distributed Computing (Springer) 6 (3): 141–154. doi:10.1007/BF02242703. Retrieved 14 November 2014.

[6] Attiya, Hagit; Herlihy, Maurice; Rachman, Ophir (1995). “Atomic snapshots using lattice agreement”. Distributed Computing 8 (3): 121–132. doi:10.1007/BF02242714. Retrieved 14 November 2014.

[7] Israeli, Amos; Shirazi, Asaf (1992). “Efficient snapshot protocol using 2-lattice agreement”. manuscript.

[8] Attiya, Hagit; Rachman, Ophir (April 1998). “Atomic Snapshots in O( n log n ) Operations”. SIAM Journal on Computing 27 (2): 319–340. doi:10.1145/164051.164055. Retrieved 14 November 2014.

[9] Israeli, Amos; Shaham, Amnon; Shirazi, Asaf (1993). “Linear-time snapshot protocols for unbalanced systems”. Distributed Algorithms. Springer. pp. 26–38. doi:10.1007/3-540-57271-6_25. ISBN 978-3-540-57271-8.

[10] Inoue, Michiko; Masuzawa, Toshimitsu; Chen, Wei; Tokura, Nobuki (1994). Distributed Algorithms 857. Springer. pp. 130–140. doi:10.1007/BFb0020429. ISBN 978-3-540-58449-0. Retrieved 14 November 2014.

[11] Dolev, Danny; Gafni, Eli; Shavit, Nir (1988). “Toward a non-atomic era: l-exclusion as a test case”. pp. 78–92.

[12] Katseff, Howard P (1978). “A new solution to the critical section problem”. pp. 86–88.

[13] Lamport, Leslie (1988). “The mutual exclusion problem: part II—statement and solutions”. Journal of the ACM (JACM) 33 (2): 327–348.

[14] Dolev, Danny; Shavit, Nir (1989). “Bounded concurrent time-stamp systems are constructible”. ACM. pp. 454–466.

[15] Attiya, Hagit; Lynch, Nancy; Shavit, Nir (1990). “Are wait-free algorithms fast?”. pp. 55–64.

[16] Abrahamson, Karl (1988). “On achieving consensus using a shared memory”. pp. 291–302.

[17] Attiya, Hagit; Dolev, Danny; Shavit, Nir (1989). Bounded polynomial randomized consensus. pp. 281–293.

[18] Aspnes, James; Herlihy, Maurice (1990). “Wait-free data structures in the asynchronous PRAM model”. ACM. pp. 340–349.

Chapter 26

State machine replication

In computer science, state machine replication or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.

26.1 Problem definition

26.1.1 Distributed services

Distributed software is often structured in terms of clients and services. Each service comprises one or more servers and exports operations that clients invoke by making requests. Although using a single, centralized server is the simplest way to implement a service, the resulting service can only be as fault tolerant as the processor executing that server. If this level of fault tolerance is unacceptable, then multiple servers that fail independently must be used. Usually, replicas of a single server are executed on separate processors of a distributed system, and protocols are used to coordinate client interactions with these replicas. The physical and electrical isolation of processors in a distributed system ensures that server failures are independent, as required.

26.1.2 State machine

Main article: Finite-state machine

For the subsequent discussion a State Machine will be defined as the following tuple of values [1] (See also Mealy machine and Moore Machine):

• A set of States

• A set of Inputs

• A set of Outputs

• A transition function (Input × State → State)

• An output function (Input × State → Output)

• A distinguished State called Start.

A State Machine begins at the State labeled Start. Each Input received is passed through the transition and output function to produce a new State and an Output. The State is held stable until a new Input is received, while the Output is communicated to the appropriate receiver. This discussion requires a State Machine to be deterministic: multiple copies of the same State Machine begun in the Start state, and receiving the same Inputs in the same order will arrive at the same State having generated the same Outputs.
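As a small illustration of this determinism, the following sketch runs two copies of the same state machine on the same Inputs and checks that their States and Outputs agree; the turnstile transition table is a hypothetical example, not taken from the article.

# Transition and output functions of a tiny deterministic state machine (a turnstile).
TRANSITION = {("locked", "coin"): "unlocked", ("locked", "push"): "locked",
              ("unlocked", "coin"): "unlocked", ("unlocked", "push"): "locked"}
OUTPUT = {("locked", "coin"): "unlock", ("locked", "push"): "reject",
          ("unlocked", "coin"): "refund", ("unlocked", "push"): "lock"}

class StateMachine:
    def __init__(self, start="locked"):
        self.state = start

    def step(self, inp):
        out = OUTPUT[(self.state, inp)]
        self.state = TRANSITION[(self.state, inp)]
        return out

inputs = ["coin", "push", "push", "coin"]
replica_a, replica_b = StateMachine(), StateMachine()
outputs_a = [replica_a.step(i) for i in inputs]
outputs_b = [replica_b.step(i) for i in inputs]
assert outputs_a == outputs_b and replica_a.state == replica_b.state
print(outputs_a, replica_a.state)   # ['unlock', 'lock', 'reject', 'unlock'] unlocked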


State Machines can implement any algorithm when driven by an appropriate Input stream, including Turing-complete algorithms (see Turing machine). Typically, systems based on State Machine Replication voluntarily restrict their implementations to use finite-state machines to simplify error recovery.

26.1.3 Fault Tolerance

Determinism is an ideal characteristic for providing fault-tolerance. Intuitively, if multiple copies of a system exist, a fault in one would be noticeable as a difference in the State or Output from the others. A little deduction shows the minimum number of copies needed for fault-tolerance is three; one which has a fault, and two others to whom we compare State and Output. Two copies is not enough; there is no way to tell which copy is the faulty one. Further deduction shows a three-copy system can support at most one failure (after which it must repair or replace the faulty copy). If more than one of the copies were to fail, all three States and Outputs might differ, and there would be no way to choose which is the correct one. In general a system which supports F failures must have 2F+1 copies (also called replicas).[2] The extra copies are used as evidence to decide which of the copies are correct and which are faulty. Special cases can improve these bounds.[3] All of this deduction pre-supposes that replicas are experiencing only random independent faults such as memory errors or hard-drive crash. Failures caused by replicas which attempt to lie, deceive, or collude can also be handled by the State Machine Approach, with isolated changes. It should be noted that failed replicas are not required to stop; they may continue operating, including generating spurious or incorrect Outputs.

Special Case: Fail-Stop

Theoretically, if a failed replica is guaranteed to stop without generating outputs, only F+1 replicas are required, and clients may accept the first output generated by the system. No existing systems achieve this limit, but it is often used when analyzing systems built on top of a fault-tolerant layer (Since the fault-tolerant layer provides fail-stop semantics to all layers above it).

Special Case: Byzantine Failure

Faults where a replica sends different values in different directions (for instance, the correct Output to some of its fellow replicas and incorrect Outputs to others) are called Byzantine Failures.[4] Byzantine failures may be random, spurious faults, or malicious, intelligent attacks. 2F+1 replicas with non-cryptographic hashes suffice to survive all non-malicious Byzantine failures (with high probability). Malicious attacks require cryptographic primitives to achieve 2F+1 (using message signatures), or non-cryptographic techniques can be applied but the number of replicas must be increased to 3F+1.

26.2 The State Machine Approach

The preceding intuitive discussion implies a simple technique for implementing a fault-tolerant service in terms of a State Machine:

1. Place copies of the State Machine on multiple, independent servers.

2. Receive client requests, interpreted as Inputs to the State Machine.

3. Choose an ordering for the Inputs.

4. Execute Inputs in the chosen order on each server.

5. Respond to clients with the Output from the State Machine.

6. Monitor replicas for differences in State or Output.

The remainder of this article develops the details of this technique.

• Steps 1 and 2 are outside the scope of this article.

• Step 3 is the critical operation; see Ordering Inputs.

• Step 4 is covered by the State Machine Definition.

• Step 5: see Sending Outputs.

• Step 6: see Auditing and Failure Detection.

The appendix contains discussion on typical extensions used in real-world systems such as Logging, Checkpoints, Reconfiguration, and State Transfer.

26.2.1 Ordering Inputs

The critical step in building a distributed system of State Machines is choosing an order for the Inputs to be processed. Since all non-faulty replicas will arrive at the same State and Output if given the same Inputs, it is imperative that the Inputs are submitted in an equivalent order at each replica. Many solutions have been proposed in the literature.[1][5][6][7][8] A Visible Channel is a communication path between two entities actively participating in the system (such as clients and servers). Example: client to server, server to server. A Hidden Channel is a communication path which is not revealed to the system. Example: client to client channels are usually hidden, such as users communicating over a telephone, or a process writing files to disk which are read by another process. When all communication paths are visible channels and no hidden channels exist, a partial global order (Causal Order) may be inferred from the pattern of communications.[7][9] Causal Order may be derived independently by each server. Inputs to the State Machine may be executed in Causal Order, guaranteeing consistent State and Output for all non-faulty replicas. In open systems, hidden channels are common and a weaker form of ordering must be used. An order of Inputs may be defined using a voting protocol whose results depend only on the visible channels. The problem of voting for a single value by a group of independent entities is called Consensus. By extension, a series of values may be chosen by a series of consensus instances. This problem becomes difficult when the participants or their communication medium may experience failures.[2] Inputs may be ordered by their position in the series of consensus instances (Consensus Order).[6] Consensus Order may be derived independently by each server. Inputs to the State Machine may be executed in Consensus Order, guaranteeing consistent State and Output for all non-faulty replicas.
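One standard mechanism for inferring a causality-consistent order from the communication pattern is Lamport's logical clock (reference [9]). The sketch below is illustrative only; the Process class and the message format are assumptions made for the example.

class Process:
    """Maintains a Lamport logical clock; timestamps induce an order consistent with causality."""
    def __init__(self, pid):
        self.pid, self.clock = pid, 0

    def local_event(self):
        self.clock += 1
        return (self.clock, self.pid)

    def send(self):
        self.clock += 1
        return (self.clock, self.pid)            # the timestamp travels with the message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts[0]) + 1
        return (self.clock, self.pid)

p, q = Process("p"), Process("q")
a = p.send()          # p sends a request
b = q.receive(a)      # q receives it: b is causally after a
c = q.send()
print(a, b, c)        # (1, 'p') (2, 'q') (3, 'q'): a < b < c in timestamp order

Breaking timestamp ties by process id turns this partial order into a total order that every server can compute in the same way, which is one way such an order could then be used to execute Inputs consistently.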

Optimizing Causal & Consensus Ordering

In some cases additional information is available (such as real-time clocks). In these cases, it is possible to achieve more efficient causal or consensus ordering for the Inputs, with a reduced number of messages, fewer message rounds, or smaller message sizes. See the references for details.[10][3][5][11]

Further optimizations are available when the semantics of State Machine operations are accounted for (such as Read vs. Write operations); see the references on Generalized Paxos.[1][12]

26.2.2 Sending Outputs

Client requests are interpreted as Inputs to the State Machine, and processed into Outputs in the appropriate order. Each replica will generate an Output independently. Non-faulty replicas will always produce the same Output. Before the client response can be sent, faulty Outputs must be filtered out. Typically, a majority of the Replicas will return the same Output, and this Output is sent as the response to the client.

26.2.3 System Failure

If there is no majority of replicas with the same Output, or if fewer than a majority of replicas return an Output, a system failure has occurred. The client response must be the unique Output: FAIL.
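A minimal sketch of this output filtering; the function name, the FAIL sentinel and the way replies are counted against the total number of replicas are assumptions made for the example.

from collections import Counter

FAIL = object()   # unique sentinel for a system failure

def filter_outputs(replica_outputs, n_replicas):
    """Return the majority Output, or FAIL if no Output reaches a strict majority of n_replicas."""
    if not replica_outputs:
        return FAIL
    output, count = Counter(replica_outputs).most_common(1)[0]
    return output if count > n_replicas / 2 else FAIL

print(filter_outputs(["ok", "ok", "bad"], 3))          # 'ok': 2 of 3 replicas agree
print(filter_outputs(["ok", "bad"], 5) is FAIL)        # True: too few replies for a majority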

26.2.4 Auditing and Failure Detection

The permanent, unplanned compromise of a replica is called a Failure. Proof of failure is difficult to obtain, as the replica may simply be slow to respond,[13] or even lie about its status.[4] Non-faulty replicas will always contain the same State and produce the same Outputs. This invariant enables failure detection by comparing States and Outputs of all replicas. Typically, a replica with State or Output which differs from the majority of replicas is declared faulty. A common implementation is to pass checksums of the current replica State and recent Outputs among servers. An Audit process at each server restarts the local replica if a deviation is detected.[14] Cryptographic security is not required for checksums. It is possible that the local server is compromised, or that the Audit process is faulty, and the replica continues to operate incorrectly. This case is handled safely by the Output filter described previously (see Sending Outputs).

26.3 Appendix: Extensions

26.3.1 Input Log

In a system with no failures, the Inputs may be discarded after being processed by the State Machine. Realistic deployments must compensate for transient non-failure behaviors of the system such as message loss, network partitions, and slow processors.[14] One technique is to store the series of Inputs in a log. During times of transient behavior, replicas may request copies of a log entry from another replica in order to fill in missing Inputs.[6] In general the log is not required to be persistent (it may be held in memory). A persistent log may compensate for extended transient periods, or support additional system features such as Checkpoints and Reconfiguration.

26.3.2 Checkpoints

If left unchecked, a log will grow until it exhausts all available storage resources. For continued operation, it is necessary to forget log entries. In general a log entry may be forgotten when its contents are no longer relevant (for instance, if all replicas have processed an Input, the knowledge of the Input is no longer needed). A common technique to control log size is to store a duplicate State (called a Checkpoint), then discard any log entries which contributed to the checkpoint. This saves space when the duplicated State is smaller than the size of the log. Checkpoints may be added to any State Machine by supporting an additional Input called CHECKPOINT. Each replica maintains a checkpoint in addition to the current State value. When the log grows large, a replica submits the CHECKPOINT command just like a client request. The system will ensure non-faulty replicas process this command in the same order, after which all log entries before the checkpoint may be discarded. In a system with checkpoints, requests for log entries occurring before the checkpoint are ignored. Replicas which cannot locate copies of a needed log entry are faulty and must re-join the system (see Reconfiguration).
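The bookkeeping described above might look roughly as follows; the Replica class, its field names and the dictionary-based State are assumptions made for the example.

class Replica:
    """Keeps an Input log and discards entries once a CHECKPOINT has been processed."""
    def __init__(self):
        self.state = {}          # current State of the state machine
        self.checkpoint = {}     # duplicate State taken at the last CHECKPOINT
        self.log = []            # Inputs processed since the last checkpoint

    def process(self, inp):
        if inp == "CHECKPOINT":
            self.checkpoint = dict(self.state)   # save a copy of the current State
            self.log.clear()                     # earlier entries are no longer needed
        else:
            key, value = inp
            self.state[key] = value              # apply the Input to the State
            self.log.append(inp)

r = Replica()
for inp in [("x", 1), ("y", 2), "CHECKPOINT", ("x", 3)]:
    r.process(inp)
print(r.checkpoint, r.log)   # {'x': 1, 'y': 2} [('x', 3)]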

26.3.3 Reconfiguration

Reconfiguration allows replicas to be added and removed from a system while client requests continue to be processed. Planned maintenance and replica failure are common examples of reconfiguration. Reconfiguration involves Quitting and Joining.

26.3.4 Quitting

When a server detects its State or Output is faulty (see Auditing and Failure Detection), it may selectively exit the system. Likewise, an administrator may manually execute a command to remove a replica for maintenance. A new Input is added to the State Machine called QUIT.[1][5] A replica submits this command to the system just like a client request. All non-faulty replicas remove the quitting replica from the system upon processing this Input. During this time, the replica may ignore all protocol messages. If a majority of non-faulty replicas remain, the quit is successful. If not, there is a System Failure.

26.3.5 Joining

After quitting, a failed server may selectively restart or re-join the system. Likewise, an administrator may add a new replica to the group for additional capacity. A new Input is added to the State Machine called JOIN. A replica submits this command to the system just like a client request. All non-faulty replicas add the joining node to the system upon processing this Input. A new replica must be up-to-date on the system’s State before joining (see State Transfer).

26.3.6 State Transfer

When a new replica is made available or an old replica is restarted, it must be brought up to the current State before processing Inputs (see Joining). Logically, this requires applying every Input from the dawn of the system in the appropriate order. Typical deployments short-circuit the logical flow by performing a State Transfer of the most recent Checkpoint (see Checkpoints). This involves directly copying the State of one replica to another using an out-of-band protocol. A checkpoint may be large, requiring an extended transfer period. During this time, new Inputs may be added to the log. If this occurs, the new replica must also receive the new Inputs and apply them after the checkpoint is received. Typical deployments add the new replica as an observer to the ordering protocol before beginning the state transfer, allowing the new replica to collect Inputs during this period.

Optimizing State Transfer

Common deployments reduce state transfer times by sending only State components which differ. This requires knowledge of the State Machine internals. Since state transfer is usually an out-of-band protocol, this assumption is not difficult to achieve.

Compression is another feature commonly added to state transfer protocols, reducing the size of the total transfer.

26.3.7 Leader Election (for Paxos)

Paxos[6] is a protocol for solving consensus, and may be used as the protocol for implementing Consensus Order. Paxos requires a single leader to ensure liveness.[6] That is, one of the replicas must remain leader long enough to achieve consensus on the next operation of the state machine. System behavior is unaffected if the leader changes after every instance, or if the leader changes multiple times per instance. The only requirement is that one replica remain leader long enough to move the system forward.

Conflict Resolution

In general, a leader is necessary only when there is disagreement about which operation to perform,[11] and if those operations conflict in some way (for instance, if they do not commute).[12]

When conflicting operations are proposed, the leader acts as the single authority to set the record straight, defining an order for the operations, allowing the system to make progress.

With Paxos, multiple replicas may believe they are leaders at the same time. This property makes Leader Election for Paxos very simple, and any algorithm which guarantees an 'eventual leader' will work.

26.4 Historical background

Leslie Lamport was the first to propose the state machine approach, in his seminal 1984 paper on “Using Time Instead of Timeout In Distributed Systems”. Fred Schneider later elaborated the approach in his paper “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial”. Ken Birman developed the virtual synchrony model in a series of papers published between 1985 and 1987. The primary reference to this work is “Exploiting Virtual Synchrony in Distributed Systems”, which describes the Isis Toolkit, a system that was used to build the New York and Swiss Stock Exchanges, the French Air Traffic Control System, the US Navy AEGIS Warship, and other applications. Recent work by Miguel Castro and Barbara Liskov used the state machine approach in what they call a “Practical Byzantine fault tolerance” architecture that replicates especially sensitive services using a version of Lamport’s original state machine approach, but with optimizations that substantially improve performance. Most recently, the BFT-SMaRt library[15] was created: a high-performance Byzantine fault-tolerant state machine replication library developed in Java. This library implements a protocol very similar to PBFT’s, plus complementary protocols which offer state transfer and on-the-fly reconfiguration of hosts (i.e., JOIN and LEAVE operations). BFT-SMaRt is the most recent effort to implement state machine replication and is still being actively maintained. Raft, a consensus-based algorithm, was developed in 2013.

26.5 References

[1] Lamport, Leslie (1978). “The Implementation of Reliable Distributed Multiprocess Systems”. Computer Networks 2: 95–114. doi:10.1016/0376-5075(78)90045-4. Retrieved 2008-03-13.

[2] Lamport, Leslie (2004). “Lower Bounds for Asynchronous Consensus”.

[3] Lamport, Leslie; Mike Massa (2004). “Cheap Paxos”. Proceedings of the International Conference on Dependable Systems and Networks (DSN 2004).

[4] Lamport, Leslie; Robert Shostak; Marshall Pease (July 1982). “The Byzantine Generals Problem”. ACM Transactions on Programming Languages and Systems 4 (3): 382–401. doi:10.1145/357172.357176. Retrieved 2007-02-02.

[5] Lamport, Leslie (1984). “Using Time Instead of Timeout for Fault-Tolerant Distributed Systems”. ACM Transactions on Programming Languages and Systems 6 (2): 254–280. doi:10.1145/2993.2994. Retrieved 2008-03-13.

[6] Lamport, Leslie (May 1998). “The Part-Time Parliament”. ACM Transactions on Computer Systems 16 (2): 133–169. doi:10.1145/279227.279229. Retrieved 2007-02-02.

[7] Birman, Kenneth; Thomas Joseph (1987). “Exploiting virtual synchrony in distributed systems”. Proceedings of the 11th ACM Symposium on Operating systems principles (SOSP) 21 (5): 123. doi:10.1145/37499.37515. Retrieved 2008-03-13.

[8] Lampson, Butler (1996). “How to Build a Highly Available System Using Consensus”. Retrieved 2008-03-13.

[9] Lamport, Leslie (July 1978). “Time, Clocks and the Ordering of Events in a Distributed System”. Communications of the ACM 21 (7): 558–565. doi:10.1145/359545.359563. Retrieved 2007-02-02.

[10] Schneider, Fred (1990). “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial” (PS). ACM Computing Surveys 22 (4): 299. doi:10.1145/98163.98167.

[11] Lamport, Leslie (2005). “Fast Paxos”.

[12] Lamport, Leslie (2005). “Generalized Consensus and Paxos”.

[13] Fischer, Michael J.; Nancy A. Lynch; Michael S. Paterson (1985). “Impossibility of Distributed Consensus with One Faulty Process”. Journal of the Association for Computing Machinery 32 (2): 347–382. doi:10.1145/3149.214121. Retrieved 2008-03-13.

[14] Chandra, Tushar; Robert Griesemer; Joshua Redstone (2007). “Paxos Made Live – An Engineering Perspective” (PDF). PODC '07: 26th ACM Symposium on Principles of Distributed Computing.

[15] BFT-SMaRt. Google Code repository for the BFT-SMaRt replication library.

26.6 External links

• Replicated state machines video on MIT TechTV

• Apache Bookkeeper, a replicated log service which can be used to build replicated state machines

Chapter 27

Superstabilization

Superstabilization is a concept of fault-tolerance in distributed computing. Superstabilizing distributed algorithms combine the features of self-stabilizing algorithms and dynamic algorithms. A superstabilizing algorithm – just like any other self-stabilizing algorithm – can be started in an arbitrary state, and it will eventually converge to a legitimate state. Additionally, a superstabilizing algorithm will recover fast from a single change in the network topology (adding or removing one edge or node in the network). Any self-stabilizing algorithm recovers from a change in the network topology – the system configuration after a topology change can be treated just like any other arbitrary starting configuration. However, in a self-stabilizing algorithm, the convergence after a single change in the network topology may be as slow as the convergence from an arbitrary starting state. In the study of superstabilizing algorithms, special attention is paid to the time it takes to recover from a single change in the network topology.

27.1 Definitions

The stabilization time of a superstabilizing algorithm is defined exactly as in the case of a self-stabilizing algorithm: how long it takes to converge to a legitimate state from an arbitrary configuration. Depending on the computational model, time is measured, e.g., in synchronous communication rounds or in asynchronous cycles. The superstabilization time is the time to recover from a single topology change. It is assumed that the system is initially in a legitimate configuration. Then the network topology is changed; the superstabilization time is the maximum time it takes for the system to reach a legitimate configuration again. Similarly, the adjustment measure is the maximum number of nodes that have to change their state after such a change. The “almost-legitimate configurations” that occur after one topology change can be formally modelled by using passage predicates: a passage predicate is a predicate that holds after a single change in the network topology, and also during the convergence to a legitimate configuration.

27.2 References

• Dolev, Shlomi; Herman, Ted (1997), “Superstabilizing protocols for dynamic distributed systems”, Chicago Journal of Theoretical Computer Science, article 4.

• Dolev, Shlomi (2000), Self-Stabilization, MIT Press, ISBN 0-262-04178-2, Section 7.1.

Chapter 28

Terminating Reliable Broadcast

Terminating Reliable Broadcast (TRB) is a problem in distributed computing that encapsulates the task of broadcasting a message to a set of receiving processes in the presence of faults.[1] In particular, the sender and any other process might fail (“crash”) at any time.

28.1 Problem description

A TRB protocol typically organizes the system into a sending process and a set of receiving processes, which may include the sender itself. A process is called “correct” if it does not fail at any point during its execution. The goal of the protocol is to transfer data (the “message”) from the sender to the set of receiving processes. A process may perform many I/O operations during protocol execution, but eventually “delivers” a message by passing it to the application on that process that invoked the TRB protocol. The protocol must provide important guarantees to the receiving processes. All correct receiving processes, for example, must deliver the sender’s message if the sender is also correct. A receiving process may deliver a special message, SF (“sender faulty”), if the sender failed, but either all correct processes will deliver SF or none will. A correct process is therefore guaranteed that data delivered to it was also delivered to all other correct processes. More precisely, a TRB protocol must satisfy the four formal properties below.

• Termination: every correct process delivers some value.

• Validity: if the sender is correct and broadcasts a message m, then every correct process delivers m.

• Integrity: a process delivers a message at most once, and if it delivers some message m ≠ SF, then m was broadcast by the sender.

• Agreement: if a correct process delivers a message m, then all correct processes deliver m.

The presence of faults in the system makes these properties more difficult to satisfy. A simple but invalid TRB protocol might have the sender broadcast the message to all processes, and have receiving processes deliver the message as soon as it is received. This protocol, however, does not satisfy agreement if faults can occur: if the sender crashes after sending the message to some processes, but before sending it to others, then the first set of processes may deliver the message while the second set delivers SF. TRB is closely related, but not identical, to the fundamental distributed computing problem of consensus.
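The following sketch simulates that invalid protocol with the sender crashing partway through its sends, showing how Agreement is violated; the process names and the crash point are assumptions chosen for the illustration.

def naive_broadcast(message, receivers, crash_after):
    """Sender sends one by one and crashes after crash_after sends; each receiver delivers
    the message if it got one, otherwise SF. This violates Agreement."""
    delivered = {}
    for k, proc in enumerate(receivers):
        delivered[proc] = message if k < crash_after else "SF"
    return delivered

print(naive_broadcast("m", ["p1", "p2", "p3"], crash_after=1))
# {'p1': 'm', 'p2': 'SF', 'p3': 'SF'}: correct processes deliver different values.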

28.2 References

[1] Alvisi, Lorenzo (2006). “Consensus and Reliable Broadcast” (PDF). Retrieved 2006-05-21.

Chapter 29

Timing failure

Timing failure is a failure of a process, or part of a process, in a synchronous distributed system or real-time system to meet limits set on execution time, message delivery, clock drift rate, or clock skew. Asynchronous distributed systems cannot be said to have timing failures as guarantees are not provided for response times.

Chapter 30

Transitive data skew

In distributed computing problems, transitive data skew is an issue of data synchronization. It arises with the uneven distribution of otherwise evenly distributed data across a number of devices while the data is in transition. If sorted data is being distributed across multiple devices and the column on which that data is sorted is the “key” used to identify the target device, the resulting transitive data skew may be self-correcting.

Chapter 31

Two Generals’ Problem

In computing, the Two Generals’ Problem is a thought experiment meant to illustrate the pitfalls and design challenges of attempting to coordinate an action by communicating over an unreliable link. It is related to the more general Byzantine Generals Problem (though published long before that later generalization) and appears often in introductory classes about computer networking (particularly with regard to the Transmission Control Protocol), though it can also apply to other types of communication. A key concept in epistemic logic, this problem highlights the importance of common knowledge. Some authors also refer to it as the Two Generals Paradox, the Two Armies Problem, or the Coordinated Attack Problem.[1][2]

31.1 Definition

Two armies, each led by a general, are preparing to attack a fortified city. The armies are encamped near the city, each on its own hill. A valley separates the two hills, and the only way for the two generals to communicate is by sending messengers through the valley. Unfortunately, the valley is occupied by the city’s defenders and there’s a chance that any given messenger sent through the valley will be captured.


Positions of the armies. Armies A1 and A2 need to communicate but their messengers may be captured by army B.

While the two generals have agreed that they will attack, they haven't agreed upon a time for the attack. It is required that the two generals have their armies attack the city at the same time in order to succeed, else the lone attacking army will die trying. They must thus communicate with each other to decide on a time to attack and to agree to attack at that time, and each general must know that the other general knows that they have agreed to the attack plan. Because acknowledgement of message receipt can be lost as easily as the original message, a potentially infinite series of messages is required to come to consensus. The thought experiment involves considering how they might go about coming to consensus. In its simplest form, one general is known to be the leader, decides on the time of attack, and must communicate this time to the other general. The problem is to come up with algorithms that the generals can use, including sending messages and processing received messages, that can allow them to correctly conclude:

Yes, we will both attack at the agreed-upon time.


Allowing that it is quite simple for the generals to come to an agreement on the time to attack (i.e. one successful message with a successful acknowledgement), the subtlety of the Two Generals’ Problem is in the impossibility of designing algorithms for the generals to use to safely agree to the above statement.

31.2 Illustrating the problem

The first general may start by sending a message “Attack at 0900 on August 4.” However, once dispatched, the first general has no idea whether or not the messenger got through. This uncertainty may lead the first general to hesitate to attack due to the risk of being the sole attacker. To be sure, the second general may send a confirmation back to the first: “I received your message and will attack at 0900 on August 4.” However, the messenger carrying the confirmation could face capture and the second general may hesitate, knowing that the first might hold back without the confirmation. Further confirmations may seem like a solution - let the first general send a second confirmation: “I received your confirmation of the planned attack at 0900 on August 4.” However, this new messenger from the first general is liable to be captured too. Thus it quickly becomes evident that no matter how many rounds of confirmation are made, there is no way to guarantee the second requirement that each general be sure the other has agreed to the attack plan. Whichever general sends the final messenger will always be left wondering whether the messenger got through.

31.3 Proof

31.3.1 For deterministic protocols with a fixed number of messages

Suppose there is any fixed-length sequence of messages, some successfully delivered and some not, that suffice to meet the requirement of shared certainty for both generals to attack. In that case there must be some minimal non-empty subset of the successfully delivered messages that suffices (at least one message with the time/plan must be delivered). Consider the last such message that was successfully delivered in such a minimal sequence. If that last message had not been successfully delivered then the requirement wouldn't have been met, and one general at least (presumably the receiver) would decide not to attack. From the viewpoint of the sender of that last message, however, the sequence of messages sent and delivered is exactly the same as it would have been, had that message been delivered. Therefore the general sending that last message will still decide to attack (since the protocol is deterministic). We've now constructed a circumstance where the purported protocol leads one general to attack and the other not to attack - contradicting the assumption that the protocol was a solution to the problem.

31.3.2 For nondeterministic and variable-length protocols

Such a protocol can be modeled as a labeled finite forest, where each node represents a run of the protocol up to a specified point. The roots are labeled with the possible starting messages, and the children of a node N are labeled with the possible next messages after N. Leaf nodes represent runs in which the protocol terminates after sending the message the node is labeled with. The empty forest represents the protocol that terminates before sending any message. Let P be a protocol that solves the Two Generals problem. Then, by a similar argument to the one used for fixed-length protocols above, P' must also solve the Two Generals’ problem, where P' is obtained from P by removing all leaf nodes. Since P is finite, it follows that the protocol represented by the empty forest solves the Two Generals’ problem. But clearly it does not, contradicting the existence of P.

31.4 Engineering approaches

A pragmatic approach to dealing with the Two Generals’ Problem is to use schemes that accept the uncertainty of the communications channel and not attempt to eliminate it, but rather mitigate it to an acceptable degree. For example, the first general could send 100 messengers, anticipating that the probability of all being captured is low. With this approach the first general will attack no matter what, and the second general will attack if any message is received.

Alternatively the first general could send a stream of messages and the second general could send acknowledgments to each, with each general feeling more comfortable with every message received. As seen in the proof, however, neither can be certain that the attack will be coordinated. There is no algorithm that they can use (e.g. attack if more than four messages are received) which will be certain to prevent one from attacking without the other. Also, the first general can mark each message as being message 1, 2, 3 ... of n. This method allows the second general to estimate how reliable the channel is and send an appropriate number of messages back to ensure a high probability of at least one message being received. If the channel can be made reliable, then one message suffices and additional messages do not help: the last is as likely to get lost as the first.

Assuming that the generals must sacrifice lives every time a messenger is sent and intercepted, an algorithm can be designed to minimize the number of messengers required to achieve the maximum amount of confidence that the attack is coordinated. To save them from sacrificing hundreds of lives to achieve very high confidence in coordination, the generals could agree to use the absence of messengers as an indication that the general who began the transaction has received at least one confirmation and has promised to attack. Suppose it takes a messenger 1 minute to cross the danger zone; then allowing 200 minutes of silence after confirmations have been received achieves extremely high confidence while not sacrificing messenger lives. In this case messengers are used only when a party has not received the attack time. At the end of 200 minutes, each general can reason: “I have not received an additional message for 200 minutes; either 200 messengers failed to cross the danger zone, or the other general has confirmed and committed to the attack and has faith I will too.”
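The “many messengers” mitigation lends itself to a quick back-of-the-envelope check. The following Monte Carlo sketch is illustrative only and is not from the article; the capture probability p_capture, the messenger counts, and the function names are assumptions chosen for the example. It shows how the chance of an uncoordinated attack shrinks exponentially with the number of messengers while never reaching zero, in line with the impossibility proof.

import random

# Illustrative sketch: the first general always attacks, and the second attacks
# if at least one of the n messengers survives the danger zone.
# p_capture and the trial count are assumed values for demonstration.

def coordinated(n_messengers, p_capture):
    """One trial: does at least one messenger get through?"""
    return any(random.random() > p_capture for _ in range(n_messengers))

def estimate_coordination(n_messengers, p_capture, trials=20_000):
    """Fraction of trials in which both generals end up attacking together."""
    return sum(coordinated(n_messengers, p_capture) for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, estimate_coordination(n, p_capture=0.5))
    # The exact failure probability is p_capture ** n, so it can be made
    # arbitrarily small but never zero.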

31.4.1 Using Bitcoin

Although there is still no solution to the paradox itself, Bitcoin provides the first practical method of addressing the problem in the real world. Bitcoin’s inventor Satoshi Nakamoto wrote:

“A number of Byzantine Generals each have a computer and want to attack the King’s wi-fi by brute forcing the password, which they've learned is a certain number of characters in length. Once they stimulate the network to generate a packet, they must crack the password within a limited time to break in and erase the logs, otherwise they will be discovered and get in trouble. They only have enough CPU power to crack it fast enough if a majority of them attack at the same time. They don't particularly care when the attack will be, just that they all agree. It has been decided that anyone who feels like it will announce a time, and whatever time is heard first will be the official attack time. The problem is that the network is not instantaneous, and if two generals announce different attack times at close to the same time, some may hear one first and others hear the other first.

They use a proof-of-work chain to solve the problem. Once each general receives whatever attack time he hears first, he sets his computer to solve an extremely difficult proof-of-work problem that includes the attack time in its hash. The proof-of-work is so difficult, it’s expected to take 10 minutes of them all working at once before one of them finds a solution. Once one of the generals finds a proof-of-work, he broadcasts it to the network, and everyone changes their current proof-of-work computation to include that proof-of-work in the hash they're working on. If anyone was working on a different attack time, they switch to this one, because its proof-of-work chain is now longer.

After two hours, one attack time should be hashed by a chain of 12 proofs-of-work. Every general, just by verifying the difficulty of the proof-of-work chain, can estimate how much parallel CPU power per hour was expended on it and see that it must have required the majority of the computers to produce that much proof-of-work in the allotted time. They had to all have seen it because the proof-of-work is proof that they worked on it. If the CPU power exhibited by the proof-of-work chain is sufficient to crack the password, they can safely attack at the agreed time.”[3]
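As a rough illustration of the proof-of-work idea in the quote, the sketch below repeatedly hashes an attack time, the previous proof, and a nonce until the digest falls below a difficulty target. It is not Bitcoin’s actual code; the difficulty value, the string format, and the function name are assumptions made for the example.

import hashlib

def proof_of_work(attack_time, prev_proof, difficulty_bits=20):
    """Find a nonce such that SHA-256(prev_proof | attack_time | nonce) has
    roughly `difficulty_bits` leading zero bits; verification is a single hash."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{prev_proof}|{attack_time}|{nonce}".encode()).hexdigest()
        if int(digest, 16) < target:
            return nonce, digest
        nonce += 1

# Each general extends the chain for the attack time it heard first; the chain
# with the most accumulated work wins, so everyone converges on one time.
nonce, digest = proof_of_work("0900 on August 4", prev_proof="genesis")
print(nonce, digest)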

31.5 History

The Two Generals’ Problem and its impossibility proof were first published by E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber in 1975 in “Some Constraints and Trade-offs in the Design of Network Communications”,[4] where it is described starting on page 73 in the context of communication between two groups of gangsters. The problem was given the name the Two Generals Paradox by Jim Gray[5] in 1978 in “Notes on Data Base Operating Systems”,[6] starting on page 465. This reference is widely given as a source for the definition of the problem and the impossibility proof, though both were published previously as noted above.

31.6 References

[1] Gmytrasiewicz, Piotr J.; Edmund H. Durfee (1992). “Decision-theoretic recursive modeling and the coordinated attack problem”. Proceedings of the first international conference on Artificial intelligence planning systems (San Francisco: Morgan Kaufmann Publishers): 88–95. Retrieved 27 December 2013.

[2] Alessandro Panconesi. “The coordinated attack and the jealous amazons”. Retrieved 2011-05-17.

[3] https://socrates1024.s3.amazonaws.com/consensus.pdf

[4] “Some constraints and trade-offs in the design of network communications” (PDF). Portal.acm.org. doi:10.1145/800213.806523. Retrieved 2010-03-19.

[5] “Jim Gray Summary Home Page”. Research.microsoft.com. 2004-05-03. Retrieved 2010-03-19.

[6] “Notes on Data Base Operating Systems”. Portal.acm.org. Retrieved 2010-03-19.

Chapter 32

Uniform consensus

In computer science, uniform consensus is a distributed computing problem similar to the consensus problem, with one additional condition: no two processes (whether faulty or not) decide differently. More specifically, the problem is defined by the following requirements (a sketch contrasting these conditions with plain consensus follows the list):

• Each process has an input and must decide on an output (one-shot problem).
• Uniform Agreement: every two decisions are the same.
• Validity: every decision is the input of one of the processes.
• Termination: eventually all correct processes decide.
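To make the difference from ordinary consensus concrete, here is a minimal property-checking sketch. It is not from the article; the record layout (process id, decided value, correctness flag) and the function names are assumptions chosen for the illustration. Ordinary consensus constrains only the decisions of correct processes, while uniform agreement constrains every process that decides.

# Each decision is recorded as (process_id, value, is_correct), where is_correct
# marks whether the process is non-faulty. These names are illustrative.

def satisfies_agreement(decisions):
    """Ordinary consensus: only the correct processes must decide the same value."""
    return len({v for _, v, ok in decisions if ok}) <= 1

def satisfies_uniform_agreement(decisions):
    """Uniform consensus: every process that decides (faulty or not) decides the same value."""
    return len({v for _, v, _ in decisions}) <= 1

# Example: a faulty process decides 0 while the correct ones decide 1.
run = [("p1", 1, True), ("p2", 1, True), ("p3", 0, False)]
assert satisfies_agreement(run)              # allowed by plain consensus
assert not satisfies_uniform_agreement(run)  # violates uniform agreement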

32.1 References

• Charron-Bost, Bernadette; Schiper, André (April 2004). “Uniform consensus is harder than consensus”. Journal of Algorithms 51 (1): 15–37. doi:10.1016/j.jalgor.2003.11.001.

Chapter 33

Version vector

A version vector is a mechanism for tracking changes to data in a distributed system, where multiple agents might update the data at different times. The version vector allows the participants to determine if one update preceded another (happened-before), followed it, or if the two updates happened concurrently (and therefore might conflict with each other). In this way, version vectors enable causality tracking among data replicas and are a basic mechanism for optimistic replication. In mathematical terms, the version vector generates a preorder that tracks the events that precede, and may therefore influence, later updates. Version vectors maintain state identical to that in a vector clock, but the update rules differ slightly. In the rules below, replicas can either experience local updates (e.g., the user editing a file on the local node) or can synchronize with another replica:

• Initially all vector counters are zero.
• Each time a replica experiences a local update event, it increments its own counter in the vector by one.
• Each time two replicas a and b synchronize, they both set the elements in their copy of the vector to the maximum of the element across both counters: Va[x] = Vb[x] = max(Va[x], Vb[x]). After synchronization, the two replicas have identical version vectors.

Pairs of replicas, a, b, can be compared by inspecting their version vectors and determined to be either: identical (a = b), concurrent (a ∥ b), or ordered (a < b or b < a). The ordered relation is defined as: a < b if and only if every element of Va is less than or equal to its corresponding element in Vb, and at least one of the elements is strictly less. If neither a < b nor b < a holds, but the vectors are not identical, then the two vectors must be concurrent. Version vectors[1] or variants are used to track updates in many distributed file systems, such as Coda (file system) and Ficus, and are the main data structure behind optimistic replication.[2]
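The update and comparison rules above translate directly into a few lines of code. The sketch below is illustrative only and not from the article; the dictionary representation, the class name, and the method names are assumptions made for the example.

class Replica:
    """A replica holding a version vector keyed by replica id."""
    def __init__(self, name, peers):
        self.name = name
        self.vv = {p: 0 for p in peers}      # initially all counters are zero

    def local_update(self):
        self.vv[self.name] += 1              # a local update bumps the own counter

    def sync(self, other):
        for k in self.vv:                    # element-wise maximum on both sides
            m = max(self.vv[k], other.vv[k])
            self.vv[k] = other.vv[k] = m

def compare(va, vb):
    """Classify two version vectors: 'identical', 'a<b', 'b<a', or 'concurrent'."""
    a_le_b = all(va[k] <= vb[k] for k in va)
    b_le_a = all(vb[k] <= va[k] for k in va)
    if a_le_b and b_le_a:
        return "identical"
    if a_le_b:
        return "a<b"
    if b_le_a:
        return "b<a"
    return "concurrent"

# Example: independent edits on a and b are concurrent until the replicas sync.
a, b = Replica("a", ["a", "b"]), Replica("b", ["a", "b"])
a.local_update(); b.local_update()
print(compare(a.vv, b.vv))   # concurrent
a.sync(b)
print(compare(a.vv, b.vv))   # identical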

33.1 Other Mechanisms

• Hash Histories [3] avoid the use of counters by keeping a set of hashes of each updated version and comparing those sets by set inclusion. However, this mechanism can only give probabilistic guarantees.
• Concise Version Vectors [4] allow significant space savings when handling multiple replicated items, such as directory structures in filesystems.
• Version Stamps [5] allow tracking of a variable number of replicas and do not resort to counters. This mechanism can exhibit scalability problems in some settings, but can be replaced by Interval Tree Clocks.
• Interval Tree Clocks [6] generalize version vectors and vector clocks and allow dynamic numbers of replicas/processes.
• Bounded Version Vectors [7] allow a bounded implementation, with bounded-size counters, as long as replica pairs can be atomically synchronized.


• Dotted Version Vectors [8] address scalability with a small set of servers mediating replica access by a large number of concurrent clients.

33.2 References

[1] Douglas Parker, Gerald Popek, Gerard Rudisin, Allen Stoughton, Bruce Walker, Evelyn Walton, Johanna Chow, David Edwards, Stephen Kiser, and Charles Kline. Detection of mutual inconsistency in distributed systems. Transactions on Software Engineering. 1983

[2] David Ratner, Peter Reiher, and Gerald Popek. Dynamic version vector maintenance. Technical Report CSD-970022, Department of Computer Science, University of California, Los Angeles, 1997

[3] ByungHoon Kang, Robert Wilensky, and John Kubiatowicz. The Hash History Approach for Reconciling Mutual Inconsistency. ICDCS, pp. 670-677, IEEE Computer Society, 2003.

[4] Dahlia Malkhi and Doug Terry. Concise Version Vectors in WinFS. Distributed Computing, Vol. 20, 2007.

[5] Paulo Almeida, Carlos Baquero and Victor Fonte. Version Stamps: Decentralized Version Vectors. ICDCS, pp. 544-551, 2002.

[6] Paulo Almeida, Carlos Baquero and Victor Fonte. Interval Tree Clocks. OPODIS, Lecture Notes in Computer Science, Vol. 5401, pp. 259-274, Springer, 2008.

[7] José Almeida, Paulo Almeida and Carlos Baquero. Bounded Version Vectors. DISC: International Symposium on Distributed Computing, LNCS, 2004.

[8] Nuno Preguiça, Carlos Baquero, Paulo Almeida, Victor Fonte and Ricardo Gonçalves. Brief Announcement: Efficient Causality Tracking in Distributed Storage Systems With Dotted Version Vectors. ACM PODC, pp. 335-336, 2012.

Chapter 34

Weak coloring

Weak 2-coloring.

In graph theory, a weak coloring is a special case of a graph labeling. A weak k-coloring of a graph G = (V, E) assigns a color c(v) ∈ {1, 2, ..., k} to each vertex v ∈ V, such that each non-isolated vertex is adjacent to at least one vertex with a different color. In notation, for each non-isolated v ∈ V, there is a vertex u ∈ V with {u, v} ∈ E and c(u) ≠ c(v). The figure above shows a weak 2-coloring of a graph. Each dark vertex (color 1) is adjacent to at least one light vertex (color 2) and vice versa.

34.1 Properties

A graph vertex coloring is a weak coloring, but not necessarily vice versa. Every graph has a weak 2-coloring. The figure “Constructing a weak 2-coloring” (below) illustrates a simple algorithm for constructing one in an arbitrary graph. Part (a) shows the original graph. Part (b) shows a breadth-first search tree of the same graph. Part (c) shows how to color the tree: starting from the root, the layers of the tree are colored alternately with colors 1 (dark) and 2 (light). If there is no isolated vertex in the graph G, then a weak 2-coloring determines a domatic partition: the set of nodes with c(v) = 1 is a dominating set, and the set of nodes with c(v) = 2 is another dominating set.
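The breadth-first-search construction just described is easy to express in code. The following sketch is illustrative and not from the article; the adjacency-dictionary representation and the function name are assumptions. Every non-root vertex is adjacent to its BFS parent, which lies in the previous layer and therefore has the other color, so the result is a weak 2-coloring.

from collections import deque

def weak_2_coloring(adj):
    """Color every vertex 1 or 2 by coloring BFS layers alternately.
    `adj` maps each vertex to a list of its neighbours."""
    color = {}
    for root in adj:
        if root in color:
            continue
        color[root] = 1                      # root of a new BFS tree: color 1
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in color:
                    color[u] = 3 - color[v]  # next layer gets the other color
                    queue.append(u)
    return color

# Example: every non-isolated vertex is adjacent to a differently colored one.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3], 5: []}
c = weak_2_coloring(graph)
assert all(any(c[u] != c[v] for u in graph[v]) for v in graph if graph[v])
print(c)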

34.2 Applications

Historically, weak coloring served as the first non-trivial example of a graph problem that can be solved with a local algorithm (a distributed algorithm that runs in a constant number of synchronous communication rounds). More precisely, if the degree of each node is odd and bounded by a constant, then there is a constant-time distributed algorithm for weak 2-coloring.[1] This is different from (non-weak) vertex coloring: there is no constant-time distributed algorithm for vertex coloring; the best possible algorithms (for finding a minimal but not necessarily minimum coloring) require O(log* |V|) communication rounds.[1][2][3] Here log* x is the iterated logarithm of x.

34.3 References

[1] Naor, Moni; Stockmeyer, Larry (1995), “What can be computed locally?", SIAM Journal on Computing 24 (6): 1259–1277, doi:10.1137/S0097539793254571, MR 1361156.

[2] Linial, Nathan (1992), “Locality in distributed graph algorithms”, SIAM Journal on Computing 21 (1): 193–201, doi:10.1137/0221015, MR 1148825.

[3] Cole, Richard; Vishkin, Uzi (1986), “Deterministic coin tossing with applications to optimal parallel list ranking”, Information and Control 70 (1): 32–53, doi:10.1016/S0019-9958(86)80023-7, MR 853994.

Constructing a weak 2-coloring: (a) the original graph, (b) a breadth-first search tree, (c) the layers of the tree colored alternately.
