ANGELA: A Sparse, Distributed, and Highly Concurrent Merkle Tree
Alex Kazorian, Aneesh Khera, Cibi Pari, Janakirama Kalidhindi

Background

Motivation

Merkle Trees are binary hash trees where each leaf is determined by a cryptographic hash of a given piece of data, and each parent node is determined via a hash of the concatenation of its children. Merkle Trees are useful in verifying the integrity of data received over a network, which has given them increased importance in recent years.

The Key Transparency project, a public key infrastructure, uses Google Trillian, a sparse Merkle Tree implementation with transparent storage that pairs a user identifier in a leaf with a public key. When an update to a public key comes in, the entire tree must be locked down and updated serially, introducing massive bottlenecks. Outside of Key Transparency, this problem generalizes to other applications that need efficient, authenticated data structures, such as the Global Data Plane file system. Google Key Transparency requires the scale of a billion leaves with high query throughput, and any concurrency overhead must not destroy the system's performance when the Merkle Tree is very large.

[Figure 1: A Merkle tree. The Top Hash is hash(Hash 0 + Hash 1), each internal node such as Hash 0 is hash(Hash 0-0 + Hash 0-1), and the leaf hashes hash(L1)..hash(L4) cover data blocks L1-L4.]

Our Tree

Angela is a sparse, distributed, and highly concurrent Merkle Tree, and it is our approach to solving this problem. Literature on such sparse representations already exists under the name of Sparse Merkle Trees. Sparsity allows for lazy construction of and queries on our data, so the tree itself is more space efficient and theoretically allows easier coordination among the distributed clusters. Specific to our case, we assume an empty node that is unique to each level of the Merkle Tree. The tree is organized as a binary tree with 2^256 leaves, where most nodes are actually empty. A prefix is assigned to each node, with a 0 appended for a left child and a 1 appended for a right child, so that leaf locations are determined by their sorted key values. Google Key Transparency uses a Verifiable Random Function (VRF) to map usernames to a random index in the space of the leaves. This also gives us the interesting property that any query to the tree appears random and evenly distributed. For this project, we assume each key is given to us after passing through a VRF. Writes are batched together and processed in epochs: in our system, we wait for 1000 updates to arrive before processing them, although time-based epochs are also possible.

[Figure 3: A visual representation of how a sparse tree would look and what is actually stored.]

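Because every empty subtree at a given level hashes to the same value, the per-level empty digests can be precomputed once and shared. Below is a minimal sketch of that precomputation, assuming SHA-256, an empty leaf hashing to the digest of the empty string, and parent = H(left || right); the package and names are illustrative, not Angela's actual API.

    package angela

    import "crypto/sha256"

    const treeDepth = 256 // the tree has 2^256 leaves

    // emptyDigests()[d] is the digest of an empty subtree whose root sits d
    // levels above the leaves; every empty node on a level shares this value.
    func emptyDigests() [][32]byte {
        ds := make([][32]byte, treeDepth+1)
        ds[0] = sha256.Sum256(nil) // assumed digest of an empty leaf
        for d := 1; d <= treeDepth; d++ {
            // Parent of two identical empty children: H(child || child).
            ds[d] = sha256.Sum256(append(ds[d-1][:], ds[d-1][:]...))
        }
        return ds
    }
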
Algorithm

General Overview

Given a list of update transactions to our Merkle Tree, we would like to percolate these changes up the tree concurrently. To do so, we use the following steps (sketched in code after this list):

1) Sort the incoming transactions by key.
2) Pairwise over neighboring leaves, find all conflict points in the tree.
3) Begin multiple concurrent updates to the tree.
4) While updating, check whether each node is a point of conflict:
   a) if this update is the first to reach the conflict, stop updating and return;
   b) else, continue as normal.

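A compact sketch of this flow, continuing the illustrative package above; Update is a stand-in type, and findConflicts and percolate are elaborated in the next two sections:

    import (
        "sort"
        "sync"
    )

    // Update is one pending write: the leaf key (a VRF output, viewed here
    // as a bit string of '0'/'1' characters) and the new leaf value.
    type Update struct {
        Key   string
        Value []byte
    }

    // BatchInsert applies one epoch's worth of updates concurrently.
    func BatchInsert(updates []Update) {
        // 1) Sort the incoming transactions by key.
        sort.Slice(updates, func(i, j int) bool { return updates[i].Key < updates[j].Key })

        // 2) Pairwise, find the N-1 conflict points (next section).
        conflicts := findConflicts(updates)

        // 3) Begin one concurrent percolation per update.
        var wg sync.WaitGroup
        for _, u := range updates {
            wg.Add(1)
            go func(u Update) {
                defer wg.Done()
                percolate(u, conflicts) // 4) stop or continue at conflict points
            }(u)
        }
        wg.Wait()
    }
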

Sorting and Finding Conflicts

The first two steps consist of sorting the incoming transactions and then finding the locations at which each neighboring pair of update transactions conflict. Given N updates, we know those updates will meet at exactly N-1 conflict points; Figure 2 shows this. Because the updates are sorted, comparing each pair of neighboring leaves finds the closest conflict point between the two. In implementation, the longest matching prefix of the two keys gives the conflict location. By finding all points of conflict up front, we reduce the locking surface area from the entire tree (passing the entire path and copath) to just N-1 nodes for N update transactions. This greatly reduces the bottleneck induced by alternative locking methods while enabling concurrent updates to the tree.

[Figure 2: N sorted updates percolating upward meet at N-1 conflict points.]

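Continuing the sketch, the conflict point of two adjacent sorted leaves is the node addressed by the longest matching prefix of their keys (conflictMark is defined in the next sketch):

    // conflictPrefix returns the prefix addressing the lowest common ancestor
    // of two distinct leaves: the node where their percolation paths meet.
    func conflictPrefix(a, b string) string {
        i := 0
        for i < len(a) && i < len(b) && a[i] == b[i] {
            i++
        }
        return a[:i]
    }

    // findConflicts records the N-1 conflict points of N key-sorted updates;
    // each entry carries the mark used by percolating goroutines.
    func findConflicts(updates []Update) map[string]*conflictMark {
        conflicts := make(map[string]*conflictMark, len(updates)-1)
        for i := 1; i < len(updates); i++ {
            conflicts[conflictPrefix(updates[i-1].Key, updates[i].Key)] = &conflictMark{}
        }
        return conflicts
    }
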

Concurrent Updates

Now that we are aware of all points of conflict, we can begin percolating updates up the tree concurrently, with each thread percolating a single change at a time. After updating a node, a thread must decide whether it is safe to update that node's parent. Given the parent's id, we check whether it is inside the set of conflict points. If the parent is a conflict location, we acquire a lock on it and determine whether we are the first to reach the conflict point. If we are, we must wait for our sibling to take over the update of the remainder of the percolation path to the root: the thread leaves a mark that it reached the conflict, releases the lock, and stops updating, returning to begin a fresh waiting update. If, on the other hand, the update reaches the conflict point second, it knows its sibling has already been applied and it can continue updating safely. If the parent node is not a conflict location, then no other thread can be attempting to update it, and it is safe to proceed. We continue with this model until all updates have percolated up the tree.

Using this algorithm over the naive algorithm gave us greater than a 2x speedup, discussed further in the Benchmarks section.

[Figure 4: Concurrent percolation of updates past conflict points.]

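A sketch of one percolation goroutine's conflict protocol, continuing the package above (which already imports sync); updateNode stands in for recomputing a node's digest from its children:

    // conflictMark is the per-conflict-node state: a lock plus a flag
    // recording whether one sibling percolation has already arrived.
    type conflictMark struct {
        mu      sync.Mutex
        reached bool
    }

    // updateNode recomputes one node's digest from its children; in Angela
    // this is a read-modify-write against copath nodes cached from Aurora.
    func updateNode(prefix string) { /* elided */ }

    // percolate walks one update's path from its leaf toward the root. At a
    // conflict node, the first arriver leaves a mark and returns; the second
    // arriver continues upward, carrying both children's changes.
    func percolate(u Update, conflicts map[string]*conflictMark) {
        for prefix := u.Key; len(prefix) > 0; prefix = prefix[:len(prefix)-1] {
            updateNode(prefix)

            parent := prefix[:len(prefix)-1]
            mark, isConflict := conflicts[parent]
            if !isConflict {
                continue // no sibling writes the parent; safe to keep climbing
            }
            mark.mu.Lock()
            first := !mark.reached
            mark.reached = true
            mark.mu.Unlock()
            if first {
                return // our sibling will finish the shared path to the root
            }
            // Second to arrive: the sibling's subtree is final; continue up.
        }
        updateNode("") // finally, recompute the root digest
    }
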

Architecture

Overview

We use Ray for our distributed framework, Amazon EC2 instances for our clusters, and Amazon Aurora for our storage layer. Our server resides on an orchestrator node, which accepts incoming transactions and acts as a load balancer, distributing the workload across the different worker nodes of our tree.

Write Phase

Figure 5 illustrates how a write query is processed. If an update request arrives during the read phase, it is not processed immediately; instead, it is placed into a write buffer in sorted order. Given the number of Worker Nodes available, a small root tree is taken from the top of the Merkle Tree; this root tree should have the next power of 2 leaves above the number of Worker Nodes available. Each leaf of the root tree is the root of a much larger subtree, which is allocated to a Worker Node. This model is highly scalable, as launching more Worker Nodes directly translates to handling a greater number of updates in parallel. The Orchestrator Server first sends updates to the Worker Nodes. The Worker Nodes query the database for copaths and follow the algorithm described above. They write their updates back to the database and push their roots back to the Server. The server then finishes updating the root tree to mark the end of an epoch.

[Figure 5: The write path from the Orchestrator Server through the Worker Nodes to the database.]

Read Phase

Figure 6 describes how a read query propagates through our system. Clients initiate a read request to the Orchestrator Server. The Server load balances and dispatches the read request to a Ray Worker running on an EC2 instance. To complete the request, we must provide the digest values of all nodes on the co-path so that the client can verify the integrity of the requested data. The Ray Worker uses the requested ID to generate the list of needed co-path nodes and makes a call to the database. The co-path is returned to the server and finally to the client for verification.

[Figure 6: The read path from the client through the Orchestrator Server and a Ray Worker to the database.]

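Under the prefix addressing described earlier, a worker can derive the co-path node ids for a read with a simple prefix walk. The sketch below continues the illustrative package; Angela's actual query layer would batch these ids into database reads:

    // copathIDs returns the prefixes of the sibling nodes a verifier needs,
    // ordered from the leaf's sibling up to the children of the root.
    func copathIDs(key string) []string {
        ids := make([]string, 0, len(key))
        for prefix := key; len(prefix) > 0; prefix = prefix[:len(prefix)-1] {
            parent := prefix[:len(prefix)-1]
            if prefix[len(prefix)-1] == '0' {
                ids = append(ids, parent+"1") // left child: sibling is on the right
            } else {
                ids = append(ids, parent+"0") // right child: sibling is on the left
            }
        }
        return ids
    }
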
Benchmarks

Devices

We ran our algorithm benchmarks on a MacBook Pro with a 2.2 GHz Intel Core i7, 16 GB of RAM, and an Intel Iris Pro with 1536 MB. We also hypertuned three separate batch parameters on a laptop running Ubuntu 16.04 with a 2.2 GHz Intel Core i5 processor and 16 GB of RAM, connected to an Amazon Aurora DB cluster.

Algorithm

On the MacBook Pro, we measure the algorithmic performance of our BatchInsert implementation against the naive insertion implementation in Go. BatchInsert achieves approximately 2x speedup over the naive implementation through its locking scheme and its use of goroutines for concurrent computation. Since distribution and database calls are constant between the two, our algorithm performs better than the serial Insert approach.

[Figure 7: BatchInsert vs. naive serial insertion on the MacBook Pro.]

Hypertuning

In this benchmark, we hypertune for the optimal batchReadSize, batchPercolateSize, and batchWriteSize. The batchReadSize is the size of the read batches used when pulling necessary copath nodes into the cache before beginning our inserts. The batchPercolateSize determines the number of goroutines we kick off. The batchWriteSize gives the number of transactions the system waits to finish percolating before writing the changes to the database. Regarding batchPercolateSize, we notice optimal performance when each goroutine takes on 2% of the total insertion work, so for a batch insert of size 1000 we would choose a percolation batch size of about 20. For batchReadSize, we witness optimal performance at approximately 5% of the insertion batch size; for a batch of 1000 this translates to a read batch of about 50. For batchWriteSize, we determined that overall performance was approximately similar between our two tested values of 25 and 50. Due to limitations on the number of placeholders allowed in MySQL, our read and write query sizes are limited to below around 200 and 75, respectively.

[Figure 8: Hypertuning results for the three batch parameters.]

Running a Distributed System

We utilized Ray for distribution, launching one head node and multiple worker nodes. Ray allowed us to distribute our batch updates across the subtrees of the root worker. We utilized a virtual addressing scheme to allow our Merkle Tree implementation to be reused independently on each node. We ran our complete system on increasing transaction sizes while varying the number of worker nodes. We found that increasing the number of worker nodes helped us scale, but not perfectly: the dotted lines in Figure 9 would be horizontal under perfect scaling, but we instead see that double the transactions would require four times the nodes to scale perfectly.

[Figure 9: End-to-end performance as transaction counts and worker node counts grow.]

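The Hypertuning sweet spots above reduce to a simple rule of thumb; the helper below is hypothetical (constants and names are ours, not Angela's configuration):

    const (
        maxReadQuery  = 200 // MySQL placeholder limit on read query size
        maxWriteQuery = 75  // MySQL placeholder limit on write query size
    )

    // tunedParams derives batch parameters from the measured sweet spots:
    // each percolation goroutine takes ~2% of the insert batch and each
    // read query pulls ~5%, clamped by the placeholder limits above.
    func tunedParams(insertBatch int) (percolateSize, readSize, writeSize int) {
        percolateSize = insertBatch / 50 // ~2%: 1000 inserts -> batches of 20
        readSize = insertBatch / 20      // ~5%: 1000 inserts -> batches of 50
        if readSize > maxReadQuery {
            readSize = maxReadQuery
        }
        writeSize = 50 // 25 and 50 performed similarly; both under the limit
        return percolateSize, readSize, writeSize
    }
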
Conclusion

This project made clear the effects of multithreading and multiprocessing over a distributed system. We see significant performance improvements from using our algorithm to batch writes to the database, and we also see the overheads of network latency when connecting large systems together. With a client making requests to a Python server, which then makes requests through Ray to a distributed cluster of Go nodes, which in turn query an Aurora cluster, there is significant network latency that is only overshadowed by exceptionally large workloads. When using Ray, we see the effect of distributing across many EC2 instances and the latency of their communication. This project was a prototype meant to open more research into using distributed frameworks for concurrent Merkle Tree computation and distributed Merkle Tree storage. While we did not see perfect scaling in practice, which we believe is due to network and database latencies as well as imperfect distribution and limited utilization of our Ray nodes, our algorithm shows that concurrency and scalable distribution can be achieved while maintaining performance on large-scale production systems.

Computer Science 262A