Shuaicheng Ma et al.

RESEARCH Efficient logging and querying for -based cross-site genomic dataset access audit Shuaicheng Ma1*, Yang Cao2 and Li Xiong1

Abstract Background: Genomic data have been collected by different institutions and companies and need to be shared for broader use. In a cross-site genomic data sharing system, a secure and transparent access control audit module plays an essential role in ensuring the accountability. A centralized access log audit system is vulnerable to the single point of attack and also lack transparency since the log could be tampered by a malicious system administrator or internal adversaries. Several studies have proposed blockchain-based access audit to solve this problem but without considering the efficiency of the audit queries. The 2018 iDASH competition first track provides us with an opportunity to design efficient logging and querying system for cross-site genomic dataset access audit. We designed a blockchain-based log system which can provide a light-weight and widely compatible module for existing blockchain platforms. The submitted solution won the third place of the competition. In this paper, we report the technical details in our system. Methods: We present two methods: baseline method and enhanced method. We started with the baseline method and then adjusted our implementation based on the competition evaluation criteria and characteristics of the log system. To overcome obstacles of indexing on the immutable Blockchain system, we designed a hierarchical timestamp structure which supports efficient range queries on the timestamp field. Results: We implemented our methods in Python3, tested the scalability, and compared the performance using the test data supplied by competition organizer. We successfully boosted the log retrieval speed for complex AND queries that contain multiple predicates. For the range query, we boosted the speed for at least one order of magnitude. The storage usage is reduced by 25%. Conclusion: We demonstrate that Blockchain can be used to build a time and space efficient log and query genomic dataset audit trail. Therefore, it provides a promising solution for sharing genomic data with accountability requirement across multiple sites. Keywords: Blockchain; Genome; Cross-site genomic datasets; Access log audit arXiv:1907.07303v2 [cs.DB] 26 Jul 2019 Background from under $100 to more than $2,000, depending on With the rapid development of biomedical and com- the nature and complexity of the test [4]. One can putational technologies, a large amount of genomic test her gene easily and cheaply by using services data sets have been collected and analyzed in national from DNA-testing companies such as Ancestry and and international projects such as Human Genome 23andMe. Given the above, genomic data sets have Project [1] , the HapMap project [2] and the Genotype- been scattered around the world in different institu- Tissue Expression (GTEx) project [3], which yielded tions and companies. On the other hand, the poten- invaluable research data and extended the boundary tial business value of genomic data and privacy con- of human knowledge. Thanks to the advance of com- cerns [5,6,7] hinder the sharing of cross-sites genomic puter technology, the cost of genomic testing is drop- data. Notably, the General Data Protection Regulation ping exponentially. Nowadays, the testing price ranges (GDPR) restricts the exchange of personal data. Un- *Correspondence: [email protected] der GDPR, such sensitive data only could be accessed 1Department of Computer Science, Emory University, 400 Dowman Dr, Atlanta, GA, USA after obtaining the consent of data subjects (i.e., the Full list of author information is available at the end of the article one who owns the data) and providing accountability Shuaicheng Ma et al. Page 2 of 11

audit. This requires that any cross-site genomic data Ref ID, User, Activity, Resource, the task is to design sharing system should be equipped with a secure and a time/space− efficient data structure and mechanisms transparent access control module. to store and retrieve the logs based on Multichain ver- Blockchain technology has received increasing atten- sion 1.0.4 [40]. tion because it provides a new paradigm of value ex- change. Although it stems from cryptocurrency, many Competition setup and requirement. It is required studies have investigated the adoption of blockchain in that each entry in the data access log must be saved different application scenarios beyond financial domain individually as one transaction (i.e., participants can- that typically involve multiple parties with conflict of not save the entire file in just one transaction), and interests such as personal data sharing [8,9, 10], sup- all log data and intermediate data (such as index or ply chain [11, 12, 13], identity management [14, 15] and cache) must be saved on-chain (no off-chain data stor- medical data management [16, 17, 18, 19, 20, 21, 22]. age allowed). Competition participants can determine They show that using blockchain technology can re- how to represent and store each log entry in trans- duce friction and increase transparency. A blockchain actions. It does not need to be a plain text copy of system has several notable features: decentralization, the log entry. Also, the query implementation should immutability and transparency. These are achieved by allow a user to search the log using any field of one cryptographic hash, consensus algorithm and many log entry (i.e., node, id, user, resource, activity, times- other innovations from previously unrelated fields such tamp, and a “reference id” referring to the id of the as cryptography and distributed computation [23]. original resource request), any “AND” combination Due to the space limitation, we do not introduce more (e.g., node AND id AND user AND resource), and details of blockchain technologies and refer interested any timestamp range (e.g., from 1522000002418 to readers to surveys on blockchain [24, 25, 26, 27, 28, 29, 1522000011441) using a command-line interface. Also, 30, 31]. the user should be able to sort the returning results in Several studies investigated blockchain-based access ascending/descending order with any field (e.g., times- log audit [32, 33, 34] (we introduce them in the next tamp). There will be 4 nodes in the blockchain net- section). They focus on how to achieve the immutabil- work, and 4 log files to be stored. Users should be able ity of the log. However, none of them investigated the to query the data from any of the 4 sites. Participants efficiency of logging and querying for a blockchain sys- can implement any algorithms to store, retrieve, and tem at the application layer. On the other hand, a present the log data correctly and efficiently. few recent studies [35, 36, 37, 38] from com- munity consider a blockchain system as a distributed Evaluation Criteria. The logging/querying system database, and attempt to improve the performance of needs to demonstrate good performance (i.e., accu- such system by exploring new designs of bottom lay- rate query results) by using a testing dataset, which is ers (such as storage or ) of the different from the one provided for the participants. system. However, without considering the application The speed, storage/memory cost, and scalability of characteristics, such modifications on the back-end en- gine of the system may not have the desired perfor- each solution will be evaluated. The competition or- mance improvement on every application or even cause ganizer used the binary version of Multichain 1.0.4 on unexpected side effects. 64-bit Ubuntu 14.04 with the default parameters as the The 2018 iDASH competition first track, “Blockchain- test bed for fairness. No modification of the underly- based immutable logging and querying for cross-site ge- ing Multichain source code is allowed. The submitted nomic dataset access audit trail”, provides us with an executable binaries should be non-interactive (i.e., de- opportunity to explore a light-weight and widely com- pend only on parameters with no input required while patible access audit module for existing blockchain it works), and should contain a readme file to specify platforms. Our submitted solution won the third place the parameters. The organizer tested all submissions of the competition. In this paper, we report the system using 4 virtual machines, each with 2-Core CPU, 8GB design and technical details in our solution. RAMs and 100GB storage.

The competition task [39] Related work The goal of iDASH competition 2018 first track is The closest line of work to this competition is blockchain- to develop blockchain-based ledgering solutions to log based access log audit. Suzuki et al., [32] proposed a and query the user activities of accessing genomic method using blockchain as an audit-able communi- datasets across multiple sites. Concretely, given a ge- cation channel. This study is motivated by a similar nomic data access log file in which each entry in- problem studied in this paper: in a client-server sys- cludes seven attributes including Timestamp, Node, ID, tem, the logging on either server-side or client-side Shuaicheng Ma et al. Page 3 of 11

Figure 1 Overview of the logging system.

Logging Query User System Insert MultiChain Blockchain API

does not provide strict means of auditing, because the design a blockchain-based log system that can serve as host of the logging system could tamper the log. They a light-weight and widely compatible component for implemented a proof-of-concept system on top of Bit- the existing blockchain platforms. Especially, our so- coin by encoding the messages (i.e., API calls from lution is optimized for genomic dataset access auditing clients and Replies from the server) between clients under the requirements of the competition task. and the server into the transactions of bitcoin. Since the transactions are publicly available, they can be re- Method trieved and verified by an auditor as needed. The pro- We design a blockchain-based log system that is posed system is easy to use and convenient for a client- time/space efficient to store and retrieve genomic server system. However, answering the audit query dataset access audit trail. Our method only lever- using the proposed system may be time-consuming, ages the Blockchain mechanism and is not limited especially for a large-scale system serving millions of to any specific Blockchain implementation, such as clients, as each reply is returned in the form of a bit- Bitcoin[43], Ethereum[44]. We introduce an on-chain coin transaction. The maximum transaction process- indexing data structure which can be easily adapted ing capacity of bitcoin is estimated between 3.3 and 7 to any that use a key-value database as [41]. their local storage. In our development, we use Mul- Castaldo et al., [33] implemented a blockchain-based tichain version 1.0.4 as an interface between Bitcoin tamper-proof audit mechanism for OpenNCP (Open Blockchain and our insertion and query method. Mul- National Contact Points) [42], which is a system for tichain is a Bitcoin Blockchain fork. It conveniently exchanging eHealth data between countries in Eu- provides a feature, data stream, to allow us to use Bit- rope. The idea is similar to the one proposed in [32], coin Blockchain as an append-only key-value database. but dealing with data exchange instead of answering queries. They also encode the data that need to be Overview exchanged into the transactions, but the data are en- In Figure1, we illustrate the overview of the logging crypted using symmetric keys which are shared in ad- system, which is built on top of Multichain APIs. The vance between the sender and receiver through a se- core task is to design space and time efficient meth- cure channel. The author suggests to use Multichain ods for insertion and queries. As described in Section because it provides low overhead for the transactions “Technical details of the task”, there are three types of handling. primitive queries: point query, AND query, and range ProvChain [34] is a blockchain based data prove- query. There are seven fields in the given genomic nance architecture for assuring data operation (i.e., dataset: Timestamp, Node, ID, Ref ID, User, Activity, data access and data changes) in the cloud storage ap- Resource as shown in Table1. − plication. Different from the previous two studies, the major challenge is that the provenance data are also Timestamp Node ID Ref-ID User Activity Resource 152202801 1 1 1 1 REQ RESOURCE MOD FlyBase sensitive but still need to be validated by a third party. 152208352 1 2 1 1 VIEW RESOURCE MOD FlyBase The authors proposed an additional layer as prove- 152216966 1 3 3 6 FILE ACCESS GTEx nance auditor which interacts with a blockchain net- 152237149 1 9 9 10 REQ RESOURCE MOD SGD 1 The Sample Logs. work by blockchain receipts which include provenance entry for future validation. To the best of our knowledge, there is no study inves- For point query, the user can query on any field. For tigating the design of efficient logging and querying for AND query, the user can query on any combination a blockchain system at the application layer. The 2018 of fields. For range query, the user can query only on iDASH competition first track provides us with an op- timestamp field with a start and end timestamp. See portunity to explore such possibilities. We attempt to Table1&2 as a running example. Shuaicheng Ma et al. Page 4 of 11

Insertion • Insert(“152202801 1 1 1 1 REQ RESOURCE MOD FlyBase”) Algorithm 1: Point Query Queries Input: A, K //attribute and key • Point Query(Activity=”VIEW RESOURCE”) Output: l //a list of record • AND Query(ID=”2”, Node=”1”) r 1 l ← liststreamkeyitems(A,K) • Range Query(start=1522000000000, end=1522000100000) r 2 return lr Table 2 Insertion and Queries Examples.

Baseline method Algorithm 2: AND Query We first describe a naive method as a baseline. The Input: lAK // a list of attribute and key pairs baseline method leverages only three Multichain APIs Output: lr //a list of record 1 lr ← point query(lAK [0]A, lAK [0]K ) as shown in Table3. 2 foreach (A, K) ∈ lAK do 3 lr ← lr ∩ point query(A,K) Multichain APIs Description create[stream name] reate a stream(table) in 4 return lr database publish[streamname][key][value] Insert key-value pair to specific stream(table) liststreamkeyitems[stream name][key] Retrieve all items with the given key convert Timestamp Range Query into R point queries, Table 3 Multichain APIs Used in Our Methods. where R is the range of timestamp. The run time com- plexity is O(R), where R = range of timestamp. Insertion: First, we create K streams, where K is the number of fields. Multichain builds K tables in Algorithm 3: Timestamp Range Query its back-end key-value database. Second, we build K Input: ts, te // start timestamp and end timestamp key-value pairs, where the key is the attribute data Output: lr //a list of record and the value is entire record line. Finally, we con- 1 lr ← {} vert those K pairs into one Blockchain transaction and 2 for t = ts to te do 3 l ← l ∪ point query(”Timestamp”,t) publish it to Blockchain. The following Figure2 is an r r example conversion from a log record to blockchain 4 return lr transaction. We will use this example log record in the remaining sections. After the transaction is confirmed by Blockchain, Multichain decodes the transaction and Enhanced method insert each key-value pairs to its corresponding table. After testing the baseline solution, which will be dis- cussed in the result section, we found that the retrieve Timestamp Node ID Ref-ID User Activity Resource speed heavily depends on the number of API calls. 111 1 3 3 6 FILE ACCESS GTEx Therefore, the fewer API calls we use, the faster re- Transaction⇓ trieve speed we get. More specifically, we found three Timestamp Node ID Ref-ID Resource non-optimal issues: Key Value Key Value Key Value Key Value ... Key Value The entire record is duplicated K times where K is 111 V 1 V 3 V 3 V GTEx V • V = ”111 1 3 3 6 F ILE ACCESS GT Ex”. the number of fields, which is insufficient in terms of storage overheads. Figure 2 Log Record to Transaction Conversion. Since we need to query all results and intersect • them in local memory, AND query takes signifi- Point Query: The implementation of a point query is cant amount of memory when the number of AND straightforward which simply returns a list of records operations increases. as shown in Algorithm1. In this literature, we assume If the length of a given range query is n (typi- • 6 8 the run time complexity of all Multichain API is O(1). cally, n is ranging from 10 to 10 ), the baseline The run time complexity of Point Query is O(1). method naively translate the range query into n AND Query: AND query enables a user to query point queries and concatenate the results. with multiple keys. We convert AND query to multi- The blockchain-based auditing system is an append- ple point queries and intersect the result of all point only structure, so a data structure that keeps the min- queries. The run time complexity is O(K), where imum amount of information while maintaining the K = number of keys. efficiency is essential. The percentage of read(query) Timestamp Range Query: Given a start timestamp operations in the real-world auditing system is low and an end timestamp, Timestamp Range Query re- [45], therefore we trade retrieval speed for storage turns records whose timestamp is in this range. We cost. We redesign the key-value pairs in the blockchain Shuaicheng Ma et al. Page 5 of 11

transaction, modified the query algorithm accordingly has the smallest query result size is the most selective. and built a selectivity list based on data distribu- For the enhanced AND query, we call point query only tion. Most of all, we design a hierarchical timestamp one time for the most selective key then filter the re- structure which significantly reduces the number of sult in the memory. Since we only query once from queries(APIs) needed for the range query. Blockchain, the total memory usage is bounded by the largest query result. The run time complexity is O(1). Insertion: To address these problems, we redesign the key-value pairs. The key part remained the same (at- Algorithm 5: AND Query with selectivity list tribute data), but we removed the entire entry from Input: lAK , lS // a list of attribute and key pairs and a the value part. As a result, we removed all duplicated selectivity list values in the baseline method as shown in Figure3 Output: lr //a list of record 1 SK ← findMostSelectiveKey(lAK , ls) 2 lr ← point query(SKA,SKK ) Transaction 3 foreach (A, K) ∈ lAK do 4 l ← filter(l , A, K) Timestamp Node ID Ref-ID Resource r r Key Value Key Value Key Value Key Value ... Key Value 111 1 3 3 GTEx ∅ ∅ ∅ ∅ ∅ All duplicated V s are removed. Timestamp Range Query: Since Blockchain is an immutable structure, the common indexing tech- Figure 3 Transaction with empty values. niques, such as B-tree and R-tree, which require ad- justing/balancing the entire data structure according Point Query: Since we now have an empty value to the data distribution, won’t work. We introduce in the key-value database, we cannot use the key to a hierarchical timestamp structure, which is an in- get original record directly. We now take advantage of cremental data structure and matches the append- Blockchain transaction ID which is included in the re- only characteristics of the blockchain system. Our de- turning JSON file of liststreamkeyitems API. First, we sign significantly reduces the number of queries(APIs) get a list of TXID (transaction ID) with the given key. needed for a single range query. Second, we use another Multichain API, getrawtrans- The hierarchical timestamp structure consists of action, to get the matching transactions. Finally, we multiple levels. See Table4 as an example. The range rebuild the original record from the transaction where in the high level divides into multiple smaller range in the lower level. We denote each range part as Level- all attribute data are included. It is worth mention- Number:Starting Timestamp. A timestamp is recorded ing that the point query now requires T + 1 times in the corresponding part at all levels. In our running API calls to retrieve the records where T is the size example, a timestamp 111 will be recorded in L0:100, of the TXID list. Future study may able to combine L1:110, and L2:111 in Table4. those three steps into one if we modify Multichain ver- sion 1.0.4 source code, but it is not allowed in this L0 [100,200) competition. The run time complexity is O(R), where L1 [100,110) [110,120) [120,..) L2 100 ... 109 110 ... 119 120 ...... R = number of return record. Table 4 Simple Hierarchical Timestamp Structure

Algorithm 4: Point Query with additional step To build this structure, we need to slightly modify Input: A, K //attribute and key the insertion method by adding L streams where L is Output: lr //a list of record the number of levels, and we need to add L key-value 1 [TXIDs] ← liststreamkeyitems(A,K) 2 lr ← [] pairs to Blockchain transaction as well. See Figure4 3 foreach txid ∈ [T XIDs] do as an example. 4 T ← getrawtransaction (txid) 5 R ← rebuild(T) 6 append(lr,R) Transaction

7 return lr Timestamp Resource L0 L1 L3 Key Value ... Key Value Key Value Key Value Key Value 111 GTEx 100 110 111 ∅ ∅ ∅ ∅ ∅ AND Query: In order to reduce the retrieval cost, we build a selectivity list for attributes based on the Figure 4 Transaction with Hierarchical Timestamp Structure. example test data which was given by the competi- tion organizer. A selectivity list is based on the rank In our enhanced range query method, we recursively of result size of each attribute. The attribute which find the largest range in the hierarchical timestamp Shuaicheng Ma et al. Page 6 of 11

structure and use multiple point queries to retrieve 7 fields (Timestamp, Node, ID, Ref ID, User, Activity, the result. Resource). To illustrate, we provide− a few sample data in Table1. Algorithm 6: Timestamp Range Query with hier- To find the optimal number of levels and the step archical timestamp structure multiplier of two adjacent levels for the hierarchical timestamp structure (Figure4), we test all reasonable Input: ts, te // start timestamp and end timestamp Output: lr //a list of record parameter combinations by brute-force. For the given 1 lr ← list sample data, the optimal parameter for the number of 2 l, r ← findLargestRange(ts, te) levels is 3 and the step multiplier of two adjacent levels 3 while r 6= None do 4 append(lr, point query(l, r)) is 100. Future work may include finding the optimal 5 l, r ← findLargestRange(ts, te) number of levels in a more efficient way. 6 return lr Benchmark In our benchmark experiment, we show the scala- In the following example, we show the number of bility of our two methods alongside LevelDB[49] as queries(APIs) needed for our baseline range query and a reference, a popular key-value based database sys- enhanced range query. tem which is used to store data in many blockchain Range query from timestamp 109 to timestamp 120. systems[50][43][44]. Database system and Blockchain Baseline Method: do not share the same design goal: the former is usually q(0T 0, 109) q(0T 0, 110) ... q(0T 0, 120) 11 queries ∪ ∪ ∪ → administered by a centralized entity, and the latter in- Enhanced Method: tents to work at a trustless environment. Nevertheless, q(0L20, 109) q(0L10, 110) q(0L20, 120) 3 queries ∪ ∪ → this comparison offers useful insights of Blockchain We reduce the number of queries needed for range based log system which trades speed for . PL Ri query from RTe−Ts to i=0 , where Ri+1 = Ri rLi We simulate the enhanced insertion, the enhanced mod rLi ,R0 = RTe−Ts and rLi is the elemental range point query, and the enhanced AND query behavior at level Li. in LevelDB. For range query, we use LevelDB native The run time complexity of the enhanced range method so we can properly examine our hierarchical PL Ri query is O( i=0 ). timestamp structure. In all tests, we run 10 rounds rLi for each methods with respect to varying the number Further optimizations of records. We calculate the average and the standard The database normalization can be used for both base- deviation from the results. We notice that the stan- line and enhanced solution. According to the given ge- dard deviation is extremely small which shows the lit- nomic datasets, Ref-ID refers back to the same origi- tle trace in all figures expect Figure5(Point Query). nal ID which means those User and Resource are the This is due to the identical environment and the setup same. For this reason, we can exclude User and Re- of our simulated blockchain nodes. source in Blockchain transaction.

Results Scalability Test: Queries Implementation environment Figure5 shows query time with respect to the vary- We used Python3 as our main programming lan- ing number of records for point query, range query, guage to develop our solution, Savior [46] to inter- AND query. For the point query test, the response act with Multichain API and Docker [47] to simu- time is determined by the result size. As the num- late 4 Blockchain nodes. Additionally, we created some ber of records increases, the result size increases and bash scripts to automatically setup Blockchain nodes the response time increases. The response time of the and Multichain environment. We also wrote a bench- enhanced method is worse than the baseline method mark program to compare our baseline method and because of the addition API calls which we introduced enhanced method. Our code is available online [48]. in the enhanced point query. For the range query test, The specifications of our testing machine are as fol- the performance is constant since the result size of lows: 6 cores CPU(i7 8700k), 32 GB of RAM and 6TB certain time range is constant. It is worth mention- of HDD with Ubuntu 16.04 as the operating system. ing that our enhanced range query method have very We used the sample testing data supplied by the close performance comparing to the native LevelDB competition organizer to benchmark our implementa- range query method. For AND query, since it consists tion. The sample testing data consists of 4 files, one per of point query, the response time increases with the node. Each file has 105 entries of log records which has increasing number of records. It is worth mentioning Shuaicheng Ma et al. Page 7 of 11

Figure 5 Scalability Test: Queries.

that the selectivity list design in our enhanced AND blockchain size information is collected by calling Mul- query method offsets the drawback of the enhanced tichain API. Since Blockchain and LevelDB measure point query method when the number of keys is larger their size in different ways, we exclude LevelDB in this than 2. test. The figure suggests that the enhanced method uses less storage than the baseline method. The dupli- Scalability Test: Insertion cation removal from the blockchain transaction in the Figure6 shows the completion time of insertion enhanced method works as designed. methods with respect to varying the number of records. The insertion time is depended on the trans- action size. The insertion times of the two methods are approximately the same. The enhanced method needs more key-value pairs to support hierarchical times- tamp indexing structure. However, the empty values in key-value pairs offset this transaction size increment.

Figure 7 Scalability Test: Storage.

Detailed Comparison In this section, we show a detailed performance dif- Figure 6 Scalability Test: Insertion. ference of 3 query types in the baseline method, en- hanced method, and LevelDB. We use the fixed 1000 records in the remaining tests. Point Query: Figure8 shows the query response Scalability Test: Storage time for different attributes. The enhanced method Figure7 shows the total blockchain size in bytes performance is worse than the baseline method, be- with respect to varying the number of records. The cause of the additional API calls in the enhanced Shuaicheng Ma et al. Page 8 of 11

method. The rank in the result also matches the rank AND Query: Figure 10 shows the query time with in selectivity list which indicates the return record size. respect to varying the number of keys. We test all com- The return record size of Activity is the largest among binations of keys. For example, for 2 keys test, we test the attributes. In other words, Activity has the low- all 21 combinations(7 choose 2) and average the result. est selective and need more API calls to get the result It is much easier to find a more selective key when the than other attributes, so it has the worst performance number of keys is increasing. This is the reason why difference. the enhanced method has a downward slope. When there are only 2 keys, the enhanced method has high possibility to find a low selective key. As a result, when AND query takes a low selective key, it requires a long response time.

Figure 8 Point Query.

Range Query: Figure9 shows the query response Figure 10 AND Query. time with respect to varying the time range. The en- hanced method is at least one order of magnitude better than the baseline method. Moreover, the En- Discussion hanced method has almost the same constant time Our design is heavily governed by the competition performance as LevelDB native range query method. It requirements, evaluation criteria[39], and Multichain proves that our hierarchical timestamp structure works 1.0.4 capability. In this paper, we intend to use Multi- as designed. chain as only an interface, so our design can be applied to any arbitrary blockchain system. Multichain 1.0.4 does not allow an item to have multiple keys and com- petition does not allow participators to modify Multi- chain, so we have to manually construct the blockchain transaction. There are two major developments for the future work: 1) A new interface which encodes/decodes the log entry to/from blockchain transaction more ef- ficient. For example, in Bitcoin blockchain transaction script, it can write entire log entry only once in the blockchain transaction and let local interface client translate the script to the database. A specific Bit- coin interface for this log system can significantly re- duce the transaction size. 2) A new Blockchain ori- ented database system, such as Forkbase[37]. It aims to design a new key-value architecture to reduce the development efforts of the above application and pro- vide efficient analytical query performance. It is pos- Figure 9 Range Query. sible to replace the key-value engines in the existing blockchain platforms for better query performance. Shuaicheng Ma et al. Page 9 of 11

In this paper, we focus on designing efficient log- one order of magnitude. We claim that our hierarchical ging and querying schemes for immutable blockchain timestamp structure design is Blockchain implementa- systems, and assume the blockchain network has been tion independent. It can be adapted to any Blockchain well-established under a specific consensus algorithm (e.g., Bitcoin, Ethereum, Hyperledger) with the help and acceptable transaction throughput. In the follow- of an intermediary, such as Multichain. ing, we discuss how they may affect our solution.

Consensus algorithm may affect the performance of Abbreviations insertion functions because a newly generated access GDPR: General data protection regulation; API: Application program log (as a transaction) need to be accepted by all node interface; JSON: JavaScript object notation; TXID: Transaction in the network (achieving a consensus on the next identification; GTEx: Genotype-tissue expression block) in order to be stored in the ledger. Consen- Ethics approval and consent to participate sus algorithm manifests the transaction throughput, Not applicable. which is majorly controlled by a predefined parameter Consent for publication in Multichain called mining-diversity (the default con- Not applicable. figuration is 0.3). If the transaction throughput is low, Availability of data and materials the insertion would be insufficient since it may be sus- The data that support the findings of this study are available from the iDash workshop. The code is available at pended until the previous batch of logs is finished. The Github(https://github.com/mshuaic/Blockchain med/). transaction throughput also affects the audit queries Competing interests because the query is performed on the locally syn- The authors declare that they have no competing interests. chronized ledger. Under low transaction throughput, Author’s contributions a newly generated log may take a long time to be in- SM, YC and LX designed the solution and wrote the manuscript. SM cluded in the ledger and synchronized to a node so implemented the code. All authors read and approved the final manuscript. that the query on a node may not be able to provide Acknowledgements the accurate real-time answer. We thank Dr. Tsung-Ting Kuo, the organizer of iDASH competition 2018 Further, the access log could be private since it first track, for providing informative Q&A and helpful advice! records all of the queries issued by a user. This is a Funding challenge for existing blockchain platforms since the This work was supported by NIH R01GM118609, Georgia CTSA under grant NIH UL1TR002378, NSF under grant CNS-1618932, the AFOSR ledger is public to every node in the network for in- DDDAS program under grant FA9550-121-0240, the Japan Society for the creasing transparency and security. A recent version Promotion of Science (JSPS) Grant-in-Aid for Scientific Research No. of Hyperledger Fabric [50] includes a new function for 17H06099, No. 18H04093 and No. 19K20269. The publication costs were funded by NIH R01GM118609, Georgia CTSA under grant NIH this problem. The idea is dividing the ledger to differ- UL1TR002378, Japan Society for the Promotion of Science (JSPS) No. ent channels and selectively sharing a channel among 17H06099, No. 18H04093, No. 19K20269. a subset of users. There are also other efforts for this Author details problem by adopting secure multiparty computation 1Department of Computer Science, Emory University, 400 Dowman Dr, [9], zero-knowledge proof [10] or trusted hardware [51]. Atlanta, GA, USA. 2Department of Social Informatics, Kyoto University, Although this problem is beyond the scope of this Kyoto, Japan. competition, our solution could be extended using the References 1. Collins, F.S., Morgan, M., Patrinos, A.: The human genome project: above techniques. lessons from large-scale biology. Science 300(5617), 286–290 (2003) 2. Consortium, I.H.: The international HapMap project. Nature Conclusions 426(6968), 789 (2003) 3. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., In this paper, we presented two solutions for blockchain- Hasz, R., Walters, G., Garcia, F., Young, N., Foster, B., Moser, M., based logging and querying genomic dataset audit Karasik, E., Gillard, B., Ramsey, K., Sullivan, S., Bridge, J., Magazine, trail. We built a baseline solution and then adjusted H., Syron, J., Fleming, J., Siminoff, L., Traino, H., Mosavel, M., Barker, L., Jewell, S., Rohrer, D., Maxim, D., Filkins, D., Harbach, P., our implementation based on the evaluation criteria of Cortadillo, E., Berghuis, B., Turner, L., Hudson, E., Feenstra, K., the competition[39] and the general real-world charac- Sobin, L., Robb, J., Branton, P., Korzeniewski, G., Shive, C., Tabor, teristics of log systems[52]. The blockchain-based log D., Qi, L., Groch, K., Nampally, S., Buia, S., Zimmerman, A., Smith, A., Burges, R., Robinson, K., Valentino, K., Bradbury, D., Cosentino, system is an append-only structure, so the storage in- M., Diaz-Mayoral, N., Kennedy, M., Engel, T., Williams, P., Erickson, creases monotonically. In the real world, the percent- K., Ardlie, K., Winckler, W., Getz, G., DeLuca, D., MacArthur, D., age of writing operation(insertion) is much higher than Kellis, M., Thomson, A., Young, T., Gelfand, E., Donovan, M., Meng, Y., Grant, G., Mash, D., Marcus, Y., Basile, M., Liu, J., Zhu, J., Tu, the portion of reading operation(query) in the work- Z., Cox, N.J., Nicolae, D.L., Gamazon, E.R., Im, H.K., Konkashbaev, load [52]. Based on the above two reasons, we decided A., Pritchard, J., Stevens, M., Flutre, T., Wen, X., Dermitzakis, E.T., to prioritize the storage space over retrieval speed and Lappalainen, T., Guigo, R., Monlong, J., Sammeth, M., Koller, D., Battle, A., Mostafavi, S., McCarthy, M., Rivas, M., Maller, J., Rusyn, insertion speed. We can reduce the storage cost by I., Nobel, A., Wright, F., Shabalin, A., Feolo, M., Sharopova, N., 25% and increase the range query speed by at least Sturcke, A., Paschal, J., Anderson, J.M., Wilder, E.L., Derr, L.K., Shuaicheng Ma et al. Page 10 of 11

Green, E.D., Struewing, J.P., Temple, G., Volpi, S., Boyer, J.T., of the American Medical Informatics Association 24(6), 1211–1220 Thomson, E.J., Guyer, M.S., Ng, C., Abdallah, A., Colantuoni, D., (2017) Insel, T.R., Koester, S.E., Little, A.R., Bender, P.K., Lehner, T., Yao, 25. Underwood, S.: Blockchain beyond bitcoin. Commun. ACM 59(11), Y., Compton, C.C., Vaught, J.B., Sawyer, S., Lockhart, N.C., 15–17 (2016) Demchok, J., Moore, H.F.: The genotype-tissue expression (GTEx) 26. Sun, J., Yan, J., Zhang, K.Z.K.: Blockchain-based sharing services: project. Nature Genetics 45, 580–585 (2013) What blockchain technology can contribute to smart cities. Financial 4. Wetterstrand, K.A.: DNA sequencing costs: data from the NHGRI Innovation 2(1), 26 (2016) genome sequencing program (GSP) (2013) 27. W¨orner,D., von Bomhard, T., Schreier, Y.-P., Bilgeri, D.: The bitcoin 5. Malin, B.A., Emam, K.E., O’Keefe, C.M.: Biomedical data privacy: ecosystem: Disruption beyond financial services? (2016) problems, perspectives, and recent advances. Journal of the American 28. Bonneau, J., Miller, A., Clark, J., Narayanan, A., Kroll, J.A., Felten, Medical Informatics Association 20(1), 2–6 (2013) E.W.: SoK: research perspectives and challenges for bitcoin and 6. Gkoulalas-Divanis, A., Loukides, G., Sun, J.: Publishing data from cryptocurrencies. In: 2015 IEEE Symposium on Security and Privacy, electronic health records while preserving privacy: A survey of pp. 104–121 (2015) algorithms. Journal of Biomedical Informatics 50, 4–19 (2014) 29. Tschorsch, F., Scheuermann, B.: Bitcoin and beyond: A technical 7. Naveed, M., Ayday, E., Clayton, E.W., Fellay, J., Gunter, C.A., survey on decentralized digital currencies. IEEE Communications Hubaux, J.-P., Malin, B.A., Wang, X.: Privacy in the genomic era. Surveys Tutorials 18(3), 2084–2123 (2016) ACM Comput. Surv. 48(1), 6–1644 (2015) 30. Pilkington, M.: Blockchain technology: principles and applications. 8. Zyskind, G., Nathan, O., Pentland, A.: Decentralizing privacy: Using Research handbook on digital transformations, 225 (2016) blockchain to protect personal data. In: Security and Privacy 31. Zheng, Z., Xie, S., Dai, H., Chen, X., Wang, H.: An overview of Workshops (SPW), 2015 IEEE, pp. 180–184 (2015) blockchain technology: Architecture, consensus, and future trends. In: 9. Zyskind, G., Nathan, O., Pentland, A.: Enigma: Decentralized 2017 IEEE International Congress on Big Data (BigData Congress), computation platform with guaranteed privacy. arXiv:1506.03471 [cs] pp. 557–564 (2017) (2015) 32. Suzuki, S., Murai, J.: Blockchain as an audit-able communication 10. Froelicher, D., Egger, P., Sousa, J.S., Raisaro, J.L., Huang, Z., channel. In: 2017 IEEE 41st Annual Computer Software and Mouchet, C., Ford, B., Hubaux, J.-P.: UnLynx: a decentralized system Applications Conference (COMPSAC), vol. 2, pp. 516–522 (2017) for privacy-conscious data sharing. Proceedings on Privacy Enhancing 33. Castaldo, L., Cinque, V.: Blockchain-based logging for the cross-border Technologies 2017(4), 232–250 (2017) exchange of eHealth data in europe. In: Gelenbe, E., Campegiani, P., 11. Hackius, N., Petersen, M.: Blockchain in logistics and supply chain : Czach´orski,T., Katsikas, S.K., Komnios, I., Romano, L., Tzovaras, D. trick or treat? In: Proceedings of the Hamburg International (eds.) Communications in Computer and Information Science, pp. Conference of Logistics (HICL), pp. 3–18 (2017) 46–56 (2018) 12. Garc´ıa-Ba˜nuelos,L., Ponomarev, A., Dumas, M., Weber, I.: Optimized 34. Liang, X., Shetty, S., Tosh, D., Kamhoua, C., Kwiat, K., Njilla, L.: execution of business processes on blockchain. In: Lecture Notes in ProvChain: a blockchain-based data provenance architecture in cloud Computer Science, pp. 130–146 (2017) environment with enhanced privacy and availability. In: 2017 17th 13. Abeyratne, S.A., Monfared, R.P.: Blockchain ready manufacturing IEEE/ACM International Symposium on Cluster, Cloud and Grid supply chain using distributed ledger (2016) Computing (CCGRID), pp. 468–477 (2017) 14. Azouvi, S., Al-Bassam, M., Meiklejohn, S.: Who am i? secure identity 35. Dinh, T.T.A., Liu, R., Zhang, M., Chen, G., Ooi, B.C., Wang, J.: registration on distributed ledgers. In: Lecture Notes in Computer Untangling blockchain: A data processing of blockchain systems. Science, pp. 373–389 (2017) TKDE (2017) 15. Yasin, A., Liu, L.: An online identity and smart contract management 36. Dinh, T.T.A., Wang, J., Chen, G., Liu, R., Ooi, B.C., Tan, K.-L.: system. In: 2016 IEEE 40th Annual Computer Software and BLOCKBENCH: a framework for analyzing private blockchains. In: Applications Conference (COMPSAC), vol. 2, pp. 192–198 (2016) SIGMOD, pp. 1085–1100 (2017) 16. Kuo, T.-T., Ohno-Machado, L.: ModelChain: decentralized 37. Wang, S., Dinh, T.T.A., Lin, Q., Xie, Z., Zhang, M., Cai, Q., Chen, privacy-preserving healthcare predictive modeling framework on private G., Ooi, B.C., Ruan, P.: Forkbase: an efficient storage engine for blockchain networks. arXiv:1802.01746 [cs] (2018) blockchain and forkable applications. Proceedings of the VLDB 17. Yue, X., Wang, H., Jin, D., Li, M., Jiang, W.: Healthcare data Endowment 11(10), 1137–1150 (2018) gateways: Found healthcare intelligence on blockchain with novel 38. Xu, Z., Han, S., Chen, L.: CUB, a consensus unit-based storage privacy risk control. Journal of Medical Systems 40(10), 218 (2016) scheme for blockchain system. In: ICDE, p. 12 (2018) 18. Xia, Q., Sifah, E.B., Asamoah, K.O., Gao, J., Du, X., Guizani, M.: 39. iDASH Secure Genome Analysis Competition 2018, GMC Medical MeDShare: trust-less medical data sharing among cloud service Genomics, 2019 providers via blockchain. IEEE Access 5, 14757–14767 (2017) 40. MultiChain Private Blockchain White Paper. https: 19. Azaria, A., Ekblaw, A., Vieira, T., Lippman, A.: MedRec: using //www.multichain.com/download/MultiChain-White-Paper.pdf blockchain for medical data access and permission management. In: Accessed 4 June 2019 2016 2nd International Conference on Open and Big Data (OBD), pp. 41. Croman, K., Decker, C., Eyal, I., Gencer, A.E., Juels, A., Kosba, A., 25–30 (2016) Miller, A., Saxena, P., Shi, E., Sirer, E.G., Song, D., Wattenhofer, R.: 20. Genestier, P., Zouarhi, S., Limeux, P., Excoffier, D., Prola, A., Sandon, On scaling decentralized blockchains. In: Financial Cryptography and S., Temerson, J.-M.: Blockchain for consent management in the Data Security (FC), pp. 106–125 (2016) ehealth environment: A nugget for privacy and security challenges. 42. Fonseca, M., Karkaletsis, K., Cruz, I.A., Berler, A., Oliveira, I.C.: Journal of the International Society for Telemedicine and eHealth 5, 24 OpenNCP: a novel framework to foster cross-border e-health services. (2017) Studies in Health Technology and Informatics 210, 617–621 (2015) 21. Choudhury, O., Sarker, H., Rudolph, N., Foreman, M., Fay, N., 43. Bitcoin. https://bitcoin.org/en/ Accessed 4 June 2019 Dhuliawala, M., Sylla, I., Fairoza, N., Das, A.K.: Enforcing human 44. Ethereum. https://www.ethereum.org/ Accessed 4 June 2019 subject regulations using blockchain and smart contracts. Blockchain 45. Roselli, D., Anderson, T.E.: Characteristics of file system workloads in Healthcare Today (2018) (1998) 22. Li, C., Cao, Y., Hu, Z., Yoshikawa, M.: Blockchain-based bidirectional 46. A Python Wrapper for Multichain Json-RPC API. updates on fine-grained medical data. In: 2019 IEEE 35th International https://github.com/DXMarkets/Savoir Accessed 4 June 2019 Conference on Data Engineering Workshops (ICDEW), pp. 22–27 47. Docker. https://www.docker.com/ Accessed 4 June 2019 (2019) 48. Our Code at Github. https://github.com/mshuaic/Blockchain_med 23. Narayanan, A., Clark, J.: Bitcoin’s academic pedigree. Commun. ACM Accessed 4 June 2019 60(12), 36–45 (2017) 49. LevelDB. https://github.com/google/leveldb Accessed 4 June 24. Kuo, T.-T., Kim, H.-E., Ohno-Machado, L.: Blockchain distributed 2019 ledger technologies for biomedical and health care applications. Journal 50. Androulaki, E., Barger, A., Bortnikov, V., Cachin, C., Christidis, K., Shuaicheng Ma et al. Page 11 of 11

De Caro, A., Enyeart, D., Ferris, C., Laventman, G., Manevich, Y.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference, p. 30 (2018) 51. Hynes, N., Dao, D., Yan, D., Cheng, R., Song, D.: A demonstration of sterling: A privacy-preserving data marketplace. Proc. VLDB Endow. 11(12), 2086–2089 (2018) 52. Rosenblum, M., Ousterhout, J.K.: The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10(1), 26–52 (1992). doi:10.1145/146941.146943