Wireless Communications and Mobile Computing

Secure Computation on 4G/5G Enabled Internet-of-Things

Lead Guest Editor: Karl Andersson Guest Editors: Ilsun You, Rahim Rahmani, and Vishal Sharma Secure Computation on 4G/5G Enabled Internet-of-Things Wireless Communications and Mobile Computing

Secure Computation on 4G/5G Enabled Internet-of-Things

Lead Guest Editor: Karl Andersson Guest Editors: Ilsun You, Rahim Rahmani, and Vishal Sharma © 2019 Hindawi. All rights reserved.

This is a special issue published in “Wireless Communications and Mobile Computing.” All articles are open access articles distributed under the Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, pro- vided the original work is properly cited. Editorial Board

Javier Aguiar, Spain Oscar Esparza, Spain Maode Ma, Singapore Ghufran Ahmed, Pakistan Maria Fazio, Italy Imadeldin Mahgoub, USA Wessam Ajib, Canada Mauro Femminella, Italy Pietro Manzoni, Spain Muhammad Alam, China Manuel Fernandez-Veiga, Spain Álvaro Marco, Spain Eva Antonino-Daviu, Spain Gianluigi Ferrari, Italy Gustavo Marfia, Italy Shlomi Arnon, Israel Ilario Filippini, Italy Francisco J. Martinez, Spain Leyre Azpilicueta, Mexico Jesus Fontecha, Spain Davide Mattera, Italy Paolo Barsocchi, Italy Luca Foschini, Italy Michael McGuire, Canada Alessandro Bazzi, Italy A. G. Fragkiadakis, Greece Nathalie Mitton, France Zdenek Becvar, Czech Republic Sabrina Gaito, Italy Klaus Moessner, UK Francesco Benedetto, Italy Óscar García, Spain Antonella Molinaro, Italy Olivier Berder, France Manuel García Sánchez, Spain Simone Morosi, Italy Ana M. Bernardos, Spain L. J. García Villalba, Spain Kumudu S. Munasinghe, Australia Mauro Biagi, Italy José A. García-Naya, Spain Enrico Natalizio, France Dario Bruneo, Italy Miguel Garcia-Pineda, Spain Keivan Navaie, UK Jun Cai, Canada A.-J. García-Sánchez, Spain Thomas Newe, Ireland Zhipeng Cai, USA Piedad Garrido, Spain Tuan M. Nguyen, Vietnam Claudia Campolo, Italy Vincent Gauthier, France Petros Nicopolitidis, Greece Gerardo Canfora, Italy Carlo Giannelli, Italy Giovanni Pau, Italy Rolando Carrasco, UK Carles Gomez, Spain Rafael Pérez-Jiménez, Spain Vicente Casares-Giner, Spain Juan A. Gómez-Pulido, Spain Matteo Petracca, Italy Luis Castedo, Spain Ke Guan, China Nada Y. Philip, UK Ioannis Chatzigiannakis, Italy Antonio Guerrieri, Italy Marco Picone, Italy Lin Chen, France Daojing He, China Daniele Pinchera, Italy Yu Chen, USA Paul Honeine, France Giuseppe Piro, Italy Hui Cheng, UK Sergio Ilarri, Spain Vicent Pla, Spain Ernestina Cianca, Italy Antonio Jara, Switzerland Javier Prieto, Spain Riccardo Colella, Italy Xiaohong Jiang, Japan Rüdiger C. Pryss, Germany Mario Collotta, Italy Minho Jo, Republic of Korea Sujan Rajbhandari, UK Massimo Condoluci, Sweden Shigeru Kashihara, Japan Rajib Rana, Australia Daniel G. Costa, Brazil Dimitrios Katsaros, Greece Luca Reggiani, Italy Bernard Cousin, France Minseok Kim, Japan Daniel G. Reina, Spain Telmo Reis Cunha, Portugal Mario Kolberg, UK Jose Santa, Spain Igor Curcio, Finland Nikos Komninos, UK Stefano Savazzi, Italy Laurie Cuthbert, Macau Juan A. L. Riquelme, Spain Hans Schotten, Germany Donatella Darsena, Italy Pavlos I. Lazaridis, UK Patrick Seeling, USA Pham Tien , Japan Tuan Anh Le, UK Muhammad Z. Shakir, UK AndrédeAlmeida,Brazil Xianfu Lei, China Mohammad Shojafar, Italy Antonio De Domenico, France Hoa Le-Minh, UK Giovanni Stea, Italy Antonio de la Oliva, Spain Jaime Lloret, Spain Enrique Stevens-Navarro, Mexico Gianluca De Marco, Italy Miguel López-Benítez, UK Zhou Su, Japan Luca De Nardis, Italy Martín López-Nores, Spain Luis Suarez, Russia Liang Dong, USA Javier D. S. Lorente, Spain Ville Syrjälä, Finland Mohammed El-Hajjar, UK Tony T. Luo, Singapore Hwee Pink Tan, Singapore Pierre-Martin Tardif, Canada Juan F. Valenzuela-Valdés, Spain Jie Yang, USA Mauro Tortonesi, Italy Aline C. Viana, France Sherali Zeadally, USA Federico Tramarin, Italy Enrico M. Vitucci, Italy Jie Zhang, UK Reza Monir Vaghefi, USA Honggang Wang, USA Meiling Zhu, UK Contents

Secure Computation on 4G/5G Enabled Internet-of-Things Karl Andersson ,IlsunYou , Rahim Rahmani, and Vishal Sharma Editorial (1 page), Article ID 3978193, Volume 2019 (2019)

Toward Privacy-Preserving Shared Storage in Untrusted P2P Networks Sandi Rahmadika and Kyung-Hyune Rhee Research Article (13 pages), Article ID 6219868, Volume 2019 (2019)

UAV-Undertaker: Securely Verifiable Remote Erasure Scheme with a Countdown-Concept for UAV via Randomized Data Synchronization Sieun Kim ,Taek-YoungYoun , Daeseon Choi ,andKi-WoongPark Research Article (11 pages), Article ID 8913910, Volume 2019 (2019)

Authorized -Side Deduplication Using CP-ABE in Cloud Storage Taek-Young Youn ,Nam-SuJho ,KyungHyuneRhee , and Sang Uk Shin Research Article (11 pages), Article ID 7840917, Volume 2019 (2019)

A Multiuser Identification Algorithm Based on Internet of Things Kaikai Deng , Ling Xing , Mingchuan Zhang ,HonghaiWu ,andPingXie Research Article (11 pages), Article ID 6974809, Volume 2019 (2019)

Threat Assessment for Android Environment with Connectivity to IoT Devices from the Perspective of Situational Awareness Mookyu Park , Jaehyeok Han, Haengrok Oh, and Kyungho Lee Research Article (14 pages), Article ID 5121054, Volume 2019 (2019)

Radio Environment Map Construction by Kriging Algorithm Based on Mobile Crowd Sensing Zhifeng Han , Jianxin Liao ,QiQi , Haifeng Sun, and Jingyu Wang Research Article (12 pages), Article ID 4064201, Volume 2019 (2019)

Enhanced Android App-Repackaging Attack on In-Vehicle Network Yousik Lee, Samuel Woo, Jungho Lee, Yunkeun Song, Heeseok Moon, and Dong Hoon Lee Research Article (13 pages), Article ID 5650245, Volume 2019 (2019)

Automatically Traceback RDP-Based Targeted Attacks ZiHan Wang ,ChaoGeLiu ,JingQiu ,ZhiHongTian , Xiang Cui, and Shen Su Research Article (13 pages), Article ID 7943586, Volume 2018 (2019)

Enjoy the Benefit of Network Coding: Combat Pollution Attacks in 5G Multihop Networks Jian Li , Tongtong Li, Jian Ren, and Han-Chieh Chao Research Article (13 pages), Article ID 3473910, Volume 2018 (2019)

Anonymous Communication via Anonymous Identity-Based Encryption and Its Application in IoT Liaoliang Jiang, Tong Li , Xuan Li, Mohammed Atiquzzaman, Haseeb Ahmad, and Xianmin Wang Research Article (8 pages), Article ID 6809796, Volume 2018 (2019) Secure Storage and Retrieval of IoT Data Based on Private Information Retrieval Khaled Riad and Lishan Ke Research Article (8 pages), Article ID 5452463, Volume 2018 (2019)

Multiresolution Face Recognition through Virtual Faces Generation Using a Single Image for One Person Hae-Min Moon, Min-Gu Kim, Ju-Hyun Shin ,andSungBumPan Research Article (8 pages), Article ID 7584942, Volume 2018 (2019) Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 3978193, 1 page https://doi.org/10.1155/2019/3978193

Editorial Secure Computation on 4G/5G Enabled Internet-of-Things

Karl Andersson ,1 Ilsun You ,2 Rahim Rahmani,3 and Vishal Sharma 2

1 LuleaUniversityofTechnology,Lule˚ a,˚ Sweden 2Soonchunhyang University, Asan, Republic of Korea 3Stockholm University, Stockholm, Sweden

Correspondence should be addressed to Karl Andersson; [email protected]

Received 30 April 2019; Accepted 30 April 2019; Published 2 June 2019

Copyright ©  Karl Andersson et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

e rapid development of Internet-of-ings (IoT) tech- assessment, authorized client-side deduplication in cloud niques in G/ G deployments is witnessing the generation storage, radio environment map construction, analysis of of massive amounts of data which are collected, stored, pro- the vulnerabilities of connected car environments, combat cessed, and presented in an easily interpretable form. Analysis pollution attacks in G multihop networks, automatically ofIoT data helpsprovide smart services such assmart homes, traceback RDP-based targeted ransomware attacks, multires- smart energy, smart health, and smart environments through olution face recognition through virtual faces generation, G and G technologies. At the same time, the threat of anonymous communication via anonymous identity-based the cyberattacks and issues with mobile is encryption, and Secure Storage and Retrieval of IoT Data. becoming increasingly severe, which introduces new chal- lenges for the security of IoT systems and applications and Conflicts of Interest the privacy of individuals thereby. Protecting IoT data privacy while enabling data availability is an urgent but difficult task. e editors declare that they have no conflicts of interest Data privacy in a distributed environment like IoT can regarding the publication of this special issue. be attained through secure multiparty computation. An emerging area of potential applications for secure computa- tion is to address privacy concerns in data aggregation and Acknowledgments analysis to match the explosive growth of the amount of As always, we appreciate the high quality submissions from IoT data. However, the inherent complexity of IoT systems authors and the support of the community of reviewers. really complicates the design and deployment of efficient, interoperable, and scalable secure computation mechanisms. Karl Andersson As a result, there is an increasing demand for the development Ilsun You of new secure computation methods and tools which can fill Rahim Rahmani in the gap between security and practical usage in IoT. Vishal Sharma escopeofthisspecialissueisinlinewithrecent contributions from academia and industry on the recent activities that tackle the technical challenges making com- puting secure on G/ G enabled Internet-of-ings. For the current issue, we are pleased to introduce a collection of papers covering a range of topics such as securely verifiable remote erasure schemes, multiuser identification algorithms, privacy-preserving shared storage, situational aware threat Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 6219868, 13 pages https://doi.org/10.1155/2019/6219868

Research Article Toward Privacy-Preserving Shared Storage in Untrusted Blockchain P2P Networks

Sandi Rahmadika 1 and Kyung-Hyune Rhee 2

1 Interdisciplinary Program of Information Security, Graduate School, Pukyong National University, Republic of Korea 2Department of IT Convergence and Application Engineering, Pukyong National University, Republic of Korea

Correspondence should be addressed to Kyung-Hyune Rhee; [email protected]

Received 18 January 2019; Revised 9 March 2019; Accepted 2 April 2019; Published 16 May 2019

Academic Editor: Ilsun You

Copyright © 2019 Sandi Rahmadika and Kyung-Hyune Rhee. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Te shared storage is essential in the decentralized system. A straightforward storage model with guaranteed privacy protection on the peer-to-peer network is a challenge in the blockchain technology. Te decentralized storage system should provide the privacy for the parties since it contains numerous data that are sensitive and dangerous if misused by maliciously. In this paper, we present a model for shared storage on a blockchain network which allows the authorized parties to access the data on storage without having to reveal their identity. Ring signatures combined with several protocols are implemented to disguise the signer identity thereby the observer is unlikely to determine the identity of the parties. We apply our proposed scheme in the healthcare domain, namely, decentralized personal health information (PHI). In addition, we present a dilemma to improve performance in a decentralized system.

1. Introduction system. Te surveys indicate that users ofen do not fully trust to store their data to third parties [5]. Tere are decentralized Since being introduced to the public through the rise of storage providers that provide the alternative services to , blockchain has attracted a lot of attention among protect the user's privacy such as [6] and GNUnet researchers, especially the way it deals in a transaction with- [7]. However, those services still have some drawbacks such out involving the third parties. Te blockchain technology as free-rider problem. More precisely, the provider is less reduces the transaction costs and it improves the efciency motivated to keep improving system reliability due to the and reliability of the decentralized system in general [1]. Due fact that there is no signifcant beneft for the provider to to its merits, blockchain has been developed in various felds preserve the users' data. Apart from the free-rider problem, of study such as logistics, e-commerce, trading activity, and the main issue in the decentralized shared storage is related healthcare, to name a few. Blockchain in the healthcare area is to the privacy of users [8]. An observer may be able to see the growing rapidly [2] as a future trend of substantial impact [3]. contents of online activities or metadata of the user since the It aims at improving the quality of service and maintaining data is publicly available. the integrity of information. Blockchain must be mature enough in all aspects, especially in security matters before To deal with the issues, we propose a model of the blockchain being applied to a sensitive system (healthcare decentralized shared storage system on the blockchain that domain) [4] since it consists of valuable data for patients, provides the privacy of the users without the involvement of providers, and all parties involved. third parties. Ring signature algorithm is applied to disguise Te shared storage between healthcare providers and the original identity of the signer. Te parties involved use the patients is one of the factors that must be considered signatures on behalf of a group; hence, the original identity when determining the scheme of the decentralized healthcare of the signer is unknown called signer ambiguous. In order 2 Wireless Communications and Mobile Computing

Loop counter = i rt Generate Proof

Generate new C ’  at i Output challenge C POST POST

Repeat t times

Parameter Data Flow

Function Hash

Figure 1: Te iterative proof of , adapted from [9]. to keep the identity of the parties to remain untraceable, one- originates from the concept signature group proposed by time use address (from the stealth address) is adapted so Chaum et al. [14] which in the group signature each member that the observer cannot link the user address based on the agrees to sign the message. In short, the data is signed transaction that has been carried out in advance. on behalf of the group. In the group signature, there is a Te predecessor approaches to design the storage manager who organizes each activity in a group. As opposed system in the blockchain have been started by researchers to the group signature, the ring algorithm does not possess a lately such as Storj [10], storage with fnancial incentives, manager in the process and neither have special requirements and Filecoin (see Figure 1) which generates a proof-of- for creating groups as shown in Figure 2. spacetime (PoST) for the replica [11]. Tis paper presents In order to form a signature group, the signer requires the key concepts in the decentralized sharing storage in the public keys �� knowledge from prospective members. Te healthcare system. Te personal health information of the selected public keys are encrypted by using a trapdoor patient is propagated in the peer-to-peer blockchain network permutation function (RSA, Rabin, and Dife-Hellman). and the data are stored in a storage provider. Te model of Due to the nature of the ring signature protocol, there are decentralized healthcare system comes from our previous no specifc rules for the number of members in a group. research [12]. Te privacy-preserving for the user is beyond Te standard procedure of the ring signature protocol can be the topic at the time. Tis paper is ongoing research and defned as follows: interrelated with our previous research. (i) ���� �(���, ���,��1,��2,��3,...,���).Tesignature Te structure of the paper is organized as follows. consists of the public keys (���,��1,��2,��3,...,���) Section 2 describes the background and core system compo- of the members for every message ��� concatenated nent such as the ring signature, , and one-time with the secret key ��� of the signer to produce a use address (stealth address). Section 3 presents the system signature �. model of decentralized PHI data as well as the concept of ring Verify(���,�) confdential transaction. Section 4 presents the system anal- (ii) . Te verifcation process can be inter- � ysis including the dilemma of reparameterizing propagation preted as accepting a group signature which consists time and block size in order to improve the performance in of public keys of all the possible signers along with the ��� ���� ����� a transaction. Te limitations and future work are written in message . Te fnal output is or . Section 5. Finally, Section 6 concludes the paper. Generating a ring signature can be used directly by the signer without involving the group manager. Te initial ��� 2. Background step is the signer computes the symmetric key � as the hash value of the message ��� to be signed as ���� = In this section, we briefy present the essential information ℎ(���). Te more complicated variant generates ���� as of ring signature algorithm, CryptoNote protocol, one-time ℎ(���, ��1,��2,��3,...,��).However,thesimplercreation use transaction address, and stealth address which are basic is also secure. An initial random value �V (or “glue”) is � components for the privacy-preserving model in our system. chosen by the signer uniformly at random form {0, 1} , � where 2 is some power of two which is larger than all � 2.1. Te Essential of Ring Signature. Ring signature is frst modulo �� �. Furthermore, the signer selects the number introduced by Rivest et al. [13] in 2001 through the paper of signatures �� from the ring members 1<�< entitled “How to Leak a Secret”. Te idea of a ring signature �, � =�̸ ,where� is the ring members and � is the order Wireless Communications and Mobile Computing 3

Z = V ( ) y1 =g1 x1

Ek

yr =gr(xr)

Ek

d ( ) y2 =g2 x2

Ek Ek

( ) y3 =g3 x3 Figure 2: Ring signature algorithm which is defned by any member of a group of parties each having keys.

Transaction

Tx public key R=rG r Tx output Sender’s random data Te Amount (A, B) Receiver’s Destination Key P=HS (rA)G+B Random data

Figure 3: Te structure of CryptoNote standard transaction. of the member (�-th member) who is the actual signer. indicate the use of the same key image. Te destination of Hence, the signature gets a new value which is signifed each CryptoNote output is a public key (��1,��2,��3,...,���) by �� =�(��). Finally, the signature of the message ��� which is derived from the recipient's address combined with can be defned as (��1,��2,��3,...,���;�V;�1,�2,...,��). the sender's random value. In this regard, the sender asks the Te verifcation process is straightforward by describing recipient's public key (�, �) via secure channel and the sender the message received from the sender via secure channel generates the one-time public key �=��(��)� + � as can (��1,��2,��3,...,���;�V;�1,�2,...,��). be seen in Figure 3. Based on the key image that recipient belongs to, the recipient checks every passing transaction � using his/her secret key (�, �) and calculates � =��(��)� + 2.2. CryptoNote Protocol. Te use of the ring signature algo- � � {0, 1}∗ � rithm in blockchain transaction was frst introduced in 2012 ,where � is a cryptographic hash function and is a base point. Finally, the recipient can defne �� = ��� = which is part of the CryptoNote protocol [15] and updated in �� �� =� 2013. Te CryptoNote constructs the ring signature using the and . In addition, the recipient can recover the corresponding one-time secret key �=��(��) + �; � = ��. public key of the random addresses and it provides privacy � for the patient by leveraging the stealth address protocol By signing transaction, the recipient can spend this output [16].Bydoingso,theobserverbelievesthatthesignerhas at any time. Te security from this protocol is an untraceable a secret key that corresponds to the , but transaction for the observer since the incoming message the observer cannot determine the specifc identity of the received by a recipient associated with one-time public keys sender. However, the original ring signature protocol would (unlinkable). allow double-spending attack in the blockchain transaction. Te original ring signature protocol cannot determine the 2.3. One-Time Use Transaction and Stealth Address. We frst origin of the coin due to the fact that there is no marker if present the one-time use transaction in general and we the transaction has been sent to the recipient. As a solution, briefy describe the drawbacks of one-time use address in the CryptoNote provides one-time use ring signatures and key blockchain transaction. To address the problem, we use the image as a marker. stealth address to protect the recipient information in our Te key image acts as a unique marker for every system model. transaction. Te key image gives the information about � the transaction with a particular signature n.Ifthesame 2.3.1. One-Time Use Transaction. Using the same address for signature is used more than once, the miners will reject the each transaction on the blockchain network allows observers transaction. In other words, any attempt to double-spend will to track transactions to the original sender even though the 4 Wireless Communications and Mobile Computing

Alice sends 1 BTCon Charlie receives 1 One time use 24/03/'18 at 3:02:57 BTCon 24/03/'18 at address x pm to x 3:55:03 pm from x

Figure 4: One-time use transaction. address is in the pseudonymous form. Since the new address combined with various other techniques such as zk- does not have a track record in the blockchain metadata, the SNARKs in order to provide the stronger privacy for the user observer is unlikely to track information from a transaction. in the decentralized blockchain network. Te cryptographic However, one-time use transaction address provides privacy approaches can prevent ad hoc networks against external for the users by extending the address to the counter-parties attackers using node authentication and data encryption which detail information of the transaction still publicly [19]. available on the blockchain network. Roughly speaking, the sender enables the recipient to spend the funds that he just 3. Our System Model received even though the recipient identity remains hidden as can be seen in Figure 4. In this section, we elaborate the model of decentralized PHI For instance, an address X is a one-time address, but the data (the model is from our prior work), the sequences of the observer has knowledge that Alice sent 1BTC to X and Charlie standard blockchain transaction, the group of ring signature, received from X. It can be used by the observer to infer and the ring confdential transaction of PHI. that X corresponds to Alice and Charlie. By gathering the detailsonwhereCharlie'sfundoriginatedandwhereAlice's 3.1. Decentralized PHI Data. A decentralized personal health funds were passed on to, it could avail deanonymize the one- information model originally came from our previous time use transaction. Tis scheme is also called transaction research [12]. In our predecessor work, blockchain tech- graph analysis. Terefore, in order to develop privacy on the nology is used to manage the personal health information decentralized blockchain system, a new protocol is necessary (PHI) data of the patient which obtain from several health- such as stealth address protocol. In other words, it is a security care providers as can be seen in Figure 6. Based on the protocol that has become ubiquitous almost by stealth [17]. decentralized PHI model, we conducted testing in order to fnd out information about data communication in peer-to- 2.3.2. Stealth Address. One-time use address for transaction peer networks (see Figure 7). Te results obtained show the might be inconvenient to manage since the new address must high level of success in sending data among the parties in be generated by the recipient for every transaction that will the blockchain network that reaches 100% and the average be carried out. Unlike the one-time use payment address, the of propagation message is 1:18ms for 100 bytes of Internet stealth addresses get rid of this requirement. In short, stealth Control Message Protocol Echo. By leveraging the model addressescanbegeneratedasfollows: of our previous research, the patients and the healthcare (i) Te recipient generates a parent key pair ���(�, �) providers allow to collect efectively the PHI data onto and publishes his/her public key (the published key is a single view with integrity guarantee. Data integrity is called stealth address). essential for the patient in the blockchain network since it is a fundamental component of information security which (ii) Te sender will be able to use the stealth address of verifes that the data has remained unaltered in transit from the recipient to commit a new one-time use payment creation and reception [20]. By design, blockchain is tamper- address for a particular transaction. proof, immutability (the data stored are unchanging over (iii) At this stage, the recipient uses their parent private time or unable to be changed) so it is suitable for managing key ��(�, �) which is generated in advance to spend sensitive data such as personal health information, digital the funds received. Te generating process of stealth medical record, and other similar data. address can be seen in Figure 5. In the decentralized healthcare system model, the patient and the providers are on the same blockchain network. Te Te addresses are generated using trapdoor permuta- healthcare providers preserve a diagnosis from the patient tion such as Dife-Hellman key exchange protocol [18]. By and then store it into sharing storage right afer the data is leveraging the stealth address protocol, the system allows confrmed by the miner. Te patient aferwards enables to eliminating the possibility for observers to link a one- fnd the data stored by the provider in the storage by searching time address to another. In this sense, the observer cannot one by one based on the patient's public key ���(�, �) determine the recipient's address and is unlikely to link the attached into the transaction. Te patient whose public key transaction to the recipient due to the new address being is attached to the transaction is the only party who knows the generated based on his/her stealth address for every trans- data stored in the storage because he/she has knowledge of action. Stealth address protocol is used in CryptoNote and the secret key ��(�, �) to access the PHI data. Wireless Communications and Mobile Computing 5

Recipient generates a Parent Key Pair One-time Add Parent Private Key + Ephemeral Key = Secret Key

Parent Public Key (“Stealth Address”)

Sender generates an Ephemeral Key One-time Addr. Ephemeral Key + Ephemeral Key = Secret Key

Figure 5: Generating process of stealth address.

Patient

HospitalHospital LaboratoriesLaboratories

Health and SocialSocial ChiropractorChiropractor care professionalprofessional

ClinicalClinical PsychologistPsychologist NationalNational Provider

Figure 6: Te model of decentralized personal health information data.

As with the security model in general, every party in beforehand. Te parent keys of the parties can be defned as the system has the unique parent keys. Te public key follows: is used to commit transactions, which later will be used to generate a stealth address for every transaction. A pair (i) Te patient (����,���), the pair of patient's public of keys is obtained from trapdoor permutation functions key to commit the transaction can be defned as such as RSA algorithm and Rabin [21] which are generated (��,��) and the pair of secret keys (��,��). 6 Wireless Communications and Mobile Computing

Figure 7: Propagating the personal health information data to the entire nodes.

(ii) Hospital (����,���), the pair of hospital's public key patient’s data in the storage where the patient has published (��,��) and the pair of secret keys(��,��). his address to the provider beforehand. Te provider unpacks the address and gets the patient's public key (��,��).Te (iii) Chiropractor (����,���),Chiropractor'spublickey provider picks a random �∈[1,�−1]:where � is a prime (��,��) and the pair of secret keys(��,��). order of the base point �. Te provider then generates a one- (��� ,�� ) (iv) Clinical psychologist � � , clinical psycholo- time public key based on the pair public key (1) of the patient (� ,� ) gist's public key � � and the pair of secret keys (the process is signifed by Figure 3). (� ,� ) � � . Te patient and the healthcare providers possess a pair of (v) National provider (����,���),thepublickeyof public keys (��,��) for diferent purposes. Te frst public national provider (��,��) and the pair of secret keys key �� is used to generate a one-time public key, whilst (��,��). the public key �� is attached to the transaction as the tracking value used by the patient to fnd the data addressed (vi) Health and social care provider (����,���),health to him/her. Te provider uses � as a destination key for and social provider's public key (��,��) and the pair the output and attaches the new value �=��into the of secret keys (��,��). transaction. Te PHI data with attachments to � and � values (��� ,�� ) (vii) Laboratories � � , the pair of public keys of are stored into shared storage afer being validated by the (� ,� ) laboratories can be defned � � and the pair of miner. Te patient later checks every transaction using his secret keys (��,��). � private key (��,��) and calculates the new value � (2). Finally, � �� �=� (�� ) �+� the patient compares the value received with the value � � � (1) decrypted (3). � � =�� (���) �+��,�ℎ�� (2) 3.2. Te Group of Ring Signature. In the decentralized PHI ���=���� = ���; system, there is a group ring signature consisting of patient

� (3) and several healthcare providers. In order to generate a group, � =� the system does not demand special requirements and also unlikelyrequireamanagertomanagethegroup.Allthatis Te sequence of a standard transaction in decentralized needed to create a ring signature group is the public key of PHI data starts from the provider who wants to store the each party (����,����,����,����,����,����,and����). Wireless Communications and Mobile Computing 7

S =A(R) S =A(R) S =A(R)

S =A(R) S =A(R) S =A(R)

SH+1 =AH+1 S =A(R) ..... (RH+1)

Figure 8: Te group of ring signature which is derived from the public keys of the member.

Once a ring signature group has been generated, all members By design, the sender signs the message for the individual ofthegroupareallowedtousethesignatureandcombineit transaction using the signature of the group without a with the private key of the sender. It grants users fne-grained single group manager involved. Because the sender uses the control over the level of anonymity associated with a certain signature of the group, the observer cannot judge the identity signature [22]. of the real sender for the corresponding transaction. Te Te signer enables to choose the number of signatures sender enables to choose the number of signatures that he/she that they want to use in the transaction to provide an wants to use in the transaction. Te signature of a group ambiguous signer. Later, the public key is used for encryption can be used at any time in a transaction without having �� =��(��) as shown in Figure 8. Intuitively, �� is defned permission from the owner of the key. For instance, in a by the extended trapdoor permutation function such as RSA particular transaction the provider �� uses the following keys and Rabin algorithm. In practice, for Rabin's functions �� (��) to sign a message ℎ(���, ��,��,�� and his key ��). 2 � � extends to ��(��)=�� mod �� over {0, 1} :where2 is the � power of two which is larger than all modulo �� �.Tebucket 3.3. Ring Confdential Transaction of PHI. Ring confdential �=�� +� � ∈ ∗ consists of b-bit numbers � � �:where � Z �� transaction (RingCT) is used in [16] � and (�� +1)�� ≤2. For any b-bit numbers � is nonnegative in order to improve the privacy of the users. Intuitively, the integers �� and ��. Te bucket values are mapped by the ring confdential transaction aims to hide the value of the extended Rabin mapping ��.So,thevalueof��(�) is signifed actual amount in a transaction by combining the value of by (4). the current transaction with the value of the predecessor transactions. By doing so, the observer cannot tell the exact � {���� +���� if (�� +1)�� ≤2 value of a transaction. Te value of the transaction can �� (�) = { (4) be the number of coins sent or other data depending on {� else the type of system that applies RingCT [23]. Unlike in the Monero transaction, the value transaction in the Bitcoin is � Intuitively, ��(�) is a permutation function over {0, 1} which publicly available in the plaintext. Te observer might be is also a one-way trapdoor function because only a person able to analyse the transaction values for certain purposes in knows the inverted value of �� for a given input. Terefore, the particular period of time. Terefore, public data will be we can defne the public keys for the prospective members as dangerous if misused maliciously. follows: ��� = ���� [(8) ‖ (2) ‖ (5) ‖ (11) ‖ (12) ‖ (15)] (6) (i) Te patient ←� � � =��(��).

(ii) Hospital ←� � � =��(��):Chiropractor←� � � = Suppose all outputs in Figure 9 exist, whilst the transac- ��(��); whilst, the clinical psychologist ←� � � = tion that provider enables to spend is highlighted in green. ��(��). Whenever the provider creates a transaction, he uses a ←� � =�(� ) RingCT to disguise which input is actually being spent. In (iii) National provider � � � :healthandsocial this regard, the provider combines the current transaction care provider ←� � � =��(��); Laboratories ←� � � = � (� ) with the previous transactions (highlighted in gray). Te � � . confdential ring signature for the transaction is shown in (iv) For additional members of a group, it can be gener- (6). Te provider allows using the RingCT directly without ated at any time as long as the public key of the new permission and node manager involvement. As well as the member is known �{�+1} =�{�+1}(�{�+1}); group signature, the sender is free to use the value of previous transactions in order to disguise the value of the current � =� ⊕� ⊕� ⊕� ⊕� ⊕� ⊕� ⊕...⊕� �� � � � � � � � {�+1} (5) transaction. By leveraging the model, it is possible to create privacy-preserving for users in the decentralized peer-to- As can be seen in (5), a group of ring signature is peer shared storage in healthcare area with the following main constructed based on encryption of the member's public key. objectives: 8 Wireless Communications and Mobile Computing

1. txid (dfa89ad90...) 7. txid (twa845d90...) 13. txid sywi22zmv...)

2. txid (uio38939a...) 8. txid (pklsdf8900...) 14. txid (p4z24f3g...)

3. txid (890dfa8df...) 9. txid (78sd9ad91...) 15. txid (tpa1vh9wz..)

4. txid (dfa89add0...) 10. txid (dfa89ad90...) 16. txid (wj023kt3w...)

5. txid (fga89ad90...) 11. txid (sd71vcjjp...) 17. txid (vq6oxop4...)

6. txid (5fa89ad45...) 12. txid (dzzqemfsa..) 18. txid (z9i5mcsdu..)

Figure 9: Ring confdential transaction based on the prior transactions.

(i) Untraceability with the goal to protect the sender's afer successfully adding the new block as with the blockchain information in the form of the address (public system in general. Algorithm 1 is a description of the whole key). Te observer cannot trace where the coin was system process starting with making public keys through received and where the coin originated from. to data stored in shared storage. Te process sequence for privacy-preserving shared storage in untrusted blockchain (ii) Unlinkability to preserve the recipient's identity. Tis P2Pnetworksasfollows: can be achieved by using a stealth address where the recipient makes two public keys that are used to create (i) Te party possesses a pair of parent keys (������, a one-time address and the other as the view key. �����) that have been generated beforehand based (iii) Confdential Values to disguise the value of a trans- on trapdoor permutation function such as RSA, action. Te value of the current transaction is com- Rabin, or Dife-Hellman algorithm. Te patient is the bined with the values of the transactions that have recipient(����,���), whilst the hospital is the sender been carried out previously so it becomes an obscured (����,���) in this case study. transaction. (ii) Te hospital is the sender of the patient's diagnosis data (in this case). Te hospital constructs a ring 4. Systems Analysis and the Dilemma signature group based on public key from providers and patient that has been known previously. Te Te main protocols for building privacy-preserving for hospital also added its public key to the group. blockchain shared storage model have been discussed in previous section. In this section, we present the relation � =� (� )⊕� (� )⊕� (� )⊕� (� ) between each protocol. Te combination of the protocols ��� � � � � � � � � provides a system that enables protecting the privacy of ⊕� (� )⊕� (� )⊕� (� )⊕... (7) users in the decentralized shared storage. Te procedures � � � � � � are displayed gingerly, starting from generating the members ⊕� (� ) of ring signature through to mechanism of storing data in �+1 �+1 the blockchain shared storage. At the end of this section, we elaborate on some dilemmas. (iii) When the group of ring signature ���� is generated, We demonstrate a case study in which one of the the sender enables to use the signature of each group providers (hospital) wants to store diagnostic data of the member without having to get permission from the patient into the decentralized shared storage. Te provider owner of public keys. In this case study, the hospital has known the address (public key) of the patient beforehand. chooses to use all signatures from group members as In this sense, the hospital acts as the sender whilst the patient shown in (7). Te hospital can add new members at acts as the recipient of the PHI message. Te provider attaches any time as long as the public key is known. the address of the patient in the form of a view key for each transaction so that only patients likely enable to track data (iv) Te patient as the recipient later makes the stealth stored in shared storage using his/her knowledge. Te miners address based on a pair of parent keys. Te patient have a duty to validate the data from the sender before being creates a pair of public keys ����(��,��) and sends saved into shared storage, yet they are responsible for adding it to the hospital via secure channel. Te hospital blocks to the blockchain network. Te miner gets a reward generates a new one-time address based on stealth Wireless Communications and Mobile Computing 9

1: Procedure �ℎ���� �������: 2: Trapdoor Function: (parent keys) ∗trapdoor permutation function e.g. RSA, Rabin 3: ������� ←� ℎ��ℎ(����,���) 4: �������� ←� ℎ��ℎ(����,���) ∗∀party has parent keys 5: �ℎ���������� ←� ℎ��ℎ(����,���) 6: ���.����ℎ������� ←� ℎ��ℎ(����,���) 7: �����������V���� ←� ℎ��ℎ(����,���) 8: ������ ���� ←�ℎ��ℎ(����,���) 9: ������������ ←�ℎ��ℎ(����,���) 10: Te sender creates a group of ring signature: 11: Procedure ��� ������ ��� ∀ ������� ∈ ���������� ∗use public key of the parties as an input 12: Create RS: 13: ���� ←� � �(��)⊕��(��)⊕��(��)⊕...⊕��+1(��+1) 14: AddNewMembers: ∗in case: add new members 15: ������ ���� ←� � � � ��������� ⊕ ������(�+1) 16: end procedure 17: Procedure ������ ������ℎ����(�����������): ∗the recipient creates stealth address 18: ���������: ������� �→ � � � � �� �(��,��) 19: ����(��)�→�������� (V�� �������ℎ�����) 20: (��,��)�����V�� �→ ������ ��� ��� ∗sender creates new OTP for recipient 21: end procedure 22: Procedure ������������ ���� ���������: ∗confdential ring signature as an option 23: �� ∀��� �� ��� ��� �ℎ�� ������ True 24: �������� ←� ������������ ⊕ ��� V.����� 25: ������� �� �ℎ� ����� ←� �������� 26: ���������� False 27: end procedure 28: Procedure ���� �ℎ� ��� ����(���): ∗in this case, the PHI data is from the hospital 29: ��� ��� ∈ ��������� ����(��������) 30: ���� � ←� �ℎ���� ������ ∈ ���� 31: ������ ←� �ℎ���� ���V ��� ∈ ��� � � 32: ��� ←� �����ℎ�� �������� ∗KeyImage: to prevent storing the same data 33: ������� ←� ���� � ‖ ������ ‖ �������� 34: end procedure 35: Procedure ���� �� ������ℎ��� �������: 36: ��� ������� ⊕ ���(���� ������ℎ����) 37: ���� �� ������ℎ��� ������� � 38: ���� ����� ������������� 39: �� ������� �� V���� �ℎ�� ������ True ∗transaction is success 40: ��� �������� �� �� ������� 41: ���������� �→ �ℎ���� ������� 42: ���������� False 43: End

Algorithm 1: Shared Storage of PHI Data.

address received. Only patients can access data stored occurred before such as ���� = ����(3) ‖ ����(7) ‖ in shared storage by using his knowledge ���(��,��). ����(5) ‖ ����(9) ‖ ����(�). ��� ��� = � (�� ) �+� (vii) Te hospital signs the diagnosis data using � � � (8) the member key from the ring signature as follows: ���� �(���, ��,��,��,��,��,��,��,...,��) (v) Te sender unpacks the address received which con- which consists of the public keys of the members tains the public address of the recipient ����(��,��). ����,����,����,����,����,����,and����. Te sender picks the random value �∈[1,�−1]to � (viii) Key image is attached by the sender to prevent double generate one-time public key and attaches � as a view spending. It can be interpreted as a scheme to prevent address in the shared storage. the same data stored twice in blockchain storage. (vi) For certain types of data, the sender can use confden- Te hospital sends the following series of data to tial ring signatures (RingCT) to disguise information the blockchain network to be confrmed by miners: of the data by adding value to transactions that have ����� ‖ ��� ‖ ��� ‖ ������ ‖ ��������. 10 Wireless Communications and Mobile Computing

(ix) Te diagnosis data from the hospital along with of a block also afects the length of confrmation time. When attachments of the fles are then sent to the blockchain a node receives a new transaction, the recipient confrms the network to be verifed by the miners before the validity of the block before accepting it. Te duration of the data are stored in a decentralized shared storage. confrmation process depends on the size of the block. By Whenever the data has been stored successfully, the design, the larger the size of a block is, the longer it takes patient traces one by one the transaction on the to confrm. Terefore, the size of a block plays an important blockchain network until the patient fnds his/her role in the blockchain in general because it directly afects the view keys �� which correspond to the patient. delay time. (x) Te patient with his/her knowledge decrypts and Propagation delay is inseparable from the size of a block. compares the value of the data received with the Tere is a correlation between block size and the propagation � delay until the node receives the block. Te larger the size of a decrypted value ���=���� = ���;� =�. block, the more transactions that can be done, yet it afects the Te Algorithm 1 presents an overall model of the system propagation time and sacrifce the security. Te infuence of which starts with generating key pairs for the parties, creating a block size to the propagation time can be seen in Figure 10 a ring signature group and stealth addresses, signing PHI [25]. Te block size in the transaction is varied up to 350 KB data until data are confrmed and stored in the decentralized in order to fnd out how long it takes for the node to receive shared storage. Te observer is unlikely to track the sender’s the block. To reach 25% of total transaction is visualized with transaction since the sender uses a signature on behalf of a red line, 50% for the green line, and 75% for the blue line. a group in which the sender's signature is obscured. Fur- Te results are in line with the theory which states the larger thermore, the observer cannot associate one transaction with the size of a block, the greater the propagation time. another, so it is indistinguishable whether the transaction We take measurements for orphaned blocks by following was sent by the same sender. In other words, the identity the withholding attack (selfsh mining) strategy that was frst of the sender remains a secret. Likewise, the identity of the discovered in 2014. In this study we do not elaborate on recipient cannot be tracked by the observer nor malicious the details of withholding attack strategy, we recommend providers because the recipient sends the stealth address to that readers refer to the references [26–28]. Te intuition the sender; thereafer, the sender creates a one-time address of this attack is to keep the fnding block secret until to protect the privacy of the recipient. Te value of a PHI data the attacker's network becomes the longest chain in the is kept from being seen because it is combined with the value blockchain network. For a particular condition, this attack of previous transactions through the RingCT.Teidentity does not possess any beneft because the block is stored in of the user along with the information of the transactions the attacker's network which is only known by the attacker on the blockchain network is paramount; therefore, the (nonproft). Te attackers get the reward if only the block decentralized shared storage systems need to manage the foundispropagatedtothepublicandgetconfrmedbyother confdentiality as well as ensure the integrity of the PHI data. miners. Apart from the model that has been described, there is Te simulation is carried out in order to know the another important factor which plays the role to a decen- performance of dishonest miners which follow the selfsh tralized system called transaction propagation. A numer- mining protocol. Te aims of this strategy are to discover ous number of transactions are distributed to the entire the new block till becoming the longest chain and gaining nodes in the blockchain peer-to-peer network, resulting in the revenue afer publishing the block to the public network propagation delay. Te structure of a P2P network allows [29]. In our setting, we arrange the selfsh miners to compete the peer to disseminate information to the other nodes with honest miners in 14 days to solve the proof-of-work that are connected to the sender [24]. Tere are several and discover the new block. Te maximum of mining power studies conducted by researchers to measure how efective of selfsh miner is 0.4 and it is running randomly from the block distribution in the peer-to-peer networks such as 0.0 through 0.4 within 14 days in the simulation as shown experiments in which the goal is to fnd out the number in Figure 11. Tere are 12 nodes of dishonest miners with of successful connections and experiments to determine the diferent mining power from 0.0 to 0.4. We set the maximum efect of the number of nodes against the block. number of mining power at 0.4 of total mining power in One among parameters that afect transaction propa- the network. Based on the simulation result, we conclude gation is block size. Block size can be understood as the that whenever the dishonest miner has 0.322 mining power, maximum limit of a block to be flled up with various it is enough to get the unfair revenue and allows gaining transactions on it. It also can be thought of as a bundle the revenue larger than it should be. In general, there are of transactions, with each block needing to get verifcation 51 new blocks successfully added as well as 4 orphaned before it can be accepted by the network. Each block has blocks recorded. Te average of block generation time is 7.72 its own size depending on the type of transaction called the minutes. block size. Te maximum block size in Bitcoin stands at 1MB. To the extent, we obtained the information related to the Miners enable to choose the number of transactions to be parameters that afect the propagation time in the blockchain processed in a block. If Bitcoin miners commit a transaction peer-to-peer network based on our fndings and the prior that exceeds the maximum limit, then the block will be works of literature. Tere are numerous articles proposing the rejected by the network. Te motivation of block size is to improvement of propagation delays by changing the network prevent the attack such as denial-of-service attacks. Te size topology, minimizing the verifcation time of a transaction, Wireless Communications and Mobile Computing 11

50

45

40

35

30

25 Time (sec) 20

15

10

5

0 0 50 100 150 200 250 300 350 Block size (KB) 25%

50%

75%

Figure 10: Relationship between block size and the time.

50 0.8 40

0.6 30

Reward 0.4

Block Height Block 20

10 0.2

0 0.0

0 200000 400000 600000 800000 1000000 1200000 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Time (s) Mining Power

Figure 11: Block height against time in the P2P network (lef); the performance of selfsh mining attack (right). and reconstructing the message exchange protocol in the will be a lot of orphaned blocks. It can be understood blockchain network. Generally speaking, the presence of because each node will receive many new blocks that having the propagation of delays can cause a lot of damage are spread through peer-to-peer networks. Te tie- in the blockchain network [30, 31]. breaking protocol causes each node to only accept Tere are many considerations to increase the efective- one valid block for the same type of transaction, so ness of the blockchain. Terefore, we select block propagation that it will reject transactions from other nodes that and block size parameters to be discussed as follows: cause the orphaned block to appear. Due to many new orphaned blocks that will emerge, it will motivate (i) Speeding up the block generation.Intheory,itwouldbe rational miners to adopt the selfsh mining strategy remarkably benefcial if the block generation time is that rivals the main network and causes the forked resetting as fast as possible for each transaction so that chain [32]. each user will get faster payments. Te propagation time includes the length of time for the distribution (ii) Decrease block generation. Slowing down the block of transactions in the peer-to-peer network and the generation time for each transaction will reduce the time for verifcation of a block. But the problem is speed of transactions on the peer-to-peer network. that if the block generation time is speeding up, there Positively, it gives some merits such as providing a 12 Wireless Communications and Mobile Computing

better security system. Roughly this happens due to 6. Conclusions a decreasing number of orphaned blocks that will eliminate the selfsh mining attack. Te model of privacy in the blockchain peer-to-peer shared storage has been fully presented. Te idea of ring signature (iii) Increasing the block size. Te bigger the capacity of a combined with several protocols provides the solutions for block, the more the transactions that can be flled up privacy issues on the blockchain transaction. Te identity into the block. It will slow the propagation of every of the sender and the recipient remains hidden, unlinkable, transaction to all nodes in the peer-to-peer network. and untraceable from the observer. Te key image is attached Te slow propagation time will cause new problems, to prevent the same data from being stored multiple times, namely, double-spending attack. Attacker could use and it can be used as well to prevent double spending and the same coin for two or more diferent transactions. data duplication. Based on our fndings and information Slower propagation of blocks on the peer-to-peer from several literature reviews, it can be said that increasing network resulted in the fact that the block cannot be the performance of the decentralized blockchain requires a fully accepted by the nodes. As a result, when the very deep analysis since it is directly related to the security. transaction is received, the block will confrm that Tere are advantages and drawbacks for each decision taken. the block is valid even though the transaction has For future work, a scheme that provides incentives needs to been used and confrmed by other nodes in the same be developed as a motivation to maintain the decentralized network. system.

(iv) Reducing the capacity of block size. Because of only a few transactions that occur within a block, it will Data Availability speed up the propagation time for every transaction. No data were used to support this study. It causes many orphaned blocks to emerge and allow for the occurrence of selfsh mining attacks that harm the system. Yet, this decision gives the advantages Conflicts of Interest such as fast payments and fast transaction. However, it should ensure the immutability of the block and Te authors declare that there are no conficts of interest transaction [33]. regarding the publication of this paper. Acknowledgments 5. Limited and Future Work Tis research was supported by the MSIT (Ministry of We assume that shared storage is interconnected to a Science and ICT), Korea, under the ITRC (Information blockchain network where miners have access into it. In Technology Research Center) support program (IITP-2019- terms of the type of shared storage, we do not defne it 0-00403) supervised by the IITP (Institute for Information in detail; instead we assume the shared storage has all the & Communications Technology Planning & Evaluation) and capacity needed to support the proposed system. One of partially supported by the National Research Foundation of the drawbacks of our system is related to the time needed Korea (NRF) grant funded by the Korea government (MSIT) by the recipient to fnd data in shared storage, because the (No. NRF-2018R1D1A1B07048944). recipient must seek for data one by one based on the key view, so that the patient observes each transaction in the blockchain. In the long run, this might become an obstacle References if the blockchain network is expanding. Tere will be many [1] G. Karame and E. Androulaki, Bitcoin and Blockchain Security, transactions occurring so that it will be difcult for the Artech House Information Security and Privacy Series, 2016. recipient to monitor every transaction which belongs to him/her. [2]R.J.Krawiec,D.Housman,M.Whiteetal.,“Blockchain: Opportunities for health care,” in Proceedings of the NIST Forfuturework,themodelandcapacityofsharedstorage Workshop Blockchain Healthcare,pp.1–16,2016. need to be observed further. Te access control in the shared [3] D. E. O’Leary, “Confguring blockchain architectures for trans- storage is also essential to ensure that system keeps safe. action information in blockchain consortiums: Te case of Incentive mechanisms also need to be considered for the accounting and supply chain systems,” Intelligent Systems in storage providers. It aims to motivate providers to contribute Accounting, Finance and Management,vol.24,no.4,pp.138– to protecting the privacy of the users as suggested by [34–36]. 147, 2017. Furthermore, it is important to carefully consider the type [4] D. Yang, J. Gavigan, and Z. W.Hearn, “Survey of Confdentiality of block to be used including parameters directly involved and Privacy Preserving Technologies for ,” R3, pp. in the system such as block size and type of consensus, to 1–32, 2016. name a few. Additionally, the consensus selection mecha- [5] H. Kopp, D. M¨odinger,F.Hauck,F.Kargl,andC.B¨osch, “Design nism is paramount in the blockchain. For instance, a new of a privacy-preserving decentralized fle storage with fnancial method by expanding the Byzantine consensus via hardware- incentives,”in Proceedings of the 2nd IEEE European Symposium assisted secret sharing can solve the scalability problem in the on Security and Privacy Workshops, EuroS and PW 2017,pp.14– blockchain [37]. 22, April 2017. Wireless Communications and Mobile Computing 13

[6]T.W.Clarke,I.Sandberg,O.Wiley,andB.Hong,“Adistributed [25] Y. Sompolinsky and A. Zohar, “Accelerating trans- anonymous information storage and retrieval system,” Journal action processing. fast money grows on trees, not chains,” of Chemical Information and Modeling,vol.53,no.9,pp.1689– Eprint.Iacr.Org,2014. 1699, 2001. [26] I. Eyal and E. G. Sirer, “Majority is not enough: Bitcoin mining [7] E. Heilman, L. AlShenibr, F. Baldimtsi, A. Scafuro, and S. is vulnerable,” Lecture Notes in Computer Science (including Goldberg, “TumbleBit: an untrusted bitcoin-compatible anony- subseries Lecture Notes in Artifcial Intelligence and Lecture Notes mous payment hub,” in Proceedings of the 2017 Network and in Bioinformatics): Preface,vol.8437,pp.436–454,2014. Distributed System Security Symposium,2017. [27] A. Sapirshtein, Y. Sompolinsky, and A. Zohar, “Optimal selfsh [8]S.Wu,Y.Chen,Q.Wang,M.Li,C.Wang,andX.Luo,“CReam: mining strategies in bitcoin,” in Financial Cryptography and a enabled collusion-resistant e-auction,” IEEE Data Security,vol.9603ofLecture Notes in Computer Science, Transactions on Information Forensics,2018. pp.515–532,Springer,Berlin,Germany,2017. [9]J.BenetandN.Greco,“Filecoin:ADecentralizedStorage [28] E. Heilman, “One weird trick to stop selfsh miners: fresh Network,” Protoc. Labs,2018. bitcoins, a solution for the honest miner (poster abstract),” in [10] S. Wilkinson, T. Boshevski, J. Brandof, and V. Buterin, “Storj a Lecture Notes in Computer Science (including subseries Lecture peer-to-peer cloud storage network,” Technical report, storj.io, Notes in Artifcial Intelligence and Lecture Notes in Bioinformat- 2014. ics): Preface,vol.8438,pp.161-162,2014. [11] B. Fisch, “Poreps: Proofs of space on useful data,” Report [29] S. Rahmadika, B. J. Kweka, H. Kim, and K. Rhee, “A scoping 2018/678, Cryptology ePrint Archive, 2018. review in defend against selfsh mining attack in bitcoin,” IT Converg. Pract,vol.6,no.3,pp.18–26,2018. [12] S. Rahmadika and K. Rhee, “Blockchain technology for pro- viding an architecture model of decentralized personal health [30] A. Gervais, H. Ritzdorf, G. O. Karame, and S. Capkun, “Tamper- information,” International Journal of Engineering Business ing with the delivery of blocks and transactions in bitcoin,” in Management,vol.10,pp.1–12,2018. Proceedings of the the 22nd ACM SIGSAC Conference,pp.692– 705, Denver, Colo, USA, October 2015. [13] R. L. Rivest, A. Shamir, and Y. Tauman, “How to leak a secret,” [31] A. Gervais, G. O. Karame, K. W¨ust, V. Glykantzis, H. Ritzdorf, in International Conference on the Teory and Application of ˇ Cryptology and Information Security,vol.2248ofLecture Notes and S. Capkun, “On the security and performance of Proof of in Comput. Sci., pp. 552–565, Springer. Work blockchains,” in Proceedings of the 23rd ACM Conference on Computer and Communications Security, CCS 2016,pp.3–16, [14] D. Chaum, E. van Heyst, and C. Science, “Group Signatures,”in October 2016. Adv. Cryptology—EUROCRYPT’91, vol. iii, pp. 257–265, 1991. [32] D. T. T. Anh, M. Zhang, B. C. Ooi, and G. Chen, “Untangling [15] N. Van Saberhagen, “CryptoNote v 2.0,” Self-published,pp.1–20, blockchain: a data processing view of blockchain systems,” IEEE 2013. Transactions on Knowledge and Data Engineering,2018. [16] S. Noether, A. Mackenzie, and T. M. Research Lab, “Ring [33] D. Mendez Mena, I. Papapanagiotou, and B. Yang, “Internet of confdential transactions,” ,vol.1,pp.1–8,2016. things: survey on security,” Information Security Journal,vol.27, [17] E. Gallery and C. J. Mitchell, “Trusted computing: Security and no. 3, pp. 162–182, 2018. applications,” Cryptologia,vol.33,no.3,pp.217–245,2009. [34] O. Ersoy, Z. Ren, Z. Erkin, and R. L. Lagendijk, “Transac- [18] W. Dife and M. E. Hellman, “New directions in cryptography,” tion propagation on permissionless blockchains: incentive and IEEE Transactions on Information Teory,vol.22,no.6,pp. routing mechanisms,” in Proceedings of the 2018 Crypto Valley 644–654, 1976. Conference on Blockchain Technology (CVCBT),pp.20–30,June [19] A. A. Korba, M. Nafaa, and S. Ghanemi, “Anefcient intrusion 2018. detection and prevention framework for ad hoc networks,” [35]Y.He,H.Li,X.Cheng,Y.Liu,C.Yang,andL.Sun,“A Information and Computer Security,vol.24,no.4,pp.298–325, blockchain based truthful incentive mechanism for distributed 2016. P2P applications,” IEEE Access,vol.6,pp.27324–27335,2018. [20] S. Rahmadika, P. H. Rusmin, H. Hindersah, and K. H. Rhee, [36]J.Wang,M.Li,Y.He,H.Li,K.Xiao,andC.Wang,“A “Providing data integrity for container dwelling time in the sea- blockchain based privacy-preserving incentive mechanism in port,”in Proceedings of the 6th International Annual Engineering crowdsensing applications,” IEEE Access,vol.6,pp.17545– Seminar, InAES 2016, pp. 132–137, Indonesia, August 2016. 17556, 2018. [21] K. Schmidt-Samoa, “A new rabin-type trapdoor permutation [37]J.Liu,W.Li,G.O.Karame,andN.Asokan,“ScalableByzantine equivalent to factoring,” Electronic Notes in Teoretical Com- consensus via hardware-assisted secret sharing,” Institute of puter Science,vol.157,no.3,pp.79–94,2006. Electrical and Electronics Engineers. Transactions on Computers, [22] A. Bender, J. Katz, and R. Morselli, “Ring signatures: stronger vol.68,no.1,pp.139–151,2019. defnitions, and constructions without random oracles,” Journal of Cryptology. Te Journal of the International Association for Cryptologic Research,vol.22,no.1,pp.114–138,2009. [23] K. Lee and A. Miller, “Authenticated data structures for privacy- preserving monero light clients,” in Proceedings of the 3rd IEEE European Symposium on Security and Privacy Workshops, EURO SandPW2018,April2018. [24] R. Schollmeier, “Adefnition of peer-to-peer networking for the classifcation of peer-to-peer architectures and applications,” in Proceedings of the 1st International Conference on Peer-to-Peer Computing, P2P 2001, pp. 101-102, August 2001. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 8913910, 11 pages https://doi.org/10.1155/2019/8913910

Research Article UAV-Undertaker: Securely Verifiable Remote Erasure Scheme with a Countdown-Concept for UAV via Randomized Data Synchronization

Sieun Kim ,1 Taek-Young Youn ,2 Daeseon Choi ,3 and Ki-Woong Park 4

 SysCore Lab., Sejong University, Seoul, Republic of Korea Electronics and Telecommunications Research Institute, Deajeon, Republic of Korea Dept. of Medical Information, Kongju National University, Kongju, Republic of Korea Dept. of Computer and Information Security, Sejong University, Seoul, Republic of Korea

Correspondence should be addressed to Ki-Woong Park; [email protected]

Received 31 January 2019; Revised 28 March 2019; Accepted 22 April 2019; Published 16 May 2019

Academic Editor: Ilsun You

Copyright © 2019 Sieun Kim et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Unmanned aerial vehicles (UAVs) play an increasingly core role in modern warfare, with powerful but tiny embedded computing systems actively applied in the military feld. Confdential data, such as military secrets, may be stored inside military devices such as UAVs, and the capture or loss of such data could cause signifcant damage to national security. Terefore, the development of securely verifable remote erasure techniques for military devices is considered a core technology. In this study, we devised a verifable remote erasure scheme with a countdown-concept using randomized data synchronization to satisfy securely verifable remote erasure technology. Te scheme allows the GCS (Ground Control Station) to remotely erase data stored in the UAV, even on loss of communication, and returns proof of erasure to GCS afer erasure. Our approach classifes the accumulated data stored in the UAV as a new data type and applies the characteristics of that data type to generate the proof of erasure. We select a small- volume data sample (rather than all of the data) and perform prior learning only on that sample; in this way, we can obtain the probative power of the evidence of erasure with a relatively small amount of trafc. When we want to erase data of 100 Mbytes of remote device, 100 Mbytes of data transfer is required for related work, whereas our system has data transfer according to the ratio of amount of randomly selected data. By doing this, communication stability can be acquired even in unstable communication situations where the maximum trafc can change or not be predicted. Furthermore, when the UAV sends the proof of erasure to the GCS, the UAV does its best to perform the erasure operation given its situation.

1. Introduction Erasing data as needed can safeguard confdential infor- mation stored on remote devices. Generally, if a user sends In 2011, the US military lost control of an American Lockheed an erasure request, the remote device receives it and performs Martin RQ-170 Sentinel unmanned aerial vehicle (UAV) the erasure. However, UAVs remotely performing a mission during a mission over Afghanistan. Te UAV was captured by cannot always receive the signal [7–9]. As such, techniques Iranian forces [1], which successfully extracted information, are needed to completely erase the data stored in remote released a video obtained from the captured UAV, and UAVs following a loss. subsequently constructed a copy of the RQ-170, the Saegheh Recognition that sensitive information has been leaked [2]. As demonstrated by this episode, the thef of military (i.e., the erasure process has not been performed) is critical, UAVs causes signifcant damage, including the loss of secret and so confrmation of erasure must be performed in a data, such as collected information [3]. Consequently, it is verifable manner (verifable erasure). Verifying the result of indispensable to assure the security of the UAV [4–6]. the erasure operation as 1-bit response has low reliability [10]. 2 Wireless Communications and Mobile Computing

Unverifiable Erasure Erasure Request receive erase UAV response False Result GCS Attacker (c)

Blocking Erasure Request

Attack Erasure Request

UAV Attacker GCS

Attacker (d)

Performing Erasure Despite Loss of Control and Sending Proof of Erasure GCS GCS

(a) (b) erase UAV Attacker Proof of GCS response Erasure

(e) Figure 1: Tree situations faced by unmanned aerial vehicles (UAVs) afer attack or capture. (a) UAV on a fight mission; (b) UAVs performing a fight mission is captured by an attacker; (c) receiving unverifable result of erasure; (d) blocked erasure requests by attackers; and (e) performing erasure operation and transmitting proof of erasure following capture.

Obvious proof of erasure must be presented, and there must generatesaproofoferasureuntiltheUAViscompletely be no mechanism by which malicious users can generate false stopped; this proof is repeatedly sent to the GCS. Te GCS evidence. checks that the proof has been received and verifes that Figure 1 indicates three situations that UAVsmay be faced remote data are erased. with afer attack or capture. In conventional remote erasure Te remainder of this paper is organized as follows: systems [11, 12], a Ground Control Station (GCS; that is, Section 2 reviews the related works of verifable remote data the control center that provides human control of UAVs) erasure. Section 3 describes the materials and gives details can erase data by sending an erasure signal to a lost UAV. of our mechanism UAV-Undertaker. Section 4 describes two However, when communication is cut, conventional remote experiment results about overhead in terms of communi- erasure is not available. We devised counter-based erasure, cation and computation. Tis paper ends in Section 5 with which can erase data in an uncontrollable environment. Tis conclusion on our work. study is an extension of our previous work [13, 14], in which we focused on the system design of trustworthy remote 2. Related Works erasure for UAVs. However, our objective here was to devise a mechanism to select random seed data using accumulated To resolve problem of erasure of remotely stored data with data and develop a method for trustworthy remote erasure. verifability, the researchers have devised approaches which Te devised scheme has a count value, which decrements generate a provable value of erasure operation. We analyzed value one by one and erases data when the counter value several of them and identifed the limiting factors in applying reaches zero. Te GCS can retain stored data only through a verifable erasure for UAVs. periodically sending the specifc value to the UAV.If the GCS Perito and Tsudik [15] use a PoSE (Proof of Secure does not send the specifc value to the remote UAV and the Erasure) for secure code updates on embedded devices by counter reaches zero, the UAV erases the data. In terms of verifying the erasure of memory. A verifer sends verifer- verifcation, most existing mechanisms provide nothing or selected randomness of the size of memory to a prover, and only a simple response (erase or not erase). Our remote era- the prover’s memory is flled with the received value. Te sure system provides relatively complex but verifable proof prover returns the very same randomness written on the of erasure through randomized data synchronization. Upon memory to the verifer. As both the verifer and the prover aprocessoferasureofdata,theremoteUAVcontinuously send a random value of memory size to each other, this Wireless Communications and Mobile Computing 3

GCS Authenti- Receive If receive cation msg Proof Success Fail

Idle Aer update Update

Performed Aer If receive interval per receiving proof AeAerr proof sendingdi Send msg Verify Msg Proof

Communication Protocol Erasure Protocol (a) (b)

Figure 2: Overall process of the proposed remote erasure system. (a) Unmanned aerial vehicle (UAV)-Undertaker from the Ground Control Station (GCS) viewpoint; (b) UAV-Undertaker from the UAV viewpoint. If the count value becomes zero, the UAV erases data and repeatedly sendstheproofoferasuretotheuser(i.e.,theGCS).

protocol has a huge overhead concerning communication. and/or disconnection. Also, the sensor system in cloud is sim- In our system, as shown in Figure 9 of Section 4, when a ilar to our system in that it collects data from remote devices small amount of data (i.e., � =2) is selected as a seed and and processes them to achieve specifc goals. However, in the transmitted, the instantaneous amount of transferred data is case of the cloud sensor system, the collected information is signifcantly lower than when live tracking all data (i.e., � limitedtothesensordataandismerelyusedforprovidingthe =100). In other words, our system shares the burden of the monitoring to the user. In the case of our system, the collected required data transfer amount to verify the erasure operation. information is the memory information of the remote device. Terefore, our system can signifcantly reduce instantaneous It can be seen that there is a diference from the cloud sensor amount of transferred data which is a limitation of the system in that it verifes whether or not the erasure operation above study. Dziembowski et al. [16] proposed a scheme that is performed by utilizing this. reduces the communication complexity of PoSE. A verifer sends a seed to a prover. Te prover performs a hash function 3. Proposed System Architecture using the seed and retains the sizable result value using all of the prover’s memory. Te calculated hash value can only Our UAV-Undertaker consists of two parts: the communi- be generated once because it uses a secret key stored in cation protocol and verifable erasure protocol. We confrm the memory, and so it can be proof of erasure. However, whetherornotcontroloftheUAVislostbycommunicating while this scheme reduces the communication overhead of with the UAV. Terefore, it is important to verify that a PoSE, a huge calculative overhead arises. Ammar et al. [17] message originates from the stated sender (i.e., authentic introduced SPEED, which guarantees erasure on embedded communication) and has not been changed. Tis authenti- system. Te protocol is implemented in isolated memory cation role is implemented as the using the Trusted Sofware Module. A verifer’s distance is using a hash chain mechanism [19]. We also implement the measured by the Distance Bounding (DB) protocol, which verifable erasure protocol, which generates proof of erasure establishes an upper bound on the physical distance between to verify that the erasure operation was really performed. We a verifer and a prover. Te prover authenticates the verifer classify the data region according to the number of times through the DB protocol and erases data if verifcation is the data will be written in order to increase the probative passed successfully. To verify erasure, the prover sends the power of proof of erasure with just a small amount of data proof as the message authentication code value of the entire (thus, improving efciency). Te operation result value is memory. However, the scheme is limited in terms of distance created by accurately performing the implemented erasure and so is not applicable for UAVs. operation. Terefore, a cryptographic accelerator, such as the Our approach solves verifable erasure through proof of TPM (Trusted Platform Module), must be used in order to erasure [15–18] that only the user can identify, as in the make the results of the computation trustworthy. above studies. However, our approach creates a proof of Figure 2 shows the overall process of our approach from erasure in such a way that the memory blocks each have both the GCS and UAV viewpoints. From the GCS viewpoint, a dependency on the block, which guarantees more robust the GCS sends an authentication request message to the UAV evidence. Furthermore, in contrast to past studies, which did at intervals and then waits for a message from the UAV. If not consider losses in the communication environment, we the GCS receives a message from the UAV, it verifes that aimed to achieve remote erasure of UAVs afer hijack, loss, the message is authentic; if so, it updates the value to be 4 Wireless Communications and Mobile Computing

Table 1: Notations of the entity and messages.

Defnition of the Entity Symbols GCS Ground Control Station UAV Unmanned Aerial Vehicle Denotation msg = �‖� Message contains two contexts (� and �) msg = �‖(�) Message always contains � context, but � context is optionally contained. Defnition of the Message Symbols

��,� A symmetric key shared between � and � �{�, �} Encrypt � with key � � Value to compute with hash function � Number of hash function calculations � Retransmission bit � Authentication count ℎ(�) Result of � in the hash function ����� Generated random data against replay attacks ℎ� Address of hot data modifed afer previous authentication �� Address of accumulated data modifed afer previous authentication ℎV Set containing the addresses of hot data randomly selected and the values of those data �V Set containing the addresses of accumulated data randomly selected and the values of those data ��� Proof of Erasure Defnition of the Greek Symbols � Ratio of amount of randomly selected data in accumulated data region � Ratio of amount of randomly selected data in hot data region � Number of re-authentication permits � Initial value of the counter

sent next. If communication with the UAV is cut of and the calculations (�) and the value to compute with hash function UAV performs the erasure operation and the GCS repeatedly (�) using a symmetric key already shared between the GCS receives the proof of erasure from the UAV and verifes it. and the UAV; it then sends an initialization request message From the UAV viewpoint, afer decrementing the counter by (Message -) containing that encrypted value to the UAV. one, the UAV waits for a message from the GCS (idle state). When the value is received from the GCS, the UAV computes � A UAV that has successfully authenticated a GCS update ℎ (�), encrypts it, and sends an initialization response mes- of the hash value for the next authentication, initializes the sage (Message -) to the GCS. Te GCS confrms that the decreasing count value to � (i.e., the initial value of the UAV received the correct � and �.TeGCSandtheUAV, counter), and waits for the message to be received again who have verifed each other, start exchanging authentication (idle state). If the UAV does not receive a message from the messages using a hash chain. Te GCS sends messages to GCS before the count value reaches a certain value, the UAV the UAV at regular intervals. When the maximum count repeatedly erases stored data and repetitively sends proof of value is � (seconds) and the number of reauthentication erasure through a low frequency. times is �,theinterval is calculated using (1). Te constant “1” in (1) is added because GCS waits for the interval and .. Communication Protocol. To allow the sender of each then sends the next message. Te constant “2”in(1)is message to be authenticated, hash chains and message multiplied to prevent the timer from being initialized during encryption are used in the communication protocol. To the maximum authentication latency that can occur on our encrypt a message, the UAV must the symmetric key system. securely ofine with the GCS before departure. Our approach � � �����V�� = (�������) is designed to execute the erasure protocol afer times of 2(1+�) (1) requests for reauthentication, because the UAV may be con- fronted with a mere transient communication failure, where Afer the UAV receives a message (Message -)contain- � can be set freely according to the circumstances of mission. ing the hash value from the GCS, the UAV calculates the Table 1 defnes the entity symbols and the message symbols. hashvalueofthereceivedvalueandcomparesthevalue Figure 3 shows the message fow when communication with the previously stored value. If two values are equal, between the UAV and the GCS is performed without prob- the UAV initializes the count value to � and updates the lems. First, the GCS encrypts the number of hash function stored value with the hash value received from the GCS. Wireless Communications and Mobile Computing 5

Number of Hash Function Calculations Hash Value GCS UAV Stored in UAV Stored in UAV 1. Initialization Request Msg( 1 1 - ) 0 0 2 ℎn(x) . Calculate , ℎn(x) Initiate Timer n 4. Verify 3. Initialization Response Msg(1-2) … Msg 1-2 … iinterval 5. ith Authentication Response Msg(1 ℎn−i+1(x) -3) n – i + 1

6. Verify Msg 1-3, n−i Initiate Timer, n – i ℎ (x) Update Hash Value of UAV 8. Verify 7.ith Authentication Request Msg(1-4) Msg 1-4 … intervali …

 K ,x‖ n ‖ Nonce  Message 1-1. GCS UAV : E GCS,UAV GCS  K ,ℎn (x)‖Nonce  Message 1-2. UAV GCS : E GCS,UAV GCS K ,ℎn−i (x)‖r‖ Nonce  Message 1-3. GCS UAV : E GCS,UAV GCS  K ,n−i‖ Nonce ‖ ℎl ‖ al  ‖ (ℎ‖ a) Message 1-4. UAV GCS : E GCS,UAV UAV

Figure 3: Message fow depending on the communication environment. Messages 1-1–1-4 are those sent when communication between the unmanned aerial vehicle (UAV) and the Ground Control Station (GCS) is performed without problems.

Afer confrming that the message is sent from the GCS, the recognizes resending, and confrms that the hash value stored UAV sends a message (Message -) containing the memory in the UAV and received value are equal. If the two values are information along with the n − itotheGCS.ℎV and �V equal, the UAV returns value (n − i), which means the next (changed memory information) contained in Messages - hash value request. At that time, the UAV does not update are not encrypted and transmitted. Te reason is that, in our the stored counter value or hash value. If the two values are system, ℎV and �V can include very large amounts of data not equal, the UAV calculates the hash value of the received and encrypting these data is expected to have a very high value and compares it with the stored value. If the two values computational overhead. Even if the malicious user success- are equal, the UAV resets the counter value and updates the fully acquires the unencrypted ℎV and �V, the user cannot stored hash value. Afer that, the user and the UAV resume generate the false proof because the last hash value used in the normal communication. However, if the next hash value is erasure operation cannot be determined. Tese transactions not received from the GCS and the value of the counter are repeated � times if communication is good. When the inserted in the UAV reaches 0, the UAV performs an erasure number of authentication (�) exceeds � (i.e., the maximum protocol and sends a message (Message -)includingproof number of hash function calculations), the GCS and the of erasure and the memory information changed since the UAV reexchange the new � and � through Message - and most recent successful authentication to the GCS through Message -. a certain frequency. In order for the GCS to verify that the Figure 4 shows the message fow from when UAV could UAV performed an erase operation, the GCS must know the not receive the hash value from the GCS until the counter changed memory information in the UAV. However, afer the value reaches 0, at which point it initiates the erasure protocol. communication between the GCS and the UAV is lost, the If the GCS sends an authentication request message (Message GCS do not know information of data changed in the UAV. -) to the UAV but does not receive the authentication Tus, Message - contains the changed memory information response message from the UAV, the GCS waits the interval since the communication was disconnected. As our approach time and then sends again the message containing the same takes the best efort approach, the UAV repeatedly performs hash value. At this time, the � bit (i.e., the retransmission) the erasure operation afer transmitting the proof of erasure issetto1andincludedinthemessage(Message -). and continues to send the proof of erasure (Message -) If the UAV faced a temporary communication failure, it to the GCS. Te number of times an erase operation is will subsequently receive a reauthentication request message performed depends entirely on a given amount of time to the (Message -) from the GCS; the UAV checks the � bit, UAV afer the erase operation has begun. 6 Wireless Communications and Mobile Computing

Number of Hash GCS UAV Function Calculations Hash Value 1. Authentication Re Stored in UAV Stored in UAV quest Msg(2- n−i+1 1) n – i + 1 ℎ (x) interval 2 . Re-authentication Request Msg( 2-2) interval 3 . Re-authentication Request Msg(2-2)

4. Erasure Operation

5. Erasure Proof Msg(2-3) 6. Erasure Operation … 7. Erasure Proof Msg(2-4) … Best Efort

K ,ℎn−i (x)‖r ‖ Nonce  Message 2-1. GCS UAV : E GCS,UAV GCS K ,ℎn−i (x)‖r=1‖ Nonce  Message 2-2. GCS UAV : E GCS,UAV UAV Message 2-3. UAV GCS : PoE ‖ ℎl ‖ al Message 2-4. UAV GCS : PoE

Figure 4: Message fow depending on the communication environment. If the unmanned aerial vehicle (UAV) does not receive the hash value from the Ground Control Station (GCS) before the counter value reaches 0, the UAV proceeds with the erasure process and sends proof of erasure to the GCS.

C1 C2 C3 A1 A2 A3 A4 A5 A6 H1 H2 H3 … … … … … … ......

… … Cn … … … … … An ...... Hn

Cold Data Region Accumulated Data Region Hot Data Region

Storage of UAV

Figure 5: Unmanned aerial vehicle (UAV) storage divided into three regions: cold data region, accumulated data region, and hot data region.

.. Verifiable Erasure Protocol. For data erasure, our Te selected data is managed by recording the address approach classifes the UAV storage into three data regions value and data type in the seed block table. In the accumulated according to the expected number of writes of the data data region, the number of seed data selects �%ofthenumber � (Figure 5). Te frst region is a cold data region (i.e., data that of modifed data blocks from time n-1,atwhichtheprevious will never be modifed afer the UAV departs), the second seed block was selected to the current time �n,atwhichthe is the accumulated data region (i.e., data that will not be seed block selection occurs. Terefore, when selecting seed modifed afer it has been written; for example, video shot by data, the seed is selected uniformly regardless of the time at the UAV), and the third is a hot data region (i.e., data that which the data were generated. In the hot data region, the is actively modifed). Te mechanism assumes that the GCS number of seed data selects �ℎ� ℎ�� ���� ������ ���� × � % � knows the address values of the three data regions. among modifed data blocks from time n-1,atwhichthe Our proposed proof of erasure generation method previous seed block was selected to the current time �n,at requires the live tracking of accumulated data and hot data which the seed block selection occurs. Te value of � and (except for cold data that does not change because each data � can be directly selected by the user. If the GCS and the point has a dependency on each other in the proof of erasure). UAV are in very good communication (high bandwidth) and However, since tracking all the data causes considerable there is no problem sending and receiving a lot of data; the communication overhead, we randomly select data to act as GCS can increase the probative power of proof by setting the seeds and transmit authentication messages containing just variable (�) high. Conversely, if the GCS and the UAV are in those data (Figures 6 and 7). poor communication (low bandwidth) and cannot exchange Wireless Communications and Mobile Computing 7

t t t Time n-2 n-1 n

A 1 A 2 A 3 A 4 A 5 A 6 … … …

……… … … … A k-1 A k A k+1

H4 … … Hn-2 Hn-1 Hn Hn-2 Hn-1 Hn … … Data generated Data generated t t t t between n-2 and n-1 between n-1 and n Seed Block Table 0 C1 C2 C3 A 1 A 2 A 3 A 4 A 5 A 6 H1 H2 H3 Addrof A3 … … … … … … … A A A H ...... k-1 k k+1 4 … … … Cn … … … … … A n Hn-2 Hn-1 Hn … … … Cold Data Accumulated Data Hot Data 1 Addr of H4 1 Addr of Hn-1 Storage of UAV

Figure 6: Random selection of data from unmanned aerial vehicle (UAV) storage and the seed block table. Te gray block of memory indicates � data selected as seed blocks. Te seed blocks are selected from data generated between time n-1, at which the previous seed block was selected, and time �n at which the seed block selection occurs.

GCS … UAV t n-1 Random Selection of Data

Authentication Request Msg(1-3) Accumulated Data Hot Data

A3 A5 H4 ℎ‖ a Authentication Response Msg(1-4) in Message 1-4. Verify (Several Times of Authentications) Address of A3

Msg 1-4 …

Authentication Request Msg(1-3) t n Random Selection of Data

Ak Hn-1

Authentication Response Msg(1-4) ℎ‖ a in Message 1-4. Verify Msg 1-4 …

Figure 7: Process of sending randomly selected seed data to the Ground Control Station (GCS). Te gray blocks represent randomly selected memory blocks, and t represents an arbitrary time. 8 Wireless Communications and Mobile Computing

Figure 8: Memory state of the unmanned aerial vehicle (UAV) afer the predecessor step to erase accumulated data and hot data. In � accumulated data and hot data regions, the remaining blocks, excluding the seed data blocks, are overwritten with �� ,whichistheresultof the erasure operation of the cold data region.

large amounts of data, the GCS can lower the amount of region and the hot data region are erased. In other words, the communication data by setting the variable (�)verylow.In seed data of hot data are also calculated from (4). other words, the GCS can set the variable according to the � � =ℎ(� ) ⊕� situation where the GCS use our system. � � �−1 (3) If the UAV counter reaches zero, the erasure protocol � � =ℎ(� ) ⊕� (4) proceeds in the following order: erasure of cold data, erasure � � �−1 of accumulated data, and erasure of hot data. For erasure of Te process above is defned as the frst round. Starting cold data, the erasure operation is performed without any from the second round, the erasure process proceeds without predecessor because the GCS already knows all of the values. distinguishing the data region, unlike the frst round. Te � � Te UAV overwrites the frst block (C) of the cold data with UAV wipes  with the result of the XOR operation of the � � � � � � the last hash value received from the GCS. Te UAV performs  block and � block. Afer that, the UAV wipes  � �� � � an XOR operation of the C block of the memory and C with the result of the XOR operation of  block and  . block wiped with the last hash value received from the GCS Tis process continues until the last block of hot data is and wipes the C block with the calculated value. From now overwritten, without any distinction of the data region, to � � on, the wiped C block is denoted by  . Next, the C block complete the second round. To prevent forensic, we proceed is wiped to the value obtained by the XOR operation of the through multiple rounds. We assume that the repeatedly � �  block and the C block. Te repeated wiping of memory wiped memory cannot be recovered, so it is impossible proceeds in the same way (see (2)) until wiping the Cn block to extract information from the memory. Afer the second �� (i.e., the last block of cold data). round process, as proof of erasure, the block �� value is computed using a hash function, and the UAV sends that � �� =�� ⊕��−1 (2) value (proof of erasure) to the GCS as Frequency Shif Keying (FSK) via the low frequency. Te GCS who has received the For accumulated data region and hot data, the GCS cannot proof of erasure from the UAV will be able to calculate the knowallofthestoredvaluesandsopredecessorsarerequired. proof of erasure and confrm the validity of the message. Te address value of the changed data afer the last successful In the frst part of the erasure process, the last hash value authentication time should be recorded in the main memory. sent by the GCS is used as a seed in the erasure process Te remaining blocks, excluding the seed data block, are to enhance security. Te hash value used as the seed is an � overwritten with �� , which is the result of the erasure encrypted value, so an attacker cannot determine this value. operation of the cold data. Te bottom side of Figure 8 Eachvaluecalculatedbytheaboveerasureprocesshasa shows the memory state afer this predecessor step. Ten, dependency on the previous memory data. Terefore, even the UAV performs the XOR operation with the A block if the attacker knows some blocks of memory, a false proof of accumulated data and the A block of accumulated data cannot be generated. and wipes the A block with the result value. If confronted with the seed data (e.g., A) during the above operation, 4. Experimental Results � � the UAV performs the XOR operation with the  block of accumulated data and the hash value of the A block of To test performance of the devised protocol, we implemented accumulated data and wipes the A block with the result value an experimental environment using the T as the UAV (see (3)). In this way, all data stored in the accumulated data fight computer [20]; the T has a 128 Mbytes NOR fash Wireless Communications and Mobile Computing 9

50000

40000

30000 t t n-1 n 20000 W Write roughput (kbytes) roughput Write 10000     

0          ~60 ~90 ~30 ~60 ~90 ~30 ~120 ~180 ~210    ~150 ~120 ~180 ~210    ~150         00 30 60 00 30 60 90 90 120 150 180 120 150 180 Time Interval Time Interval (a) (b) 80000 70000 70000 60000 60000 50000 50000 40000 40000 30000 30000 20000 20000 10000 10000 Amount of Transferred Data (kbytes) Data Transferred of Amount Amount of Transferred Data (kbytes) Data Transferred of Amount 0 0 1st Auth 1st Auth 3rd Auth 3rd 4th Auth 5th Auth 4th Auth 5th Auth 3rd Auth 3rd 2nd Auth 2nd Auth Order of Authentication Message Order of Authentication Message

=2 =50 =2 =50 =10  = 100 =10  = 100 (c) (d)

Figure 9: Amount of data transferred according to the rate at which accumulated data are generated and �. (a) Constant amount of data written to the SD card over time interval; (b) random amount of data written to the SD card over time interval; (c) amount of transferred data when the data is generated as shown in Figure 9(a) on the unmanned aerial vehicle (UAV); and (d) amount of transferred data when the data is generated as shown in Figure 9(b) on the UAV.

memory, SD connector to interface, and SATA interface. at which accumulated data are generated and �.Teamount Trough this implementation, we conduct two experiments of transferred data in the experiment is the amount of data to investigate how practical the proposed scheme is: the that is sent when the UAV sends an authentication response communication overhead according to � and the erasure message to the user. To reproduce an environment similar to latency of the erasure process according to memory size. We a real UAV's storage, we used a data generator called iozone conducted experiments on SD cards inserted in the T. [21] to write dummy data to the SD card. For 3 minutes and Te block size used in the XOR operation was determined 30 seconds, a total of 100 Mbytes of accumulated data was based on how many 512-byte blocks were sequentially read at generated through the data generator. Figures 9(a) and 9(b) atimeandthesizeofoneblockusedinourexperimentwas show the amount of data written to the SD card over the time 1Kbytes, which was determined by reading 512-byte blocks interval. In Figure 9(a), the data generator always writes a twice in succession. constant amount of data, but in Figure 9(b), the amount of data to write largely changes with time. Figure 9(c) shows .. Communication Overhead Experiment. Tis experiment the amount of transferred data when the data is generated measured the amount of data transferred according to the rate as shown in Figure 9(a), and Figure 9(d) shows the amount 10 Wireless Communications and Mobile Computing of transferred data when the data is generated as shown in Figure 9(b). In both experiments, � was set to 2, 10, 50, and 100. Each experiment has two random selections 400 of data, and the random selection of data occurred at the time indicated by �n 1 and �n in Figure 9. For comparison, - 300 the random selection of data occurred at the same time in both experiments. Authentication was performed every 45 seconds. 200 In our system, the amount of transferred data afer a randomselectionofdataoccursisgreatlyincreased.As can be seen from the experimental results, the increased Latency (seconds) Erasure 100 amount of transferred data is highly dependent on �.Te amount of generated data also afects the amount of data 0 transferred, but the amount of data transferred is very small 1G 2G 4G 16G if � is small. In other words, our system makes it possible Memory Size (bytes) to select � depending on the situation, enabling timer- Erasure operation based erasure operation and erasure verifcation even when Transferring proof the maximum transmission amount of the UAV is largely restricted. In addition, even if the amount of generated data Figure 10: Experimental result for erasure latency according to the suddenly increases, a sample of generated data is selected memory size of the unmanned aerial vehicle (UAV). and transmitted, so it is possible to communicate more stably by sending less data during unstable communication situations. since communication was lost. As can be seen from the experimental results, if the UAV timer reaching zero, it has 8 minutes to perform the erasure operation 32 times at 1 Gbytes, .. Erasure Operation Overhead Experiment. In this exper- 17 times at 2 Gbytes, 8 times at 4 Gbytes, and 2 times at 16 iment, we measured the erasure latency according to the Gbytes. In other words, the UAV makes the best eforts in memory size of the UAV, where the erasure latency was the the given situation to erase the data and transfer the proof time from when the embedded timer reaches 0 to when of erasure. the proof of erasure is generated. Our approach divides the memory region into three regions according to the type of data and performs the erasure operation. So, we divided the 5. Conclusions memory used in the experiment into the following regions: 5% of the total memory for the cold data region, 85% for We have proposed a mechanism to provide proof of erasure the accumulated data region, and 10% for the hot data operations afer erasing data stored on UAVs, even if control region. Since we wanted the erasure latency to be afected of a remotely deployed UAV is lost. To do this, we used a only by memory size in this experiment, we equally set all countdown-based approach and hash chain to authenticate the variables except for memory size. In this experiment, the sender of received messages and to trigger the erasure the values of � and � were set to 2. Figure 10 shows each operation even afer communication is lost. Te accumulated number of the proof transmit for 1 Gbytes, 2 Gbytes, 4 data of the UAV is classifed into new data types; the features Gbytes, and 16 Gbytes memory size when a limited time of the data are actively utilized in the erasure operation, and of 8 minutes is given to the UAV that initiated the erasure the proof of erasure operation is transmitted so that the GCS operation. Te data transfer rate via FSK is recommended can verify whether the erasure operation has actually been to be 345 bits per second [22]. Terefore, our experiment performed. also sent 345 bits per second when transferring the proof of Our approach does not track all the data when generating erasure. Te amount of data changed afer communication proof of erasure, and so there is relatively little communica- wasdisconnectedandwasassumedtobe100Mbytes.Te tion overhead; instead, we select seed data to generate proof erasure latency recorded in the graph is the average value of of erasure. By allowing the amount of transferred data to erasure latency measured through three times of the erase generate proof through the value of �, timer-based erasure operation. and erasure verifcation are possible even when the maximum As we might expect, the erasure latency of frst round amount of transmission is greatly limited. Furthermore, since increased with the memory size. In the rest of the experiment, a UAV that had lost control and started the erasure operation except for the 1 Gbytes scenario, the diference between the would not know what situation it was in, we used the best erasure latency of the frst round execution and the erasure efort method to perform the erasure operation. latency of the second round execution was large. Tis is because the number of hash operations increased as memory Data Availability size increased. Te frst transmission of proof of erasure takes 9 seconds longer than the subsequent proof transmission, Te data used to support the fndings of this study are since it contains the address value of the changed data available from the corresponding author upon request. Wireless Communications and Mobile Computing 11

Conflicts of Interest [13] S. Kim, T.-Y. Youn, D. Choi, and K.-W. Park, “Securely control- lable and trustworthy remote erasure on embedded computing Te authors declare that there are no conficts of interest system for unmanned aerial vehicle,” in Proceedings of the rd regarding the publication of this paper. International Symposium on Mobile Internet Security (MobiSec), vol. article 8, pp. 1–9, Cebu, Philippines, 2018. Acknowledgments [14] S. Kim, T.-Y. Youn, D. Choi, and K.-W. Park, “Securely control- lable and trustworthy remote erasure on embedded computing Tis work was supported by the National Research Foun- system for unmanned aerial vehicle,” in Proceedings of the dation of Korea (NRF) [Grant no. NRF-2017R1C1B2003957] Research Briefs on Information & Communication Technology and by the Institute for Information & Communications Evolution (ReBICTE),vol.4,article3,2018. Technology Promotion (IITP) of the Korea government [15]D.PeritoandG.Tsudik,“Securecodeupdateforembedded (MSIT) [Grant no. 2019-0-00426 and no. 2018-0-00420]. devices via proofs of secure erasure,” in Proceedings of the European Symposium on Research in Computer Security,pp. 643–662, Springer, 2010. References [16] S. Dziembowski, T. Kazana, and D. Wichs, “One-time com- putable self-erasing functions,” in eory of Cryptography,vol. [1] D. Cenciotti, Iran Unveils New UCAV Modeled on Captured 6597 of Lecture Notes in Computer Science,pp.125–143,Springer, U.S. RQ-  Stealth Drone, Te Aviationist, 2016, https://theav- Heidelberg, Germany, 2011. iationist.com/2016/10/02/iran-unveils-new-ucav-modeled-on- [17] M. Ammar, B. Crispo, W. Daniels, and D. Hughes, “Speed: captured-u-s-rq-170-stealth-drone/. secure provable erasure for class-1 iot devices,” in Proceedings of [2]S.ShaneandD.E.Sanger,Drone Crash in Iran Reveals Secret the th ACM Conference on Data and Application Security and U.S. Surveillance Effort, New York Times, 2011, https://www.ny- Privacy (CODASPY), pp. 111–118, 2018. times.com/2011/12/08/world/middleeast/drone-crash-in-iran- [18] G. O. Karame and W. Li, “Secure erasure and code update in reveals-secret-us-surveillance-bid.html/. legacy sensors,” in Proceedings of the International Conference [3] L. A. White, “Te need for governmental secrecy: why the on Trust and Trustworthy Computing, vol. 9229, pp. 283–299, us government must be able to withhold information in the Springer International Publishing, 2015. interest of national security,” Virginia Journal of International [19] L. Lamport, “Password authentication with insecure communi- Law,vol.43,p.1071,2002. cation,” Communications of the ACM,vol.24,no.11,pp.770– [4] N.Uchida,S.Takeuchi,T.Ishida,andY.Shibata,“Mobiletrafc 772, 1981. accident prevention system based on chronological changes of wireless signals and sensors,” Journal of Wireless Mobile [20] M. Slonosky, “Trusted boot in COTS computing,” Military Embedded Systems, 2015, http://mil-embedded.com/articles/ Networks, Ubiquitous Computing, and Dependable Applications, trusted-boot-cots-computing/. vol. 8, no. 3, pp. 57–66, 2017. [5] C. Gritti, M. Onen,R.Molva,W.Susilo,andT.Plantard,“Device¨ [21] W. D. Norcott, “Iozone flesystem benchmark,” http://www identifcation and personal data attestation in networks,” Jour- .iozone.org/docs/IOzone msword 98.pdf. nal of Wireless Mobile Networks, Ubiquitous Computing, and [22] L. Deshotels, “Inaudible sound as a covert channel in mobile Dependable Applications (JoWUA),vol.9,no.4,pp.1–25,2018. devices,” WOOT,2014. [6] N. Uchida, K. Ito, T. Ishida, and Y. Shibata, “Adaptive array antenna control methods with delay tolerant networking for the winter road surveillance system,” Journal of Internet Services and Information Security (JISIS),vol.7,no.1,pp.2–13,2017. [7] F. Arena, P. Giovanni, and M. Collotta, “A survey on driverless vehicles: from their difusion to security features,” Journal of Internet Services and Information Security (JISIS),vol.8,no.3, pp. 1–19, 2018. [8]A.J.Kerns,D.P.Shepard,J.A.Bhatti,andT.E.Humphreys, “Unmanned aircraf capture and control via GPS spoofng,” Journal of Field Robotics,vol.31,no.4,pp.617–636,2014. [9] A. Rawnsley, “Iran’s alleged drone hack: tough, but possi- ble,” Wired, 2011, https://www.wired.com/2011/12/iran-drone- hack-gps/. [10] F. Hao, D. Clarke, and A. F. Zorzo, “Deleting secret data with public verifability,” IEEE Transactions on Dependable and Secure Computing,vol.6,pp.617–629,2016. [11]M.D.Leom,K.Kwang,R.Choo,andR.Hunt,“Remotewiping and secure deletion on mobile devices: a review,” Journal of Forensic Sciences,vol.61,no.6,pp.1473–1492,2016. [12] X.Yu,Z.Wang,K.Sun,W.T.Zhu,N.Gao,andJ.Jing,“Remotely wiping sensitive data on stolen smartphones,” in Proceedings of the th ACM Symposium on Information, Computer and Communications Security, ASIA (CCS),pp.537–542,ACM, Japan, 2014. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 7840917, 11 pages https://doi.org/10.1155/2019/7840917

Research Article Authorized Client-Side Deduplication Using CP-ABE in Cloud Storage

Taek-Young Youn ,1 Nam-Su Jho ,1 Kyung Hyune Rhee ,2 and Sang Uk Shin 2

1 Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Republic of Korea 2Department of IT Convergence and Application Eng., Pukyong National University, Busan 48513, Republic of Korea

Correspondence should be addressed to Sang Uk Shin; [email protected]

Received 17 January 2019; Revised 11 March 2019; Accepted 15 April 2019; Published 15 May 2019

Academic Editor: Ilsun You

Copyright © 2019 Taek-Young Youn et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Since deduplication inevitably implies data sharing, control over access permissions in an encrypted deduplication storage is more important than a traditional encrypted storage. Terefore, in terms of fexibility, data deduplication should be combined with data access control techniques. In this paper, we propose an authorized deduplication scheme using CP-ABE to solve this problem. Te proposed scheme provides client-side deduplication while providing confdentiality through client-side encryption to prevent exposure of users’ sensitive data on untrusted cloud servers. Also, unlike existing convergent encryption schemes, it provides authorized convergent encryption by using CP-ABE to allow only authorized users to access critical data. Te proposed authorized deduplication scheme provides an adequate trade-of between storage space efciency and security in cloud environment and is very suitable for the hybrid cloud model considering both the data security and the storage efciency in a business environment.

1. Introduction job or department. Tat is, diferent department employees have diferent access rights to information according to the Cloud computing greatly facilitates data providers who want various access control systems used [2]. Each employee can to outsource their data to the cloud without disclosing their only access fles that correspond to its privileges. If the sensitive data to external parties and would like users with deduplication technique does not check fle privileges, it certain credentials to be able to access the data [1]. Tis would violate fle access rights, which would bring some requires that the data be stored in an encrypted form that security problems. Since deduplication inevitably implies supports access control policies so that no one but a user data sharing, control over access permissions in encrypted with a particular type of attributes (privilege information) can deduplication storage is more important than a traditional decrypt the encrypted data. On the other hand, the amount encrypted storage. Terefore, in terms of fexibility, data of data stored in the storage of cloud storage providers (CSPs) deduplication should be combined with data access control is growing very rapidly, especially at the age of big data. techniques. Tat is, even if encrypted, the same data must Terefore, one of the important issues of the CSP is how be stored only once in the cloud, and it must be able to to efciently manage the ever-increasing data. One of the control access by other users based on the policy of the important techniques to solve this problem is deduplication. data owner. Existing secure deduplication techniques do not Deduplication is a special data compression technique to support access policy based encryption. In the current cloud remove redundant copies of repeated data and can be used computing environments, access policy-based encryption to efectively reduce data storage space and communication (typically ABE (Attribute Based Encryption) scheme) and overhead. secure deduplication are widely adopted separately. However, However, in the corporate environment, employees have the standard ABE technique does not support secure dedupli- diferent permissions to information depending on their cation, a technique that helps save storage space and network 2 Wireless Communications and Mobile Computing bandwidth by removing redundant copies of encrypted data information is applied when calculating an authentication in the cloud. Terefore, a design of a cloud storage system tag of the fle for deduplication. Te authentication tag is supporting both of these properties is required. generated by the private cloud functioning as the authorized In this paper, we propose an authorized deduplication server, and the private cloud possesses the privilege private scheme using CP-ABE (Ciphertext-Policy Attribute-Based key corresponding to the user’s privileges, which are not dis- Encryption) to solve this problem. Te proposed scheme tributed to the user. Terefore, users with the same privilege provides client-side deduplication while providing confden- information can generate the authentication tag identifying tiality through client-side encryption to prevent exposure deduplication, so deduplication is possible. Shin et al. applied of users’ sensitive data on untrusted cloud servers. Also, an access privilege to compute a convergent key [9]. Privilege unlike existing convergent encryption schemes, it provides information is applied in RSA blind signature based oblivious authorized convergent encryption by using CP-ABE to allow PRF protocol in order to allow only the authorized user to only authorized users to access critical data. Te proposed access to data. scheme satisfes security requirements and has advantages over existing schemes. Te proposed authorized deduplica- 2.2. CP-ABE. Sahai and Waters introduced Attribute Based tion scheme provides an adequate trade-of between storage Encryption (ABE) which is the public key cryptography space efciency and security in cloud environment and is very of one-to-many algorithm and encrypts the data based on suitable for the hybrid cloud model considering both the data the set of attributes [10]. Tere are two kinds of ABE security and the storage efciency in a business environment. schemes: Ciphertext-Policy ABE (CP-ABE) and Key-Policy Tis paper is organized as follows. Section 2 describes ABE (KP-ABE) [11]. In KP-ABE schemes, the ciphertext is related works. In Section 3, we propose and analyse an associated with a set of attributes while the user’s private authorized client-side deduplication using CP-ABE. And key is generated based on his corresponding access policy, Section 4 describes the optimization and discussions. Finally, while in CP-ABE schemes, a user’s private key is associated Section 5 is the conclusions. with a set of attributes, and ciphertext is encrypted under a specifed access structure [12]. A user is able to decrypt a 2. Related Works ciphertext only if the attributes associated with the cipher- text/private key satisfy the access policy related to his private 2.1. Secure Deduplication. Deduplication can be classifed key/ciphertext. Most ABE schemes [11, 12] are constructed into client-side deduplication and server-side deduplica- based on bilinear pairing. We also use CP-ABE scheme based tion, and client-side deduplication is advantageous from on bilinear pairing in [12]. the viewpoint of the efcient use of bandwidth. Terefore, In ABE scheme, attributes can be defned with diferent many studies on client-side deduplication have been made. types of privilege properties. An access structure ℘ will Despite the many advantages of deduplication technique, contain some authorized sets of attributes. So an access policy deduplication technique for the important data has caused can be defned as an access structure. An access structure can several new security problems. In particular, securing the be represented by an access tree [11, 12]. In this paper, we use confdentiality of outsourced data is a very important issue, the defnitions of an access structure and an access policy in and it may be considered to perform an encryption operation [11, 12], and the identities of parties may be replaced by the before outsourcing. When a conventional encryption is used, attributes. each user has a diferent key, so that diferent ciphertexts are computed for the same plaintext. Terefore, deduplication 3. Authorized Client-Side Deduplication cannot be achieved in this case. To solve this problem, Using CP-ABE Douceur et al. proposed a convergent encryption using a hash value of a plaintext as an encryption key [3]. Here, the In this section, we propose an authorized client-side dedupli- encryption scheme �() is a deterministic algorithm, and a cation scheme based on the CP-ABE. Te proposed scheme convergent key � depends only on the input data fle �.In provides the confdentiality through the client-side encryp- most cases, however, a convergent encryption is vulnerable tion to prevent the exposure of users’ sensitive data on to ofine brute-force attacks because the plaintext space of a untrusted cloud servers. It also enables deduplication of the given ciphertext � is not large enough (messages are ofen encrypted data through the convergent encryption, thereby predictable) [4]. An attacker can perform the encryption saving the storage space in the cloud server. Unlike existing operation for all possible plaintexts at the ofine stage to convergent encryption schemes, it provides an authorized fnd the corresponding plaintext information. To solve this convergent encryption based on CP-ABE to allow only problem, Bellare et al. proposed a technique called DupLESS authorized users to access critical data. that generates a convergent key through the interaction with a key server [5]. Duan et al. proposed a technique for 3.1. Te System and Treat Model. Te proposed scheme generating a convergent key with the help of other users consists of User (U), Authorization Server (AS), and Cloud without a key server [6]. Miao et al. also proposed a method Service Provider (CSP), as shown in Figure 1. Here, the AS can by modifying a single key server in DupLESS as a group of be regarded as a private cloud and the CSP can be considered key servers [7]. as a public cloud. Te AS is an entity that helps a user to In [8], Li et al. frst proposed a deduplication scheme that securely use a CSP. Also, the AS generates and manages to applies the privilege information. In this scheme, the privilege the user’s privilege secret key corresponding to the access Wireless Communications and Mobile Computing 3 privilege and computes an authorized convergent key for a Authorized deduplication Cloud Service fle by applying the privilege through the interaction with a Provider (CSP) user. Te CSP provides data storage services to users, and User (U) deduplication technology is applied to save storage space and cost. Te proposed scheme consists of the system setup pro- Privilege key generation & Authorized convergent cess, the authorized convergent key generation process, and key generation the authorized deduplication process. In the system setup Authorization process, the system parameters are generated and the key Server (AS) pair of each entity is securely generated. Te authorized convergent key generation process performs an interaction Figure 1: System model. for the CP-ABE-based convergent key distribution between the user and the authorization server. Finally, the authorized deduplication process supports the client-side deduplication 3.2.1. Te System Setup Phase. Users and the �� set their own of the encrypted data using the generated convergent key, and keys by the security parameter �.First,the�� selects the the proofs of ownership (PoW) [13] protocol for verifying following as the system-wide public parameters: the fle retention of the user are performed in this process (a concrete PoW protocol is not discussed in this paper, but a (i) A cyclic group � of a prime order � generated by �

PoW protocol such as Merkle tree-based scheme [13] can be (ii) A bilinear paring �:�×��→�� (a group �� of used). order �) We assume that the AS assumes an honest-but-curious �:{0,1}∗ �→ � trust model and the AS and the CSP do not collude. It is (iii) A hash function assumed that the user U can act maliciously. We consider (iv) A universe of privileges ℘={�1,�2,⋅⋅⋅ ,��} that the CSP may act maliciously due to insider/outsider attacks, sofware/hardware malfunctions, intentional saving Te �� selects values �, � ∈ Z� randomly. Ten a master � of computational resources, etc. [14]. According to this secret key (MSK) of the �� is (� ,�) and a public key (PK) � � assumption, we consider the following types of adversaries: is (�, � ,�(�,�) ). In addition, the �� generates a private key (i) Outside adversary: it tries to obtain sensitive informa- and a public key for the signing as follows: tion from the cloud storage or tries to access the fle (i) Private key for the signing, �←�Z� beyond its access privileges. � � (ii) Insider adversary: it can access the cloud storage (ii) Public key for the signing, �=� easily and try to get information out of the user’s encrypted data or fle tags. It is assumed that the public parameters and the public key information are securely distributed throughout the system. We consider the following security requirements that must be Auser� obtains the privilege secret key ��� correspond- satisfed according to the threat model: ing to his access privilege �� ={��}(�� ⊆℘)from the (i) Te confdentiality of data stored in cloud: except for �� over the secure channel. To do this, the user � sends the information about duplication, no information (���,��) to the �� to request the key generation. Afer about the outsourced data is revealed to an adversarial �∈ Z � ∈ Z �∈� choosing � � and � � � for each attribute �,the party. �� constructs the user’s privilege secret key ��� =(�= (�+�)/� � � � (ii) Te unforgeability of the fle tag: a user without appro- � ,{� =� ∙�(�) � ,� � } ) � � ∀�∈�� to securely transmit it priate privileges should be prevented from generating to the user � (see [12] for details on the privilege secret key fle tags. generation for CP-ABE scheme). Te user � securely stores (iii) Secure deduplication: secure deduplication is sup- the privilege secret key ���. Te user also generates a private ported without revealing any information except for key and a public key pair (���, ���)foradigitalsignature the information about duplication. scheme such as DSA [15].

3.2. Te Proposed Scheme. As shown in Figure 2, the pro- 3.2.2. Te Authorized Convergent Key Generation Phase. posed scheme consists of the system setup phase, the autho- Before requesting the fle upload, the user must perform rized convergent key generation phase, and the fle transfer the authorized convergent key generation phase. Te user � phase. In the system setup phase, the user and the AS set generates a convergent key �� for encrypting and uploading their own keys. Te authorized convergent key generation the fle � to the CSP as follows. Tis process modifes the phase generates a user’s convergent key for a fle by refecting RSA oblivious PRF-based convergent key generation method the user’s privilege information. Te authorized fle transfer of [5] and generates a convergent key refecting the user’s phase contains the fle upload and the fle download, and the privilege information by applying the CP-ABE scheme (see fleuploadiscomposedofthefrstuploadmoduleandthe [12] for details on the encryption and decryption process for deduplication module depending on whether or not the fles CP-ABE scheme). Figure 3 shows the authorized convergent are duplicated. Te detailed protocols are as follows. key generation process. 4 Wireless Communications and Mobile Computing

Private Cloud User AS CSP (U)

A. System setup phase Public Cloud

B. Authorized convergent key generation phase

C. File transfer phase

(1) File upload a) First upload procedure b) Deduplication procedure (2) File download Figure 2: Te proposed scheme.

User (U) AS

  Z AU = pi Private key for Sign x←R q x  (a+t)/b  t  t t  Public key for Sign y=g SKU: D=g , Dj =g ∙ H pi ,g ∀j∈A MSK: (ga,b) PK: (g, gb,e(g,g)a) ℎ = Hasℎ(F) Z IDU,m, access structure T r∈R q x  =mx (=(ℎ․ gr) =ℎx ․ yr) m=ℎ·gr  access structure T(⊆AU) for the file F CP-ABE Enc:  ,T Z s∈R q ․ C= · e(g, g)as s b   b s ACKM = T, C, g , Ci i∈T (g ) s s Ci =(g ,H(pi) ) for i∈T CP-ABE Dec with SKU : (a+t)/b bs․ as․ e(g ,g ) TK = e(g, g) = ts․ C e(g, g)  = TK =․ y−r =ℎx Verify the signature : e ℎ, y = e(, g) Convergent Key CK = E>@ ()

Figure 3: Te authorized convergent key generation process.

(i) A user � with access privileges �� ={��} constructs (ii) Te �� responds to the user � by constructing a request message for a convergent key generation for ���� (Authorized Convergent Key Material) as fle � as follows and sends it to the ��: follows: � � � � � � (a) compute ℎ=���ℎ(�),where���ℎ() is a (a) compute � =� (= (ℎ ⋅ � ) =ℎ ⋅�) � cryptographically secure hash function, ���ℎ : (b) perform CP-ABE ���(� ,�)with PK: ∗ � {0, 1} �→ Z�. � (1) select �∈�Z� and compute (� ) (where � is (b) select �∈�Z� � set to the value corresponding to the root (c) compute �=ℎ⋅� nodeofthetreeaccessstructure�) � �∙� (d) construct an access structure �(⊆��) for the (2) compute �=� ∙�(�,�) �� �� fle � (an access structure � is usually expressed (3) output �� =(� ,�(��) ) for the leaf node in a tree form; see [12] for details) and send �∈�(where �� is a share of � corresponding the request message for the convergent key to the leaf node � ofthetreeaccessstructure generation to the ��. �, see [12] for details) Wireless Communications and Mobile Computing 5

� � (c) send ���� = {�, �, (� ) ,{��}�∈�} to the user (i) Te ��� requests the fle transfer to the �. user �1. (ii) Te user �1 computes the ciphertext �� = (iii) Te user � uses ���� to generate a convergent key ���(��, �) by encrypting the fle � using �� and the fle tag ����. the convergent key ��,where���() is an authenticated encryption mechanism (a) perform ��-��� ���(���,����): such as AES in CTR mode [16]. Also, �1 V�� = ���ℎ(��, ��) �� = �(�, �)�∙� = computes (1) compute ��� = ���� (��, ����, ��, V�� ) �(�(�+�)/�,��∙�)/�(�, �)�∙� and ��� , where �������() is a signing algorithm using (�+�)/�∙(�∙�) (�+�)∙� �∙�+�∙� the user’s private key ���.Ten�1 sends � (�, �) � (�, �) � (�, �) � {��, ����, � , V��, ���} to the ���.Te (= �∙� = �∙� = �∙� � (�, �) � (�, �) � (�, �) user �1 stores the fle tag ���� and deletes (1) �, ��, ����,andsoon. � (�, �)�∙� ∙�(�,�)�∙� (iii) Te ��� verifes the signature ��� using = ) �∙� the user’s ���. If valid, store the fle record � (�, �) � {����,��,����,� , V��, ���1}.

�∙� � �� �� where �(�, �) =∏�∈�{�(� ∙�(��) ,� )/ (b) Deduplication module (see Figure 6) �� �� �∙�� �(� ,�(��) )} = ∏�∈�{�(�, �) } (for (i) If the fle � already exists, the ��� has � details of the computation, see [12]) the fle record {����,��,����,� , V��, � (2) compute � =�/�� ���0} from the old user �0. � −� � � � (3) compute �=� ∙� =ℎ (ii) Note that �1 has {����,� ,��,��} for the fle � through the authorized con- (b) verify that the signature of the �� is correct: vergent key generation phase such that �(ℎ, �) = �(�, �) � � {� ,��}={�,��} ̸ . (c) if the signature verifcation passes, the con- (iii) For the deduplication procedure, the ��� �� = (�) vergent key kdf and the fle tag frst performs PoW protocol with �1 for ��� =���ℎ(��) � are computed (where the the ciphertext �� = ���(��, �).Within () function kdf is a cryptographically secure key this protocol, the ��� sends V��, ����, � derivation function). and � to �1. If the user passes PoW protocol, the ��� assigns the fle reference 3.2.3. Te File Transfer Phase. Suppose that a user �1 requests to the user �1 and adds ���1 to the fle the ��� to upload a fle �.Te��� performs duplication record. check on the fle � and approves the upload of the encrypted (iv) Within the PoW protocol, the user �1 fle if there is no duplication. If there is duplication (that recovers �� using ���� and verifes his is, if the same fle is already stored), then perform the own access privileges as deduplication process. At this time, the PoW process is � 1 performed to confrm whether the user holds the fle actually. �� = ( ∙ ) kdf �� �� If the verifcation is passed, the owner of the fle stored in (2) the storage is assigned to the user �1 without uploading the and V�� == ���ℎ (��, ��) . actual fle. If the verifcation passes, �1 keeps the fle (1) File Upload (see Figure 4) tag ���� and deletes all other data from its own storage. (i) Note that the user �1 already has {��� ,�,��,��} � � for the fle through (2) File Download (see Figure 7) the authorized convergent key generation phase. (i) Te user �1 has attributes ��1 ={��} and its (ii) Te user �1 transmits the upload request mes- corresponding privilege secret key ���1 =(�= (�+�)/� � � � {�� ,��� ,�} � ��� � ,{� =� ∙�(�) � ,� � } ) sage �1 � for the fle to the � � ∀�∈��1 . using the fle tag ���� and the access structure (ii) �1 sends the File download request message � . with {���1,����} to the ���. ��� ��� (iii) Te uses the tag � and the access struc- (iii) Te ��� fnds ���� in the fle records and ture � to check whether a duplicate fle is stored load it. Te ��� checks whether ���1 is in the in the storage. If there is no duplication, perform user lists of the record. If exists, the ��� sends � the “First Upload” module. If the duplicate fle {��, ����, � , V��} to the user �1. exists, perform the “Deduplication” module. (iv) �1 performs ��-��� ��� on ���� with (a) First Upload module (see Figure 5) ���1 as follows and obtains �� = kdf(�): 6 Wireless Communications and Mobile Computing

User (U1) CSP

Convergent Key CK = E>@ () File Tag TagF =Hasℎ(CK)

IDU1, TagF, access structure T

Duplicate-check using TagF &T If there is no duplication, perform the [First Upload module] If there exists a duplication, perform the [Deduplication module]

Figure 4: File upload process.

User (U1) CSP

[First Upload module] File request

CT = Enc(CK, F) TK = Hasℎ(TK, CK) r sig = Signssk(CT, ACKM, y ,TK)  CT, ACKM, yr,TK,sig Verify sig using the user’s spk Save Tag F If valid, Delete the file F, CK, ACKM, ... store the fle record  r  TagF,CT,ACKM, y ,TK,IDU1 Figure 5: First upload module.

User (U1) CSP

 r  File record TagF,CT,ACKM,y ,TK,IDUo [Deduplication module] CT = Enc(CK,F) perform PoW protocol PoW protocol (include TK, ACKM, yr) CP-ABE Decrypt ACKM with SKU1: If pass the PoW, (a+t)/b bs․ assign the file reference to the user IDU1 as․ e(g ,g ) TK = e(g, g) = ․ and add IDU1 to the file record e(g, g)ts C 1 CK = E>@  ∙ TK yr Verify TK TK == Hasℎ(TK, CK) Save TagF Delete the file F, CK, ACKM, ... Figure 6: Deduplication module.

�∙� (�+�)/� (a) compute �� = �(�, �) =�(� , 4. Optimization and Discussions ��∙�)/�(�, �)�∙� � (b) compute � =�/�� 4.1. Optimization. One problem to be considered is that, in � −� � (c) compute �=� ∙� =ℎ the process of uploading duplicate fles, the user with the duplicatedflemustbeabletoensurethattheconvergent (v) �1 verifes V��. If pass, �1 obtains the fle �= key can be correctly restored through the ���� which is ���(��, ��). Finally, verify the fle as follows: derivedbythefrstuploaderandisstoredinthe���.One �(���ℎ(�), �) = �(�, �). possible solution is to verify the validity of the ���� in the Wireless Communications and Mobile Computing 7

  User’s attributes: AU = pi User (U1) a+b·t t  t  CSP SKU: g ,g, Hp( i)   File download request IDU1,TagF

Find TagF in the file records and load it Check whether ID is in the user lists of the record  r  U1 CT, ACKM, y ,TK If exists, send CT, ACKM, yr,TK to the user

CP-ABE Decrypt ACKM with SKU1: (a+t)/b bs․ as․ e(g ,g ) TK = e(g, g) = ․ e(g, g)ts C  = TK  = =ℎx yr CK = E>@ () Verify TK: TK == Hasℎ ,CK F=Dec(CK, CT)

Verify the file : e Hasℎ (F),y == e , g

Figure 7: File download process.

PoW protocol because the PoW protocol must be performed relation between the existing access structure � and the newly � in the deduplication process. We do not describe the specifc requested access structure � . PoW protocol in this paper, but as a simple solution to � this purpose, we introduced the V�� value into the PoW (i) In case of � ⊆�: In this case, since the access �� protocol. Te verifcation of the validity of the ���� can structure of the fle to which the duplicate upload � be performed using the value of V�� in the PoW process. is requested is included in the access structure of To do this, the V�� value must be stored together as an the already stored fle, it is not necessary to upload it element of the ���� in the frst upload process. When the because it can be restored by the information of the ��� delivers the V�� value to the user in the deduplication fle record already stored in the CSP.Tatis,thePoW � process, the user can verify that the value of � =�/�� protocol is executed within the deduplication module �∙� (=�/�(�,�) ) derived from his own privilege information is to confrm the retention of the fle, and then the access valid. Ten, it can be confrmed that the convergent key can rights to the fle record stored in cloud are assigned be derived through the ���� stored in the ���.Inorder without uploading the fle. � to support this, one hash computation overhead is added (ii) In case of �⊆�: In this case, since the access in the frst upload process. In the ���, there is additional structure � of the already stored fle is included in � storage overhead of V�� value. Finally, in the deduplication the access structure � of the fle requested to be process, one hash computation overhead for the verifcation uploaded, it is necessary to update the existing fle of V�� value and the cost for evaluating the convergent key record in order to save only one fle in the CSP. To are added. do this, the user uploads related data such as the � � Also, another consideration is the possibility of opti- ciphertext �� computed with the access structure � . mizing metadata for the duplicated fle. If the fle � with Te CSP replaces the information of the existing fle access structure � is already stored in the ��� with ����, � record with the uploaded information. Existing users a user with a diferent access structure � may request the � can restore the fle through the updated fle record. same fle � to be uploaded with ��� (== ��� ).Tatis,if � � (iii) If not for the above two cases: Tis is the case diferent access attributes are specifed for the same fle, there � where the access structure � and � do not satisfy should be some consideration as to whether to deduplicate the inclusion relation. In this case, the user uploads it. Te fle upload process considering this can be performed � {�����,�� } as follows: for the fle upload request of the user, the CSP , and the CSP adds this information � checks whether the corresponding fle tag ��� exists and to the existing fle record corresponding to the fle � ��� performs “First upload module” if it does not exist (that tag �. In this case, when the user requests to is, if there is no duplication). If the corresponding fle tag download the fle, the CSP sends the appropriate � � �� exists (in case of duplicate upload), the access structure � {���� ,� } corresponding to the requesting user. is compared. It works diferently depending on the inclusion Tis allows the user to restore the fle. 8 Wireless Communications and Mobile Computing

4.2. Security Analysis. Our scheme provides an authorized proposed method can ensure security of a plaintext data even deduplication of encrypted data stored in cloud. Te security with a low entropy because �� has encrypted with a key �� of the proposed scheme depends on the security of the used derived by interacting with the AS through the authorized cryptographic primitives such as hash functions, symmetric convergent key generation process. Te convergent key �� and asymmetric encryption algorithms. In the security of the is derived by a value to which the private key of the AS proposed scheme, the AS plays an important role, and so it is is applied, which is encrypted and distributed by CP-ABE assumed that the AS does not collude with an attacker. algorithm with the user’s privilege attributes. As long as an attacker does not know the AS’s private key or cannot (1) Te Proposed Scheme Provides break CP-ABE algorithm, he cannot obtain �� and therefore an Authorized Deduplication the confdentiality of data is preserved if the encryption algorithm ���() is secure. Proof. Te proposed scheme deduplicates the fles with the A legitimate user behaving as a malicious attacker can same contents and the proper privileges. Te same fles always � threaten the AS and the CSP.WhenattackingtheAS,an derivethesamevalue� (=ℎ,whereℎ=�(�)). However, attacker has to address the discrete logarithm problem to � is distributed to the user in encrypted form using CP- obtain the private information of the AS,andsoitcanensure ABE with the access structure � by the AS.Tefletag��� � the security of the AS.Whenattackinganencrypteddata for verifying duplication is derived from the convergent key stored in the CSP, an attacker has to frst pass a proofs of �� which is computed from �.So,whentheusershave ownership protocol for the fle. To do this in the proposed the proper privilege the same convergent key �� is derived scheme, an attacker has to construct a proof for the encrypted and therefore deduplication is possible. An unauthorized user fle in the PoW process. Tus it depends on the security of cannot obtain the same convergent key �� because of the the PoW protocol used (for example the Merkle Tree-based property of CP-ABE. To get ��, an attacker needs to get �. proofs of ownership scheme presented in [13] can be used). To do this, it has to calculate CP-ABE decryption or know Tatis,foranattackerwhodoesnotownthefle,itprovides the AS’s private key �.Ifanattackerdoesnothavetheproper the confdentiality of data stored in cloud. access privileges or know the AS’s private key �, he cannot compute CP-ABE ���() and obtain �.Tus,ourscheme 4.3. Discussions. Li et al. [8] frst proposed the deduplication provides an authorized deduplication under the assumption scheme applying the privilege information. In this scheme, of the security of CP-ABE scheme and the private key. the fle authentication tags for the deduplication are com- When a new user �1 with privileges equal or higher puted by applying the privilege information and these tags are than an old user �0 uploads the same fle that is already generated by the AS.TeAS possesses the privilege private stored in cloud, the CSP performs the deduplication process. key corresponding to the user’s privileges and this privilege Te security of the proposed scheme against ofine brute- private key does not be distributed to the user. force attack is guaranteed by the convergent key generation While Li et al. scheme supports only simple access procedure as in [5]. In order for an attacker with lower attributes, the proposed scheme can support hierarchical permissions than an old user �0 to perform online brute- access attributes, and so direct comparisons are difcult. Li force attacks, it is necessary to forge the tag on the fle. Te et al. scheme [8] has the following shortcomings. First, each unforgeability of the fle tag is guaranteed by (2) below. Tus, access privilege is represented by a private key. If a user theproposedschemeprovidessecurededuplication. has multiple access privileges, the AS needs to keep private keys securely as many as the number of access attributes, (2) Te Proposed Scheme Guarantees the Unforgeability of which may cause a great deal of trouble in the user key the File Tag management in the AS.Also,whenauseruploadsafle� that has assigned � access privileges, the AS needs to generate Proof. In the proposed scheme, the user generates the fle tag � fle tags for � and sends them to the user. Ten, the user ��� =���ℎ( (�)) � kdf through the blind signature scheme must send these tags to the CSP for the fle uploading, so it �(=ℎ∙��) with the AS.Teusersends to the AS,andtheAS causes large network trafc. In addition, Li et al. scheme has �� =�� � blindly signs it, with its own private key .Andthen a disadvantage in terms of the number of keys to be stored in ���() the AS encrypts by CP-ABE with the access structure the cloud server when compared with the proposed scheme. � � . An unauthorized user has to obtain a valid in order to For a duplicated fle, Li et al. scheme has to store keys as many forge the fle tag. Tis requires obtaining the AS’s private key as the number of users. In the case of Li et al. scheme, each or breaking the CP-ABE. Terefore, if the private key of the user � stores the encrypted convergent key �� in the cloud by AS is kept secure and the CP-ABE algorithm is secure, the encrypting the convergent key �� (i.e., �� = ���(���,��)) unforgeability of the fle tag is guaranteed. using his secret key ���, so that the number of keys stored in the cloud is equal to the number of users. However, (3)TeProposedSchemeGuaranteesTatOnlyEligible the proposed scheme only needs to store one encrypted ���� Users Can Access the Plain Data. Tat Is, It Provides the key material, . Te authorized user can restore the ���� Confdentiality of Data Stored in Cloud convergent key via . Meanwhile, Li et al. scheme has another disadvantage that the AS is always involved in the fle Proof. An attacker who colludes with the CSP can access upload process. Tis adds a lot of overhead to the AS.Inthe an encrypted data �� and metadata ����.However,the proposed scheme, the AS participates only in the convergent Wireless Communications and Mobile Computing 9

Table 1: Comparison of the storage overhead on the user and the CSP.

Li et al.’ scheme [8] Proposed scheme

User ��∗ (2∗���∗h len) 1∗ (ep len∗ (��� +2)) + ��∗ (1∗h len)

CSP ���� ∗{���∗h len + 1∗ct len + ��∗ (���∗senc len)}���� ∗{1∗h len + 1∗ct len + 1∗ (ep len∗ (��� +2)) + 1∗ ep len +1∗h len }

Notes.��: the number of fles owned and uploaded by one user; ����: the number of fles that are uploaded by all users and managed on CSP; ��:thenumber of deduplicated fle uploads; ���: the number of access privileges assigned to the fle by the user; ct len: the bit-length of the encrypted fle; ep len: the bit- length of the elliptic curve pairings; h len: the bit-length of the output of the hash function; senc len: the bit-length of the output of the symmetric encryption.

Table 2: Te bit-length overhead on the user and the CSP.

Li et al.’ scheme [8] Proposed scheme User 512,000 bits 28,672 bits CSP 115,360,000 bits (110.02 Mbits) 103,840,000 bits (99.03 Mbits) key generation process and thereafer does not involve in the 100,000 92,820 protocol. Tis reduces the processing burden of the AS. 90,000 Of course, the proposed scheme that supports more 80,000 complex and hierarchical privileges has higher computational 70,000 60,000 complexity than existing schemes. Te application of CP- 50,000 ABE scheme to the convergent key generation process has the 40,000 28,820 advantages of higher security and the utilization of hierarchi- 30,000 cal privileges, but the computational overhead has increased. 20,000 14,420 12,980

Tis computational overhead does not pose a problem for KB) (unit: overhead storage 10,000 12,980 12,980 the practicality of the proposed scheme. According to [17], − it is reported that the encryption and decryption take about 10 100 500 3.3 ms and 6.2 ms, respectively, in the case of the lightweight Nd CP-ABE scheme (the number of user attributes is 600). Tus, Li et al.’ scheme [8] in case of applying the lightweight CP-ABE scheme such as Proposed scheme [18, 19], this amount of computing costs will not be a big � problem for practical use. Figure 8: Storage overhead on the CSP as the � value changes. We compare the storage complexity of the proposed scheme with the existing scheme. Table 1 shows the storage overhead on the user and the CSP. Te proposed CP-ABE-based authorized convergent In order to make the comparison result more clear, sup- encryption scheme can apply much more complex and pose we have chosen the following bit-length as key lengths various attribute privileges than existing schemes. Because it of the cryptographic primitives to ensure 128-bit security (see can utilize a hierarchical access structure that can be used in [20]): 128-bit for the symmetric encryption algorithm, 256- the CP-ABE techniques, it is very suitable for data sharing bit for the hash function and 256-bit for the elliptic curve services in cloud storage. Since the convergent encryption pairings. And let ��=100, ����=1000, ��=10, ���=10, and key generated by the proposed scheme can be derived only ct len=100,000. Table 2 shows the specifc storage overhead by the authorized user, the leakage of the fle information by of both schemes. the unauthorized user in the deduplication process can be Ontheuserside,itcanbeseenthatthestoragecomplexity blocked. of the proposed scheme is reduced by 483,328 bits (Tat is, it can save about 94%). On the CSP, the proposed scheme is 5. Conclusion about 11,520,000 bits small, so it can save about 10%. Tese results show that the proposed scheme has advantages in In this paper, we proposed the authorized deduplication terms of storage space complexity. scheme using CP-ABE. Te proposed scheme provides client- Te advantages of the proposed scheme are further side deduplication while providing confdentiality through clarifed by the results of the comparison below. Figure 8 client-side encryption to prevent exposure of users’ sensitive shows the storage overhead on the CSP when only the �� (the data on untrusted cloud servers. Also, unlike existing con- number of deduplicated fle uploads) value is increased in the vergent encryption schemes, it provides authorized conver- above comparison. Figure 9 also shows the storage overhead gent encryption by using CP-ABE to allow only authorized when the ��� (the number of access privileges assigned to the users to access critical data. Te proposed CP-ABE-based fle by the user) value increases. If the number of duplicate authorized convergent encryption scheme can apply much fles or the number of access privileges increases, we can see more complex and various types of attribute privileges than that the proposed scheme is more advantageous than the existing schemes. Also, it satisfes security requirements. And scheme of [8]. the proposed scheme has advantages over existing scheme in 10 Wireless Communications and Mobile Computing

200,000 192,000 19,000 180,000 18,000 18,260 160,000 17,000 140,000 16,340 128,000 16,000 120,000 15,000 100,000 14,000 14,420 80,000 13,620 64,000 13,000 13,300 60,000 12,980 40,000 12,000 20,000 3,584 3,904 4,224 11,000 storage overhead (unit: Byte) (unit: overhead storage − KB) (unit: overhead storage 10,000 10 20 30 10 20 30

Nap Nap Li et al.’ scheme [8] Li et al.’ scheme [8] Proposed scheme Proposed scheme (a) On the user (b) On the CSP

Figure 9: Storage overhead as the ��� value changes.

terms of the AS’s burden and storage overhead. Te proposed [6] Y. Duan, “Distributed key generation for encrypted dedupli- scheme is very suitable for data sharing services in the cation: Achieving the strongest privacy,” in Proceedings of the enterprise’s hybrid cloud storage model. 6th edition of the ACM Workshop on Cloud Computing Security (CCSW’14),pp.57–68,ACM,Scottsdale,Arizona,USA,2014. [7]M.Miao,J.Wang,H.Li,andX.Chen,“Securemulti-server- Data Availability aided data deduplication in cloud computing,” Pervasive and Mobile Computing,vol.24,pp.129–137,2015. No data were used to support this study. [8]J.Li,Y.K.Li,X.Chen,P.P.C.Lee,andW.Lou,“Ahybrid cloud approach for secure authorized deduplication,” IEEE Conflicts of Interest Transactions on Parallel and Distributed Systems,vol.26,no.5, pp. 1206–1216, 2015. Te authors declare no conficts of interest. [9] T.-Y.Youn,K.-Y.Chang,K.H.Rhee,andS.U.Shin,“Authorized client-side deduplication using access policy-based convergent encryption,” Journal of Internet Technology,vol.19,no.4,pp. Acknowledgments 1229–1240, 2018. Tis work was supported by Electronics and Telecommunica- [10] A. Sahai and B. Waters, “Fuzzy identity-based encryption,” in tions Research Institute (ETRI) grant funded by the Korean Proceedings of the 24th Annual International Conference on the government [18ZH1200, Core Technology Research on Trust Teory and Applications of Cryptographic Techniques (Advances in Cryptology – EUROCRYPT 2005,vol.3494,pp.457–473, Data Connectome]. Springer-Verlag, LNCS, Aarhus, Denmark, 2005. [11]V.Goyal,O.Pandey,A.Sahai,andB.Waters,“Attribute- References based encryption for fne-grained access control of encrypted data,” in Proceedings of the 13th ACM Conference on Computer [1]H.Cui,R.H.Deng,Y.Li,andG.Wu,“Attribute-basedstorage and Communications Security (CCS ’06),pp.89–98,ACM, supporting secure deduplication of encrypted data in cloud,” Alexandria, Virginia, USA, November 2006. IEEE Transactions on Big Data, 2017, Early Access. [12] J. Bethencourt, A. Sahai, and B. Waters, “Ciphertext-policy [2] P. Mell, J. Shook, R. Harang, and S. Gavrila, “Linear time attribute-based encryption,” in Proceedings of the IEEE Sym- algorithms to restrict insider access using multi-policy access posium on Security and Privacy (SP ’07), pp. 321–334, IEEE, control systems,” Journal of Wireless Mobile Networks, Ubiqui- Berkeley, CA, USA, 2007. tous Computing, and Dependable Applications (JoWUA),vol.8, [13] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, “Proofs no.1,pp.4–25,2017. of ownership in remote storage systems,” in Proceedings of [3]J.R.Douceur,A.Adya,W.J.Bolosky,D.Simon,andM. the 18th ACM Conference on Computer and Communications Teimer, “Reclaiming space from duplicate fles in a serverless Security, CCS’11, pp. 491–500, ACM Press, 2011. distributed fle system,” in Proceedings of the 22nd International [14] T.Youn,K.Chang,K.Rhee,andS.U.Shin,“Efcientclient-side Conference on Distributed Systems (ICDCS 2002), pp. 617–624, deduplication of encrypted data with public auditing in cloud IEEE, Vienna, Austria, 2002. storage,” IEEE Access,vol.6,pp.26578–26587,2018. [4] D. Harnik, B. Pinkas, and A. Shulman-Peleg, “Side channels in [15] C. F. Kerry, A. Secretary, and C. R. Director, “FIPS PUB 186- cloud services: Deduplication in cloud storage,” IEEE Security 4 federal information processing standards publication digital &Privacy, vol. 8, no. 6, pp. 40–47, 2010. signature standard (DSS),” National Institute of Standards and [5] S. Keelveedhi, M. Bellare, and T. Ristenpart, “Dupless: Server- Technology (NIST),2013. aided encryption for deduplicated storage,”in Proceedings of the [16] M. Dworkin, Recommendation for Block Cipher Modes of Oper- 22nd USENIX Security Symposium (USENIX Security 13),vol.12, ation Methods and Techniques, NIST, USA, 2001, NIST-SP-800- pp. 179–194, USENIX, Washington, DC, USA, 2013. 38A. Wireless Communications and Mobile Computing 11

[17] V. Odelu, A. K. Das, M. Khurram Khan, K.-K. R. Choo, and M. Jo, “Expressive CP-ABE scheme for mobile devices in IoT satisfying constant-size keys and ciphertexts,” IEEE Access,vol. 5, pp. 3273–3283, 2017. [18] Y. Zhang, D. Zheng, and X. Chen, “Computationally efcient ciphertext-policy attribute-based encryption with constant- size ciphertexts,” in Provable Security,vol.8782,pp.259–273, Springer, 2014. [19] H. Tsuchida, T. Nishide, and E. Okamoto, “Expressive cipher- text-policy attribute-based encryption with fast decryption,” Journal of Internet Services and Information Security (JISIS,vol. 8, no. 4, pp. 37–56, 2018. [20] E. Barker and A. Roginsky, “Transitioning the use of cryp- tographic algorithms and key lengths,” Draf NIST Special Publication 800-131A Revision 2, National Institute of Standards and Technology (NIST),2018. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 6974809, 11 pages https://doi.org/10.1155/2019/6974809

Research Article A Multiuser Identification Algorithm Based on Internet of Things

Kaikai Deng , Ling Xing , Mingchuan Zhang , Honghai Wu , and Ping Xie

School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China

Correspondence should be addressed to Ling Xing; xingling [email protected]

Received 18 January 2019; Revised 2 April 2019; Accepted 11 April 2019; Published 2 May 2019

Guest Editor: Vishal Sharma

Copyright © 2019 Kaikai Deng et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the rapid development of the Internet of Tings (IoT) in 4G/5G deployments, the massive amount of network data generated by users has exploded, which has not only brought a revolution to human’s living, but also caused some malicious actors to utilize these data to attack the privacy of ordinary users. Terefore, it is crucial to identify the entity users behind multiple virtual accounts. Due to the low precision of user identifcation in the many-to-many mechanism of user identifcation, a random forest confrmation algorithm based on stable marriage matching (RFCA-SMM) is proposed in this study. It consists of three key steps: we frst employ the stable marriage matching model to calculate the similarity between multiple users and utilize a scoring model to calculate the overall similarity of the users, afer which candidate matching pairs are selected; second, we construct the random forest model that exploits a user similarity vector training set; aferward, the candidate matching pairs combine the secondary confrmation of the random forest model, which both improve the precision of the many-to-many user identifcation and protect private user data in the IoT. Extensive experiments are provided to demonstrate that the proposed algorithm improves precision rate, recall rate, and F-Measure (F1), as well as Area Under Curve (AUC).

1. Introduction identify users of multiple virtual accounts [3, 4], allowing user data to be better protected in the IoT era. As 4G/5G technology continues to evolve, it provides people User identifcation is also referred to as user matching. with more efcient performance and increased speed in Many studies have addressed the user identifcation problem order to meet higher standards of data services for more by examining user profle information attributes, primarily users. At the same time, the higher speed and more reliable the user’s personal information and published content, which transmission of mobile communication further promote the includes username, geographical location, blog posts, etc. development of the Internet of Tings (IoT) in the large- [5–13]. Although missing data is an issue in the process of scaleera.Teamountofuserdataisconstantlyincreasing flling in these attributes, they can still be flled in through in the IoT context. Leveraging user data to analyze the the use of appropriate methods. Moreover, some attributes social behavior of users can enable the provision of a safer play an extremely important role in the process of user social environment. According to statistics, 42% of users identifcation. Terefore, user identifcation based on user use multiple social networks simultaneously [1]. Te IoT attribute information can better accomplish the work of iden- integrates diferent social methods to meet people’s diferent tifcation. Some of the research on this topic focuses on the needs to the greatest extent possible [2]. For instance, RenRen use of network topology for user identifcation. Tis research and Sina Microblog are services used to share personal mainly relies on the user’scircle of friends to identify a specifc statuses and publish blogs anytime and anywhere in China. user [14–19]. Te similarity between user accounts is judged However, as there is no direct link between user data on by analyzing nodes between users. However, due to the het- these services, a complete social network map is difcult erogeneity of the network structure in practical applications, to obtain. Multiuser identifcation is therefore employed to this method requires improvement in terms of its precision. 2 Wireless Communications and Mobile Computing

Tis study is divided into three sections: user profle data, 2.1. User Profle Information-Based User Identifcation. user-generated data, and user-associated data, according to Research based on user profle information for the purpose the type of IoT data involved. User profle data refers to of solving user identifcation problems primarily focuses the data that the user needs to enter or select when they on personal information. Te classifcation model is register for their accounts. User-generated data refers to the constructed, afer which the corresponding matching data that is the user’s published content. User-associated strategy is used to complete user identifcation. Raad et data refers to the data that users associate with other users. al.[20]proposedtheFriendofaFriend(FOAF)attribute Among the numerous user attributes, some attributes are matching strategy. Ye et al. [21] proposed an objective more important to the task of user identifcation, such as weighting method for user attributes to integrate user web links, blog posts, etc. Conversely, some attributes such as attribute information and complete the calculation of gender, age, etc. are less important in this context. Terefore, user profle similarity. Cortis et al. [8] proposed an reasonable allocation of corresponding weights to users can identity recognition algorithm that assigns weights to also improve the precision of user identifcation to a certain individual attributes of user profle information and then calculates the similarity among attributes with reference extent. to both grammatical and semantic aspects. Able et al. [22] A stable marriage matching algorithm is mainly used to aggregated user profle information in order to match fnd and solve a stable matching pair problem and has been users. Zamani et al. [23] took the user's unit, interests, widely used in the felds of economics and computing for and other attribute information into full consideration, this purpose [20]. Stable matching refers to cases in which integrating the similarity of multiuser attributes via the equal thetwoidentifeditemsaremutuallyoptimalchoicesrather evaluation model and complex mixed training model; this than the frst choice in the current match and where there improves the possibility of correctly identifying users owing is no better match to any one element in the unmatched to the personalized characteristics of many users’ attribute elements. Meng Bo proposed that the ranking-based cross- information. Terefore, leveraging user attribute information matching algorithm could also adopt the concept used in to achieve user identifcation is a good choice. the stable marriage matching algorithm. Te algorithm uses profle attribute similarity (PAS) and user surrounding score 2.2. Network Structure-Based User Identifcation. Network (USS) to select candidate matching users and then uses the structure-based studies on user identifcation mainly focus user matching score (UMS) to determine the users that match on recognizing identical users by examining the user’s circle with the candidate users. In order to further improve the of friends. Te user’s friend relationships are easy to obtain, matching precision, a cross-matching process inspired by the the problem of malicious imitation and forgery is less likely stable marriage matching algorithm is added. Finally, the to occur, and the importance of information coupling of matching user pairs are obtained by a simple pruning process. local topology on network development has been certifed. However,theideabehindthestablemarriagematching Narayanan et al. [14] proved for the frst time that user algorithmisthatallusersshouldmatch;thus,itisdifcult identifcation can be accomplished by relying on the topology to guarantee the precision of the fnal result. Te existing structure of the network; however, the precision of the user identifcation algorithm, which is based on supervised matching results required improvement. Bartunov et al. learning, is guaranteed in accuracy, but can only achieve one- [15] proposed the construction of the objective function to-one user matching. by combining attribute information and network structure Accordingly, a random forest confrmation algorithm information and then optimized the function to obtain basedonstablemarriagematching(RFCA-SMM)ispro- the optimal matching pair. Cui et al. [16] integrated user posed in this study. Te proposed algorithm combines stable profle information similarity and graph similarity to achieve marriage matching and a scoring model to obtain the overall mapping from an email network to a Facebook network. similarity of users in addition to candidate matching pairs. Liu et al. [17] proposed the HYDRA approach to model- ing user behavior by employing user attributes and user- Te similarity calculation of user attribute data is used to generated content. Korula et al. [18] abstracted the problem construct the user similarity vector, while the training set of user identifcation into a mathematical defnition, arguing of the user similarity vector is leveraged to generate the that diferent social networks are generated by user graph random forest model. Te fnal matching results of candidate structure through probability and that the selection process matching pairs can be obtained by means of random forest of graph edges is one of approximate probabilities. Tan et confrmation. al. [19] modeled users’ social relations and mapped users to low-dimensional space to improve the efciency of user 2. Related Works identifcation in the network. However, there is heterogeneity between nodes in the actual network structure, and the Multiuser identifcation technology is signifcant in both infuence of this heterogeneity is ignored in the calculation research and practice in many important felds. Current process; therefore, the precision of this method in the context studies of multiuser identifcation can be divided into three of user identifcation requires improvement. categories according to the way feature information is used: user identifcation based on user profle information, network 2.3. User-Generated Information-Based User Identifcation. structure, and user-generated information. User identifcation based on user-generated information Wireless Communications and Mobile Computing 3 mainly relies on the content published by users. Now that user with missing data has a low relational degree with other the Internet of Tings (IoT) has a close relationship with users, the data will not be flled. Te relevant formula is as our daily lives, people can immediately post their own follows: dynamic content and comment on the content posted by the friends around them. As user behavior habits are not ��� =3∑ ��� +2∑ ��� + ∑ ��� (1) easy to change and hide [24], these habits can correspond strongly with the characteristics of the users themselves. (2) Speculated flling: the missing data is inferred from Terefore, the use of data mining algorithms to discover other attributes. Tis method is mainly used for user gender these hidden association rules [25] can greatly improve the flling. Te blog posts published by the user best refect recognition rate of user identifcation. Goga et al. [10] used the characteristics of the user’s personality; thus, by using the geographic locations, timestamps, and writing habits of the user’s blog post information, the Bayesian classifcation users’ published statuses for user identifcation purposes. Li model [30] can be employed to accomplish user gender et al. [13] proposed a supervised machine learning algorithm speculation. to solve the user identity recognition problem based on user- Te Bayesian classifcation model is constructed as fol- generated content. In recent years, the development of mobile lows: communication technology has made a great contribution to �(�� |�)�(�) the incorporation of geo-tagging when users publish their �(�|��)= (2) statuses. As the user’s track of action is not easy to imitate, �(��) the application of geographical location attributes to user identifcation opens up a new method of identifying users. where �(� | ��) denotes the probability that the user is Cao et al. [26] proposed an identifcation method for process- identifedasmalewhentheword�� occurs, �(�� |�) ing multisource data by utilizing the cooccurrence frequency denotes the probability of the word �� occurring in all males, of two user trajectories. Hao et al. [27] proposed that user �(��) denotes the probability of the user being male, and trajectories are transformed into sequences composed of �(�) denotestheprobabilityoftheoccurrenceofword��. multiple grids, which are in turn transformed into vectors Given the complexity of calculating �(�� |�),thisarticle by using a TD-IDF model, afer which the similarity of user calculates the conditional independence na¨ıve hypothesis. trajectories is calculated via cosine similarity. Han et al. [28] Teformulaisasfollows: proposed that each geographic coordinate point should be � represented as a corresponding semantic position. Te user’s �(�� |�)=∏�(�� |�) trajectory can thus be represented by the text composed of �=1 (3) the semantic position, with the LDA model being used to =�(� |�)�(� |�)⋅⋅⋅�(� |�) represent the user’s topic distribution; fnally, the similarity 1 2 � of the user trajectories is calculated to determine whether � the two users are the same. Terefore, the analysis of user �(��)=∏�(��)=�(�1)�(�2)⋅⋅⋅�(��) (4) behavior information for multiuser identifcation is ideal in �=1 this context. Terefore, the Bayesian classifcation model constructed 3. Data Preprocessing is as follows: �(�|�) 3.1. Filling Missing User Data. Data flling is commonly � applied to user profle data processing. When a user registers �(� |�)�(� |�)⋅⋅⋅�(� |�)�(�) (5) an account, data may be missing for various reasons such = 1 2 � �(� )�(� )⋅⋅⋅�(� ) as, e.g., privacy protection. Terefore, an appropriate flling 1 2 � method should be adopted for each diferent type of data from each dimension, as follows: Te statistical results of the training set can be used to (1) Similarity flling: flling in user data by utilizing the derive the probability required for the calculation in the relational degree [29] between other users and users with Bayesian classifcation model. Te prediction results of the missing data. For example, user � and user � are friends, and corresponding attributes can be obtained via this model. they will generate social behaviors such as comments, reposts, and thumb up on social networks, where the comments 3.2. User Data Similarity Calculation. In view of the problem indicate that the content posted by the friend on the social that the user data in the IoT has a diferent format, the user network is explained, reposts indicate that friends have data format needs to be generalized before the similarity similar interests, and thumb up indicates that they agree with betweeneachattributeinthisstudycanbecalculated;this the content posted by friends. Te relational degree between processed data is more suitable for similarity calculation. Te users is calculated through the behaviors of “comment ���”, relevant calculation methods are as follows: “repost ���”, and “thumb up ���” between users. Te three (1) Dice coefcient [31]: when calculating strings, they types of user behavioral information are sequentially assigned can be divided into two categories. When calculating the the weights “3”, “2”, and “1”. Select � users with the highest multivalued strings �� and ��, the sum of the two times relational degree and take the mode number for flling. If the of the intersection information and divided by the sum of 4 Wireless Communications and Mobile Computing

the elements of �� and �� yields the two strings of Dice Step 2. Calculatetheinversedocumentfrequency(IDF)of coefcients, which are calculated as follows: each word in the document; � � � � ��� ∩��� ������� (��,��)=2 � � (6) � � � � � ��� = log ( ) (10) ���� + ���� �+1 Example. In two multivalued attribute strings “vivid music movie” and “movie travel”, the intersection information is where � denotes the total number of documents in the {“movie”}, so the similarity is 2/5=0. 4. corpus, � denotes the number of documents containing a For single-valued attribute strings, the Dice coefcient is word in the document, and 1 is added to avoid cases in which calculated as above, except that the intersection information the denominator is 0. is diferent. Step 3. Calculate the TF-IDF of each word in the document; Example. For the single-valued strings “johe” and “joh”, the intersection information is “jo, oh”, so the similarity is 4/5 = � � 0.8. �� − ��� = �� × ��� = × log ( ) (11) (2) Levenshtein distance [32]: the number of character � �+1 edit steps required to calculate the equality of two strings is used as an operational cost to measure the diference between Step 4. Select keywords in each document to construct a term strings. Te formulae for calculating the similarity of strings frequency vector for calculating similarity. �� and �� are as follows: Step 5. Calculate the similarity value by cosine similarity. �(�,� ) ������� (� ,� )=1− � � � � � � � � (7) (5) User blog data similarity calculation: frequent item (�� � , �� �) max � �� � �� sets of user blog data are mined by means of frequent pattern miningtocalculateusersimilarity.Duetothediference where �(��,��) denotes the distance between the strings �� in the amounts of user-published content, the one-item set and �� and max (|��|, |��|) denotes the maximum value of is also used as a calculation indicator in this study. “1” is characters contained in the strings �� and ��. added to avoid a high-frequency item set in the calculation (3) Cosine similarity [33]: this is mainly used to cal- of similarity. Te formula is as follows: culate the vector composed of user attributes. Assuming that � and � are two �-dimensional vectors, such that � is [� ,� , ..., � ] [� ,� , ..., � ] � 1 2 n and B is 1 2 n , then the cosine value �i ��� = ∑ ((1 + ��� )×(1+��� )) � � � i i (12) of angle between and is the similarity value between � ∈�∩� vectors. Te formula is as follows: i � ∑�=1 �� ×�� cos �= where �� � denotes the support degree count of the frequent √ � 2 √ � 2 (8) � ∑�=1 �� × ∑�=1 �� item set �� of user �, ��� denotes the support degree count � � � � of the frequent item set � of user ,and �� denotes the item Te closer the angle between two vectors is to 1, the higher set number of ��. Te similarity threshold is set at 5,000 based the similarity between two users is. Te closer the angle on historical data. If ��� > 5000, return “1”; otherwise, return between the two vectors is to 0, the smaller the cosine value “0”. oftheincludedangleisandthelowertheusersimilarityis. (6) State timestamp similarity calculation: the time points By comparing the size of cosine values, it can be determined of users’ publishing status also have certain personalized whether the two accounts are identical. characteristics. Te average dynamic number can be obtained (4) Term frequency-inverse document frequency (TF- by dividing the dynamic number generated by users in a IDF)[34]:thisismainlyusedtomeasuretheimportanceof certain period of time by the total dynamic number. Te aver- a certain word in the document and is ofen used to deal age dynamic number is then used to form a user timestamp withmultiwordattributefeldssuchaspersonalprofles.Te vector of 24 dimensions. Te similarity is calculated; users are specifc steps are as follows. determined to be the same user when Sim<0. 1 according to the statistical results. Te formula is expressed as Step 1. Calculate the term frequency (TF) of each word in the document; 24 � � � ��� (�, �) = ∑ �� −� � �� = (9) � � �� ��� (13) � �=1

where � denotes the number of occurrences of a certain word and � denotes the total number of words in the where ���, ��� denote the average dynamic number of the �th document. time period of users � and �. Wireless Communications and Mobile Computing 5

250 120 100 200 80 150 60 100 40 Number of users of Number Number of users of Number 50 20

0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

same user same user diferent user diferent user (a) URL distribution map (b) User name distribution map 60

40

20 Number of users of Number

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

same user diferent user (c) Interests distribution map

Figure 1: User attribute similarity distribution.

4. Multiuser Identification Method of user attributes can be obtained. As diferent user attributes have diferent infuences on the degree of user recognition, 4.1. Building User Similarity Vectors. Research and analysis of itisnecessarytocalculatetheweightoftheattributeitems. user data in the Internet of Tings (IoT) context can assist Figure 1 shows the performance of a single attribute in user in meeting people’s network needs. However, some malicious identifcation. It can be clearly seen from Figure 1 that the users will employ the user data to attack normal users via URL and user name have diferent distributions of similar- the IoT. Terefore, it is necessary to identify and analyze IoT ity between the same user and diferent users when user users. matching is performed. As these attribute items are highly In this study, the profle information and behavior distinguishable, the weight allocation should be relatively information of user data are utilized to achieve multiuser large. When users match in terms of their interests, the identifcation. Afer flling in user profle information, the similarity distribution between the users is small, meaning precision of user identifcation can be increased to a certain thattheweightdistributionshouldalsobesmall.Again,as extent. User behavior information has the characteristic of each attribute has diferent efects on user identifcation, it is being personalized, which allows for highlighting of the user’s therefore necessary to assign corresponding weights to each own behavior habits and is thereby conducive to the improve- attribute. ment of user identifcation precision. A reasonable similarity calculationmethodisusedforthedataofeachdimension 4.3.WeightAllocationAlgorithm.Afer the user data is of the user. Te data for each dimension is provided with a preprocessed, multiple user attributes are determined. When threshold value when calculating similarity. Afer comparing determining the weighting coefcient of the similarity judg- the calculated similarity of attributes with the set threshold, ment of each attribute in the user data, the traditional expert qualifed results return “1”; otherwise, return “0”. Tus, user subjective weighting method encounters the problem of poor similarity vectors composed of “0” and “1” can be formed and robustness, while the objective weighting method relies too used for the input of subsequent algorithms. much on the existence of a large amount of sample data, which is poor in universality. Terefore, this article proposes 4.2.WeightAnalysisofUserAttributes.By calculating the the posterior probability-based information entropy weight similarity between user attributes, the whole similarity vector allocation algorithm. 6 Wireless Communications and Mobile Computing

� Input: Source network account user data information vector ��,userdatavector{��}�=1 for all accounts in the target network, user data vector �� to be matched account in the target network Output: Te fnal similarity ������ =��(�)�(��,��) between the two accounts �� and �� � 1: foreach �� in {��}�=1 2: for i=1 to n �� �� �� 3: Calculate the similarity �(��,��)=(V1 , V2 , ..., V� ) of account A and B by using formula (6) (7) (8) (11) (12) (13) 4: end 5: for i=1 to n 6: Te attribute weights of the user data are assigned using equation (15) 7: end 8: Calculate the fnal similarity ������ =��(�)�(��,��) between the two accounts �� and �� 9: return ������

Algorithm 1: User data similarity calculation.

In information theory, the entropy value refects the 4.4. Random Forest Confrmation Algorithm Based on degree of information disorder. Te smaller its value is, Stable Marriage Matching themoreorderlytheinformationis,andthemorevaluable this attribute is; on the contrary, the more disordered the 4.4.1. Similarity Score. In order to improve the efciency of information is, the lower the value of this attribute is. the similarity calculation between users, this study adopts a Terefore, information entropy can be used to evaluate stable marriage matching algorithm to perform the many-to- the efectiveness of the attributes used. According to the many matching calculation. Te overall similarity of users is defnition of information entropy, for any random variable, evaluated by means of the similarity score of user matching. the formula is as follows: Te relevant formula is as follows: 20 �� � (�) =−∑ � (�) � (�) ����� = ∑��V (16) log (14) � �∈� �=1

where Score denotes the fnal Score of the match, �� denotes �(�) �� where is the possible value probability for the attribute. the weight of the �th attribute of the user, and V� denotes the In order to make the probability description of attributes similarity of the �th attribute of user A and user B. Te higher more precise, more efective weights are assigned to each theScoreis,themorelikelyitisthesameuser. attribute. On the basis of information entropy, the posterior probability of user attributes is further calculated, which 4.4.2. Stable Marriage Matching Algorithm. Te scoring for- aids in improving the precision of user identifcation. By mula can evaluate the overall similarity of two users based on combining the posterior probability and information entropy, user data information. Te higher the score, the more likely the attribute weight of the user account is �(�),suchthat it is that the two users are the same user. Te stable marriage matching algorithm uses the similarity score between users � (�) =−�(� |�)∑ � (�) (� (�)) � log (15) toselectcandidatematchingpairs.Ifwecalculateuserdata �∈� for all accounts, then the computational complexity will be high. Terefore, it is necessary to obtain V� by fltering the target account in another network according to the condition where �(�� |�)is the posterior probability of the attribute. Te user data information contains � attribute items. C: flter accounts by username. Te specifc steps involved are � � � as follows. Te data information vector is �� =(�1,�2, ..., ��),where �� (� = (1, 2, ..., �)) � � represents the th attribute informa- Step 1. Each user on social network M and the user on social tion of account �. Te user similarity vector is defned as �� �� �� �� network N are matched by scoring formula. �(��,��)=(V1 , V2 , ..., V� ),whereV� represents the similarity between the �th attribute of user �� and user ��’s Step 2. Te user on M is matched with the user on N who attribute information. If the similarity exceeds the threshold, ranks frst according to the score. If the user on N has already output “1”; otherwise, output “0”. Accordingly, the user sim- matched other users, the user will compare the user who ilarity vector is a vector composed of “1” and “0”. Terefore, has already matched himself with the user who is requesting the fnal user similarity vector is ������ =��(�)�(��,��).Te matching with himself. Finally, the user with the highest score process is summarized in Algorithm 1. will be selected as the other half of the matching pair. Wireless Communications and Mobile Computing 7

Test data set X

Training Classifcation data set result C1(X) D1 Decision tree 1 Training Classifcation data set result C2(X) D Training 2 Voting for the Final Bagging Decision tree 2 best classifcation data set D classifcation result

Training Classifcation data set result CK(X) DK Decision tree k

Figure 2: Random forest model construction process.

Step 3. Afer Step 2 is complete, some users will still fail to Step 1. Acquire the training set. be matched successfully. A user who does not match will (a) Te original input training data set D comprises 20 be matched with the highest ranked user among all users prediction attributes and a classifcation label �.Te20pre- who have not rejected themselves, afer which Step 2 will be dictionattributesaretheusersimilarityvector�(��,��)= �� �� �� repeated until all users match. (V1 , V2 , ..., V� ) obtained in the above, while the classifca- tion label is � = (1 ‖ 0). A class label of “1” means that the 4.4.3. Random Forest Algorithm. Users through the stable users are the same, while “0” means that they are not the same. marriage matching algorithm output is a matching pair. (b) Te original training data set D is sampled by random However, such results cannot be used directly, as it would be sampling with K playback times via the Bagging method, and easy to obtain poor matches if this was the case. In order to anewtrainingsubsetUofKisobtained. solve this problem, a second confrmation of random forest is established in order to eliminate the negative infuence of Step 2. Generate the decision tree. wrong matching results on the fnal results as far as possible. (a) Te number of prediction attributes in the training Tere are many algorithms based on supervised learning, sample is 20, and �=√20 ≈ 5 attributes are randomly including logistic regression, SVM, Adaboost, etc. Random selected from the 20 prediction attributes to form a random forest is selected as the fnal quadratic confrmation algorithm feature subspace ��, which is the split attribute set of the in this study for the following reasons. current node of the decision tree. During the generation of (1) Tere are 20 data dimensions used in this study, the random forest model, F remains unchanged. among which there may be linear correlation dimensions. (b) According to the decision tree generation algorithm, Te dimension of linear correlation not only plays no positive each node is split by selecting the optimal split attribute from role in the training of the supervised learning model, but also the random feature subspace ��. impacts the efect of other nonlinear correlation dimensions. (c) Each tree grows completely without pruning. Finally, In general machine learning model training, data dimension- according to each training set ��, the corresponding decision ality reduction will be processed, and data dimensionality tree is generated as ℎ�(��). reduction is a tedious process. However, random forests are (d) Combine all the generated decision trees together to not sensitive to multicollinearity, and the results are robust to generate a random forest model {ℎ1(�1), ℎ2(�2), ..., ℎ�(��)}. missing and unbalanced data. Each test tree ℎ�(��) is tested using the test set sample (2) While overftting is always discussed in machine X to obtain a corresponding classifcation result learning, it is not easy for the random forest model to pro- {�1(�), �2(�),...,��(�)}. duce the overftting phenomenon owing to the randomness (e) Using the plurality voting method, according to the involved. classifcation result output by the K-tree decision tree, the Figure 2 provides an overview of the construction process classifcation result with a large number of decision trees is of the random forest validation model. Te specifc steps are used as the fnal classifcation result corresponding to sample as follows. Xofthetestset. 8 Wireless Communications and Mobile Computing

Begin

User similarity Stable marriage vector training Score model matching set

Generate a random forest model

Random forest Candidate confrmation matching pairs

Final match pairs

End

Figure 3: Random forest confrmation based on stable marriage matching.

Figure 3 describes the process of RFCA-SMM. Te input Input: V data of the algorithm is the user attribute data in the IoT. By To be matched account �, Candidate matching V � using the stable marriage matching algorithm in combination account ����ℎ, the set of accounts � that have not been � with the scoring formula, the overall similarity between users matched in the social network M, the set of accounts � that can be obtained in order to select the candidate matching have not been matched in the social network N Output: Te fnal match pairs set R pairs with the user similarity vector training set as input. 1: R=� Te random forest model is constructed, and the candidate 2: Initialize unmatched queue matches obtained are used as input data to confrm and 3: while �� =�̸ and �� =�̸ do identify in the random forest. If the identifcation result 4: if V�=NULL then for the candidate matching pair is not the same user, the 5: V�=UserSelect(��,��) candidate matching pair is marked as “unmatched”; by 6: V����ℎ=UserMatch(V�,��,��) contrast, if the candidate match pair contains two instances 7: end if ofthesameuserintherandomforestconfrmation,thefnal 8: (V�,V����ℎ)= Secondary confrmation (V�,V����ℎ,R) match result is generated. Te algorithm fow of RFCA-SMM 9: end while is represented by Algorithm 2. 10: pruning process 11: return R 5. Analysis of Experimental Results Algorithm 2: RFCA-SMM. In order to verify the efectiveness of the proposed algorithm, [35] provides fve open datasets of foreign mainstream social networks. AUC: the area under the Receiver Operating Character- In this study, precision rate, recall rate, F-measure (F1), istic(ROC)curveisdirectlycalculated.TeROCcurveis andAUCareusedasevaluationcriteria.Terelevantformu- defned as the X-axis by the False Positive Rate (FPR), while lae are as follows: theTruePositiveRate(TPR)isdefnedastheY-axis.Te formulae for these two values are as follows: �� ��������� = �� �� + �� (17) ��� = �� + �� (20) �� ������ = �� �� + �� (18) ��� = �� + � � (21) 2 × ��������� × ������ �1 = where �� denotes the number of the same users that are ��������� + ������ (19) correctly matched, �� denotes the number of users that are Wireless Communications and Mobile Computing 9

Table1:Comparisonofseveraltypesofsupervisedlearning. Table 2: Comparison of RFCA-SMM and RCM results.

Algorithms Precision Recall F1 AUC Algorithms Precision Recall F1 AUC Random RFCA-SMM 0. 962 0. 871 0. 914 0. 961 0. 961 0. 881 0. 919 0. 961 Forest RCM 0. 934 0. 875 0. 904 0. 912 SVM 0. 950 0. 860 0. 903 0. 945 Logistic 0. 935 0. 900 0. 917 0. 965 matching pairs, which decompose the seed user’s identif- Adaboost 0. 910 0. 870 0. 890 0. 900 cation into a step-by-step iterative process. In the iteration process of each step, the calculation process of the algorithm 1 is divided into three substeps: account selection, account matching, and cross matching. Accumulate the results of each iteration to form a result set. Te advantage of this algorithm 0.9 istocomparetheresultsobtainedeachtimeandselectthe user account with a high score as the fnal result. However, the precision of the RCM algorithm is largely afected by Values 0.8 the number of seed users (that is, the number of users who are known to match pairs). If it is not possible to know in advance which accounts are the same user (that is, there is 0.7 untagged identity match), then the RCM algorithm needs to Precision Recall F1 AUC be improved in terms of precision. Since the algorithm for user identifcation in this study is untagged, the experimental Random Forest results of the two algorithms are analyzed using an unmarked SVM data set, as shown in Table 2 and Figure 5. Logistic It is clear from Figure 5 that the proposed RFCA-SMM Adaboost algorithm is superior to the RCM algorithm in terms of precision,F1,andAUCinthisstudy,althoughitsperfor- Figure 4: Comparison of several types of supervised learning. mance is slightly lower than that of the RCM algorithm in terms of recall rate. Te reason is that the proposed algorithm performs user-generated data processing, user unmatched and not the same, �� denotes the number of users attribute weight distribution, and secondary confrmation that are matched but are not the same, and �� denotes the based on supervised learning compared with the RCM number of users that are not matched but are the same users. algorithm.ItcanbeseenfromFigure5thatthefnaluser matching results of the proposed algorithm achieve some improvement in the evaluation index compared with the 5.1. Comparison of Random Forest Model and Other Super- RCM algorithm, mainly because the proposed algorithm vised Learning Models. Tis article uses the random forest (unlike RCM) does not take the social network structure into supervised learning model for the confrmation of candidate account [36]. In summary, the algorithm proposed in this matching pairs. In order to verify the efectiveness of the study improves the precision of user identifcation to a great proposed algorithm in obtaining the matching results, the extent. random forest model and other supervised learning models are analyzed with reference to the evaluation indicators outlined above. Te ratio of training set data to test set data 6. Conclusions is 3:1 and the number of users is 1000 pairs. Te results are Te key features of 4G/5G technology, namely, low energy presented in Table 1 and Figure 4. consumption and low delay, have laid the foundation for ItcanbeseenfromFigure4thatthesesupervisedlearning the development of the Internet of Tings (IoT). Given the algorithms have relatively good results; among them, the various other advantages of this technology, it can efectively best performing algorithms are Random Forest and Logistic. promote the rapid development of the IoT industry chain. Random Forest performs slightly better in precision rate and Since most of the information in the IoT is related to F1, while Logistic performs slightly better in recall rate and private user data, we propose a random forest confrmation AUC. However, considering the modeling scenarios of these algorithm based on stable marriage matching (RFCA-SMM). two supervised learning algorithms, Random Forest has an Te candidate matching pairs are obtained by combining the advantage over Logistic. Terefore, the efectiveness of the stable marriage matching algorithm with the scoring model. random forest model is also proven. In order to demonstrate the efectiveness of the random forest model, we analyze the random forest model and several other 5.2. Comparison of RFCA-SMM and RCM Algorithm Results. supervised learning models on the evaluation indicators and Tis section presents a comparative analysis of the RFCA- fnally the second confrmation of candidate matching pairs SMM and Ranking-based cross-matching (RCM) algorithms. via the random forest. Moreover, we conduct a comparative Te purpose of the RCM algorithm is to accurately fnd more analysis of RFCA-SMM and RCM. Te experimental results 10 Wireless Communications and Mobile Computing

1

0.95

0.9

0.85 Values 0.8

0.75

0.7 Precision Recall F1 AUC

RFCA-SMM RCM Figure 5: Comparison of RFCA-SMM and RCM results.

show that the proposed algorithm can provide excellent [5]J.Liu,F.Zhang,X.Song,Y.Song,C.Lin,andH.Hon,“What’s performance with precision rate, F1, and AUC reaching in a name?: an unsupervised approach to link users across 96.2%, 91.4%, and 96.1%. communities,”in Proceedings of the the Sixth ACM International Conference, pp. 495–504, Rome, Italy, 2013. [6] R. Zafarani and H. Liu, “Connecting users across social media Data Availability sites: a behavioral-modeling approach,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Tedatausedtosupportthefndingsofthisstudyare Discovery and Data Mining (KDD),pp.41–49,USA,2013. available from the corresponding author upon request [7] O. Goga, D. Perito, H. Lei, R. Teixeira, and R. Sommer, “Large- scale correlation of accounts across social networks,”Tech-nical Conflicts of Interest report, 2013. [8] K. Cortis, S. Scerri, I. Rivera, and S. Handschuh, “An ontology- Te authors declare that there are no conficts of interest based technique for online profle resolution,” in Social Infor- regarding the publication of this paper. matics, Lecture Notes in Computer Science, pp. 284–298, Springer International Publishing, Berlin, Germany, 2013. Acknowledgments [9] P. Jain, P. Kumaraguru, and A. Joshi, “@ i seek “f. me”: Identifying users across multiple online social networks,” in TisresearchissupportedbyNationalNaturalScienceFoun- Proceedings of the the 22nd International Conference on World dation of China (Grant no. 61771185, Grant no. 61772175, and Wide Web Companion, pp. 1259–1268, Rio de Janeiro, Brazil, 2013. Grant no. 61801171), Science and Technology Research Project ofHenanProvince(Grantno.182102210044andGrantno. [10]O.Goga,H.Lei,S.H.K.Parthasarathi,G.Friedland,R. Sommer, and R. Teixeira, “Exploiting innocuous activity for 182102210285), Key Scientifc Research Program of Henan correlating users across sites,” in Proceedings of the 22nd Inter- Higher Education (Grant no. 18A510009), and Postdoctoral national Conference on World Wide Web (WWW),pp.447–457, Science Foundation of China under Grant 2018M632772. 2013. [11] X. Kong, J. Zhang, and P. S. Yu, “Inferring anchor links across References multiple heterogeneous social networks,” in Proceedings of the the 22nd ACM international conference (CIKM), pp. 179–188, [1] Y. Zhang, J. Tang, Z. Yang, J. Pei, and P. S. Yu, “COSNET: San Francisco, Calif, USA, 2013. connectingheterogeneoussocialnetworkswithlocalandglobal [12] J. Haupt, B. Bender, B. Fabian, and S. Lessmann, “Robust consistency,”in Proceedings of the 21st ACM SIGKDD Conference identifcation of email tracking: a machine learning approach,” on Knowledge Discovery and Data Mining (KDD),pp.1485– European Journal of Operational Research,vol.271,no.1,pp. 1494, 2015. 341–356, 2018. [2] X. Zhou, X. Liang, H. Zhang, and Y. Ma, “Cross-platform [13] Y. Li, Z. Zhang, Y. Peng, H. Yin, and Q. Xu, “Matching identifcation of anonymous identical users in multiple social user accounts based on user generated content across social media networks,” IEEE Transactions on Knowledge and Data networks,” Future Generation Computer Systems,vol.83,pp. Engineering, vol. 28, no. 2, pp. 411–424, 2016. 104–115, 2018. [3] M. M. Mostafa, “More than words: social networks’ text mining [14] A. Narayanan and V. Shmatikov, “De-anonymizing social net- forconsumerbrandsentiments,”Expert Systems with Applica- works,” in Proceedings of the 30th IEEE Symposium on Security tions,vol.40,no.10,pp.4241–4251,2013. and Privacy, pp. 173–187, IEEE, Berkeley, Calif, USA, 2009. [4] T. Tuna, E. Akbas, A. Aksoy et al., “User characterization for [15]S.Bartunov,A.Korshunov,S.Park,W.Ryu,andH.Lee, online social networks,” Social Network Analysis and Mining, “Joint link-attribute user identity resolution in online social vol.6,no.1,p.104,2016. networks,” in Proceedings of the 6th SNA-KDD Workshop,2012. Wireless Communications and Mobile Computing 11

[16] Y.Cui, J. Pei, G. Tang, W.-S.Luk, D. Jiang, and M. Hua, “Finding [31] G. Kondrak, D. Marcu, and K. Knight, “Cognates can improve email correspondents in online social networks,” World Wide statistical translation models,” in Proceedings of the Conference Web,vol.16,no.2,pp.195–218,2013. of the North American Chapter of the Association for Computa- [17] S. Liu, S. Wang, F. Zhu, J. Zhang, and R. Krishnan, “HYDRA: tional Linguistics, pp. 46–48, Edmonton, Canada, 2003. large-scale social identity linkage via heterogeneous behavior [32] L. Yujian and L. Bo, “Anormalized Levenshtein distance metric,” modeling,” in Proceedings of the ACM SIGMOD International IEEE Transactions on Pattern Analysis and Machine Intelligence, Conference on Management of Data (SIGMOD), pp. 51–62, USA, vol.29,no.6,pp.1091–1095,2007. 2014. [33] H. Nguyen and L. Bai, “Cosine similarity metric learning for [18] N. Korula and S. Lattanzi, “An efcient reconciliation algorithm face verifcation,” in Asian Conference on Computer Vision, for social networks,” in Proceedings of the VLDB Endowment, vol. 6493 of Lecture Notes in Computer Science, pp. 709–720, vol. 7, pp. 377–388, 2014. Springer, Berlin, Germany, 2010. [19]S.Tan,Z.Guan,D.Cai,X.Qin,J.Bu,andC.Chen,“Mapping [34] K. S. Jones, “A statistical interpretation of term specifcity and users across networks by manifold alignment on hypergraph,” its application in retrieval,” Journal of Documentation,vol.28, in Proceedings of the 28th AAAI Conference on Artifcial Intelli- no. 1, pp. 493–502, 2004. gence (AAAI),vol.14,pp.159–165,2014. [35] M. Yan, J. Sang, and C. Xu, “Unifed video recommen- [20] H. Kobayashi and T. Matsui, “Successful manipulation in dation via cross-network collaboration,”in Proceedings of the the stable marriage model with complete preference lists,” IEICE 5th ACM on International Conference on Multimedia Retrieval, Transaction on Information and Systems,vol.92,no.2,pp.116– pp.19–26,NewYork,NY,USA,2015. 119, 2009. [36] G. Rossetti and R. Cazabet, “Community discovery in dynamic [21] E. Raad, R. Chbeir, and A. Dipanda, “User profle matching networks: a survey,” ACM Computing Surveys (CSUR),vol.51, in social networks,” in Proceedings of the 13th International no.2,pp.1–37,2018. Conference on Network-Based Information Systems (NBiS),pp. 297–304, 2010. [22] Y. Na, Z. Yinliang, D. Lili, B. Genqing, E. Liu, and G. J. Clapworthy, “User identifcation based on multiple attribute decision making in social networks,” China Communications, vol.10,no.12,pp.37–49,2013. [23] F. Abel, E. Herder, G.-J. Houben, N. Henze, and D. Krause, “Cross-system user modeling and personalization on the social web,” User Modeling and User-Adapted Interaction,vol.23,no. 2-3, pp. 169–209, 2013. [24]G.Wang,X.Zhang,S.Tang,H.Zheng,andB.Y.Zhao,“Unsu- pervised clickstream clustering for user behavior analysis,” in Proceedings of the the 2016 CHI Conference on Human Factors in Computing Systems, pp. 225–236, San Jose, Calif, USA, 2016. [25]K.A.Alam,R.Ahmad,andK.Ko,“Enablingfar-edgeanalytics: performance profling of frequent pattern mining algorithms,” IEEE Access,vol.5,no.99,pp.8236–8249,2017. [26] W.Cao,Z.Wu,D.Wang,J.Li,andH.Wu,“Automaticuseriden- tifcation method across heterogeneous mobility data sources,” in Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE), pp. 978–989, IEEE, 2016. [27]T.Hao,J.Zhou,Y.Cheng,L.Huang,andH.Wu,“User identifcation in cyber-physical space: a case study on mobile query logs and trajectories,” in Proceedings of the the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1–4, Burlingame, Calif, USA, 2016. [28] X. Han, L. Wang, S. Xu, G. Liu, and D. Zhao, “Linking social network accounts by modeling user spatiotemporal habits,” in Proceedingsofthe15thIEEEInternationalConferenceon Intelligence and Security Informatics (ISI), pp. 19–24, IEEE, 2017. [29] X. L Li, Y. L. Han, D. Y. Zhang, and X. G. Xu, “An evalu- ation algorithm for importance of dynamic nodes in social networks based on three-dimensional grey relational degree,” in Proceedings of the International Conference of Pioneering Computer Scientists, Engineers and Educators,pp.201–212, Springer,Singapore,2018. [30] M. Almishari and G. Tsudik, “Exploring linkability of user reviews,” in Computer Security – ESORICS 2012,vol.7459of Lecture Notes in Computer Science,pp.307–324,SpringerBerlin Heidelberg, Berlin, Heidelberg, Germany, 2012. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 5121054, 14 pages https://doi.org/10.1155/2019/5121054

Research Article Threat Assessment for Android Environment with Connectivity to IoT Devices from the Perspective of Situational Awareness

Mookyu Park ,1 Jaehyeok Han,1 Haengrok Oh,2 and Kyungho Lee 1

1 School of Information Security, Korea University, Seoul 02841, Republic of Korea 2Agency for Defense Development (ADD), Seoul 05771, Republic of Korea

Correspondence should be addressed to Kyungho Lee; [email protected]

Received 18 January 2019; Revised 13 March 2019; Accepted 25 March 2019; Published 11 April 2019

Academic Editor: Ilsun You

Copyright © 2019 Mookyu Park et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

As smartphones such as mobile devices become popular, malicious attackers are choosing them as targets. Te risk of attack is steadily increasing as most people store various personal information such as messages, contacts, and fnancial information on their smartphones. Particularly, the vulnerabilities of the installed operating systems (e.g., Android, iOS, etc.) are trading at a high price in the black market. In addition, the development of the Internet of Tings (IoT) technology has created a hyperconnected society in which various devices are connected to one network. Terefore, the safety of the smartphone is becoming an important factor to remotely control these technologies. A typical attack method that threatens the security of such a smartphone is a method of inducing installation of a malicious application. However, most studies focus on the detection of malicious applications. Tis study suggests a method to evaluate threats to be installed in the Android OS environment in conjunction with machine learning algorithms. In addition, we present future direction from the cyber threat intelligence perspective and situational awareness, which are the recent issues.

1. Introduction Communication (NFC) smart key application in 2019 to control the vehicle through smartphones [5]. Te Internet of Tings (IoT) technology is being applied to Te combination of IoT and AI technology is making various felds such as public safety, transportation, industrial, cyberspace’s infuence expand to the real world. Te cyber- and healthcare. Te expansion of this technology contributes attack in the past occurred only in cyberspace, and the to the quality of human life. Te Gartner report predicts damage was less likely to occur in the environment where it that the IoT market will grow at an average annual rate is not directly related to it [6]. However, the damage caused of 28.8 percent from about $ 300 billion in 2015 to more than $ 1 trillion in 2020 [1]. In addition, the report pub- by the cyber-attack is currently spreading from damage to lished in 2019 predicts that this IoT technology will be sofware and hardware, causing a wide range of secondary combined with artifcial intelligence technologies to develop damage such as leakage of personal information, revision into “Intelligent Tings”. With the expansion and evolution of direction of national policy, and manipulation of public of these IoT technologies, many companies are developing opinion. A typical threat is a mobile , such as application services that can connect their IoT products and . Tirty-fve percent of attackers use these methods smartphone platforms [2]. Typically, Apple developed a “W1” of attack, and the main attack targets include personal chip to build the IoT ecosystem and began embedding it and confdential information of high-ranking government in IoT products and the iPhone [3]. Samsung is making ofcials. In addition, RSA’s Current State of , eforts to install “Tizen”, a proprietary , on published in 2016, claimed that the mobile fraud targeting smartphones, smart cars, and smart home appliances [4]. fnancial information in 2015 increased about 173% from 2013 In addition, Hyundai plans to mass-produce the Near Field [7]. 2 Wireless Communications and Mobile Computing

Global security companies have recently stated that the threats related to these IoT devices can be classifed into possibility of a cyber-attack on mobile devices is increas- data leakage, unsecured Wi-Fi, network spoofng, ing. Tey argued that North Korea is concentrating its attacks, spyware, etc. cyber-attacks on mobile environments that are vulnerable to security. Palo Alto Networks announced that the attack Data Leakage. Mobile applications ofen cause unintended target of malicious application disguised at Google Play data leakage. Currently used applications are granted full Market is Samsung’s smartphone users using Korean. McAfee privileges by the user, but the security of this application is pointed out the “Lazarus”, which is supposed to be a hacker not perfect [9, 10]. Tey tend to be free of charge in the ofcial organization in North Korea, behind the attacks of malicious App Store. Such information leakage may be prevented Android applications disguised as biblical appli- through text data. However, about 51.6% of the programs have cations. In addition, South Korea’s security companies are the risk of information leakage during Android applications, behind a spy app that seized information, text messages, and and the attacker can access them through the access token contact information on about 10 high-ranking ofcials in the [11].Tatis,whenausertransmitsdatatoanIoTdevice national defense, diplomacy, and security felds in 2016 [8]. through a smartphone, privacy information may be exposed Te reason for this increase is that personal information or to an attacker. fnancial information is ofen stored on smartphones, and smartphones are at the center of connection and control of Unsecured Wi-Fi. In most public places (e.g., airports, parks, IoT devices. etc.), users can use wireless hotspots, but tend to use free Changes in the way in which personal information is Wi-Fi networks. However, these free Wi-Fi generally do stored and the efcient use of IoT equipment are reasons not provide secure security. N. Sombatruang et al. have to increase the frequency of attacks on smartphones. Tese confrmed that personalized photos, e-mail, documents, and mobilethreatscancausedamagetobothcyberspaceand login credentials are transmitted without encryption in pack- the real world. To provide situational awareness to decision- ets on mobile devices that are connected to unsecured Wi-Fi makers, it is necessary to extend threat detection for a mobile [12]. If the smartphone is storing data transmitted from IoT malicious application to evaluate threats. Tis study proposes devices, it shows that information could be leaked through a method of assessing threat based on the extracted features thefakeWi-Fi. through Android malware detection using basic machine learning using the risk model, factor analysis of information Network Spoofng. Network spoofng means that an attacker risk (FAIR). Tis paper describes the relationship between sets up a rogue access point in a public place like a cofee mobile threats and cyberspace, threat assessment, and limi- shop, library, or an airport. Tis method of attack ofen tation of detection in the next sections. In allowsattackerstogiveusersacommonnamebyassigning addition, this paper proposes frames from the threat assess- a common name to access points such as “Free Airport Wi- ment procedure from the perspective of situational awareness Fi”or“CofeeHouse”.Manyuserscreatean“account”and section, and the results are described in the result of threat enter a password to access the free service, at which time users assessment for Android malware application section. tend to use the same e-mail and password combination for multiple services [13]. Tis could allow an attacker to expose access to e-mail, e-commerce, and IoT equipment. 2. The Relationship between Mobile Threats and Cyberspace Phishing Attacks. Mobile devices are always exposed to phishing attacks because they are almost powered on. Mobile With the development of IoT adventure, the border between devices are more likely to be a threat than desktops. Even if cyberspace and the real world disappears, and the frequency they receive e-mail, they are less likely to receive warnings of cyberattack has a negative efect on the real space. In through security programs. Such an attack can provide a path general, cyberspace is defned as a virtual space that enables for malware to enter the user’s mobile device [14, 15]. Tis communication of the information environment by over- may provide the attacker with the rights to the mobile device coming the temporal and spatial limits of reality through a and the IoT device connected to the device or the personal virtual network environment composed of electronic devices information stored in the device. and electronic spectrum. Tis environment is interdependent with the Internet, network, embedded processing, human, Spyware. Mobile spyware is an application that monitors society, and policy, so the damage from cyber-attacks is likely and records user’s privacy and personal information without to expand in the future. user’s permission. Tis spyware is installed when a user installs another application, redirects to a malicious website, 2.1. Mobile Treats to IoT Devices. Mobile devices such as or physically unlocks the computing device. When a mobile smartphone tablet PCs can be used as access roads to IoT device is infected with spyware, it can eavesdrop on nearby equipment. In recent years, many companies have named conversations or access data stored or transmitted to the IoT devices as smart devices and are controlling them device. For example, if there is a smartphone near the through applications installed on smartphones. Tat is, if the keyboard,itispossibletodetectwhatwasenteredintothe smartphone is vulnerable to security, data of IoT equipment computer through the accelerometer sensor of the smart- storing personal privacy information may be leaked. Mobile phone [16]. In addition, an application having a backdoor Wireless Communications and Mobile Computing 3

Cyber Eco-System

Government Layer Govt, Cyber Laws, Regulations, etc. Element of IoT Devices

Organization Layer Org, Cyber Policies, Procedures, etc. Service Damage to People Supervisory Layer People, Supervisory, Temporal individual or society Semantics Computation Real World Identification Communication Persona Layer User Ids, E-mails, Phone Numbers, etc. Sensing Soware Application Layer Browsers, Office Products, etc.

Operating System Layer Win, , Android, iOS, etc. Damage could occur by setting various Ability to set attack routes elements of IoT equipment as target of attack Machine Language Layer 010111010111010001011

Cyberspace using various layers

Mobile reat to IoT Devices Te expansion of the IoT environment makes environment IoT the of expansion Te Logical Layer OSI 7 Layers Data Leakage cyber threats more likely to be projected into real space real into be projected to likely more cyber threats Physical Layer Hardware, Cables (fiber, copper, etc.) Unsecured Wi-Fi Network Spoofing Phishing Attacks Geographic Layer Geographic Location & Dependencies Spyware

Figure 1: Relationship between the element of IoT devices and cyberspace.

may induce the use of push notifcation services to leak a consists of 15 layers including the cyberspace layer concept. user’s personal information [17]. Tese hierarchies suggested that the impacts in cyberspace could afect social or policy systems (see Figure 1). Tis means that the threat of cyberspace through mobile 2.2. Extending the Mobile Treats to the Real World with devices can be visualized through IoT equipment. Te reason IoT Devices. Te mobile threats can afect not only mobile why IoT equipment can be visualized as a threat is due to devices, but also people using IoT equipment and IoT the factors of IoT. IoT equipment consists of identifcation, technology. In other words, it means that the mobile device sensing, communication, computation, service, and seman- acts as a medium that connects cyberspace and real space. tics [21]. Tis concept can be explained through the cyberspace layer described in US Joint Publication (JP) 3-12R Cyber Opera- Identifcation. It is an explicit identifcation of each object in tion. According to this document, cyberspace is composed the logical layer and is provided by addressing the device, of a physical layer, a logical layer, and a persona layer. such as an electronic product code (EPC) and ubiquitous Te physical layer is a layer where physical devices such code, by naming the device or by addressing a specifc object. as IoT devices, routers, switches, and mobile devices are IPv6 is used to assign a unique address to each entity [22]. geographically and physically. Te logical layer is a layer Mobile devices could be exposed to threats if they are made where network or communication devices are located in the into fake access points with the name of such equipment. physical layer. Te persona layer is a layer that represents a human being in a virtual space through an electronic Sensing. It refers to the process of collecting information from service provided at the physical layer and logical layer and objects through RFID tags, smart sensors, and so on. Since corresponds to an IP address, an e-mail, and an ID [18]. most IoT devices are connected to the mobile device, if the Tese cyberspace layers are subdivided and expanded mobile device is exposed to spyware, phishing attacks, etc., into the basic concepts of physical layer, logical layer, and an attacker can gain access to private data [23]. Or when a persona layer. David Clark subdivided cyberspace into a user accesses unsecured Wi-Fi, the ID and password of the physical and a logical layer, as well as an information layer that IoT device may be exposed and illegally monitored or the represents the information fow in cyberspace and a top layer- personal information may be leaked to the outside. people layer that allows human decisions to be projected into cyberspace. Te characteristic of this concept is that the ser- Communication.ItisoneofthemainpurposesofIoT vice of cyberspace has a dependency between the layers, thus technology because various IoT devices are connected to each providing a basis for the threat of cyberspace to be refected in other. Tis element can send and receive messages, fles, and the real space [19]. Te US Department of Homeland Security other information stored on IoT devices and utilize com- has defned a cyber ecosystem that extends the concept of the munication technologies such as Near Field Communication cyberspace layer. Te cyber ecosystem is a notion that social (NFC), Bluetooth, Wi-Fi, and Long-Term Evolution (LTE) organizations such as private companies and governments [24]. However, if mobile devices connected to communi- as well as human beings construct virtual ecosystems by cation are exposed to threats such as spyware, there is a interacting with electronic devices and electronic services possibility that private information of IoT equipment articles such as hardware and sofware [20]. Tis concept refects connected to mobile devices may be exposed. the situation in which the gap between cyberspace and real space is decreasing due to the development of IoT and Computation. It is an element that uses sensors to perform AI (Artifcial Intelligence) technology. Te cyber ecosystem calculations on information collected from objects and is 4 Wireless Communications and Mobile Computing developed to perform processing in IoT applications. Typ- of perception, comprehension, and projection. “Perception” ically, this is the operating system. Mobile devices such recognizes the status and attributes of related elements in as smartphones are divided into Android and iOS. Te the current environment. “Comprehension” recognizes the vulnerabilities of OS are used for spyware or malware. In current situation through elements collected at the ”Percep- other words, when a smartphone with an IoT application tion”. ”Projection” assesses how the information analyzed in installed on a smartphone is attacked, IoT devices may also “Comprehension” will afect the state of the future operating be exposed to threats [25]. environment (see Figure 2) [28]. In cyberspace, the SA model is being developed through Service. It is provided by IoT applications and can be divided the basic concept of this Endsley’s model. J. Okolica et al. into four types. Te identity-related service is used to get proposed a cyber situational awareness model (CSAM) with the ID of the object that sent the request. An information business continuity planning (BCM). Tis model updates aggregation service aims to collect all information from an cyberspace assets or systems in real time and predicts future object. Te collaboration service makes decisions based on threats through sense, evaluate, and assess [30]. G.P. Tadda the information gathered and sends appropriate responses to and J.S.Salerno developed the Situational Awareness Refer- the devices. Finally, ubiquitous services are used to respond ence Model (SARM) to improve understanding of various to devices immediately, regardless of time and location [26]. data in cyberspace. Te SARM can actively respond to However, malicious applications such as these can be used for changing threats in real time [31]. N. Evancich et al. proposed data leakage and user surveillance. Efective Cyber Situational Awareness (ECSA) focusing on . ECSA is divided into three stages: “Network Semantics. It is an element that reduces the gap between Awareness”, “Treat Awareness”, and “Operational Aware- cyberspace and real space, ensuring the convenience of ness”. “Network Awareness” is a step where a decision maker users in data collection and utilization of IoT equipment. recognizes the characteristics of the network’s assets and Tis element is responsible for gathering information from security. Te “Treat Awareness” step detects an attack vector IoT equipment and making appropriate decisions to send a that has entered the network. “Operational Awareness” is a responsetothemobiledevice[27].Sincethedecision-making measure of damage to network operation capability due to power is on the human side, the mobile device is used to the threat. ECSA is an improved situational awareness model remotely control it, but if the mobile device itself is exposed than CSAM in decision-making, collaboration, and resource to threats, semantics can be exploited by an attacker. management [32]. Te SA models, which are being studied recently, seek 3. Threat Assessment to recognize changes in the current situation by detecting threats. From the viewpoint of SA, threat assessment is a link To respond to an attack, the manager performs a risk between policy and technology to utilize technologies such as assessment on the assets he holds. Te risk assessment the prediction of a future threat, prediction of attack path, and consists of threats, vulnerabilities, assets, and missions (or identifcation of countermeasure, as well as threat detection countermeasure). If the assets to be protected are clear, the in the perception of awareness stage [33]. Treat assessment administrator will focus on the threats associated with them. in cyberspace is an engineering methodology that detects, Recently, Cyber Treat Intelligence (CTI) has been used to identifes, and prioritizes cyber threats in order to apply counter APT attacks. However, in the case of a threat, an countermeasures that reduce vulnerabilities to cyber-attacks. attacker needs to organize the threat into tactics, techniques, Many kinds of research and developments have been made and procedures (TTP) because the attacker sets the attack in the evaluation of cyber threats. Te CyberPrep Working method through several diferent cases. Tis section describes Group has established cyber-aware enterprise transformation the relevance of threat assessment and situational awareness strategies that refect the understanding of APT attacks, orga- (SA) for evaluating cyber threat information and the Factors nizational responses, and cybersecurity investment strategies. Analysis of Information Risk (FAIR) model applied in this MITER’s center for resiliency experimentation developed study. cyber resiliency engineering to develop a methodology for processes, personnel, and individual systems that support 3.1. Treat Assessment in Situational Awareness. Situational resilience strategies and techniques for mission functions awareness (SA) is the process of recognizing an environmen- against cyber threats. In addition, this threat assessment tal element in a threat situation or the time and space where uses System/Acquisition Mission Assurance Engineering a specifc event occurs and establishing countermeasures. (SAMAE) applying System Development Lifecycle (SDLC) to Tese procedures are used as a framework to recognize, judge, analyze APT attack knowledge [34]. Tis threat assessment and respond to threats such as terrorism, security, and cyber includes the following elements in common. security. Te SA is used in command and control systems that are used for decision support in the military sector. (i) Identify and prioritize high-risk tactics, techniques, Te basic concept of SA extends from Endsley’s model. and procedures (TTP) that cyber-assets can be Te Endsley’s model is a process for understanding the afected perceptions of the environment, understanding the changes in the current situation, and foreseeing the consequences (ii) Identify and prioritize efective countermeasures of future projections. Te Endsley’s model has three phases against identifed TTP. Wireless Communications and Mobile Computing 5

Individual Factors Goals Task & Environmental Factors Preconceptions Workload Knowledge Stressors Performance of Experience System Design Action Training Complexity Abilities

State of the Environment

Decision Situational Awareness

Perception Comprehension Projection of Elements of current of future in current Status Status Status Level 1 Treat Assessment Level 2 Level 3

Treat Assessment corresponds to Perception (Level 1) and Comprehension (Level 2) in terms of situational awareness of Endsley's model.

n

hensio k SA Model Characteristic ision ection formance Dec ctions ompre Proj Per A Perception C -Making Feedbac Situational Awareness Model(SAM) Cognitive decision making

Cyber Situational Awareness Model(CSAM) Business continuity planning and CSA

Situation Awareness Reference Model(SARM) Situational awareness

Change of SA Model of Change Effective Cyber Situational Awareness(ECSA) CSA in computer networks

Te SA model has been studied and improved based on Endsley's model to suit the environment and purpose.

Strong Coverage Technical Process Week Coverage Cognitive Process

Figure 2: Change and development of situational awareness model [29].

(iii) Recommend a countermeasure to reduce the possi- In this model, the part that can be utilized as a frame of bility of cyber asset attacks threat assessment is LEF. Te LEF is the frequency with which the threat agent is likely to be harmed by the threat agent Tere are limitations in the risk assessment method of in a certain period or situation and consists of Treat Event measuring the risk by setting the range of the cyber asset in Frequency (TEF) and Vulnerability (VUL) [36]. the situation where the boundary between cyberspace and the real space is unclear such as IoT environment. Since the Treat Event Frequency (TEF). It is the frequency with which IoT environment provides a diverse attack path, the threat athreatagentislikelytoactonanassetwithinaspecifed assessment can contribute to improving the risk management period, although the threat agent may act on the asset, but approach to assets [35]. not on the success of the attack. A typical example is a hacker who has not successfully attacked a web server. Such an attack is not a loss event, but can be considered a threat event. Tis 3.2. FAIR Model for Treat Assessment. Te factor analysis of TEF consists of Contact and Action, and the attacking action information risk (FAIR) model is a risk management model is based on the contact of the threat agent. Contact refers developedbyJ.A.Jones.Tismodelfocusesontheprobability to the frequency with which a threat agent may contact an of an emerging threat event and measures threat and asset assetwithinacertainperiod.Typesofcontactareclassifedas by frequency and size to measure risk. Tis FAIR model “Random”, “Regular”, and “Intentional”. Action refers to the measures the risk by combining Loss Event Frequency (LEF) probability that a threat agent will perform an actual attack and Loss Magnitude (LM) (see Figure 3). Tis model refects on an asset in a situation where a contact of the threat agent in detail the probabilities and probabilities. Te likelihood occurs.Te preconditions for the action are intelligent threats is expressed as 100% or 0% when a threat condition is such as “Tinking” threat agents and malicious programs. possible in a binary condition, and the probability refects continuity between absolute certainty and impossibility. By Vulnerability (VUL). It means the probability that an asset is using this probabilistic approach, it is possible to balance the unable to resist the action of the threat agent. Te vulnerabil- compensation probability with the understanding of the loss ity occurs because there is a diference between the power of probability, and it has an advantage that the range of the the threat agent and the ability of the asset to resist it. Tat acceptable level of the decision maker (or manager) can be is, the vulnerability is relative to the type of threat. Tese quantifed with respect to the risk level. VULs are measured as a combination of Treat Capability 6 Wireless Communications and Mobile Computing

Contact

reat Event Frequency (TEF) reat Score Very High HH CCC Action High M H H C C Loss Event Frequency Moderate M M H H C (LEF) Low L MM HH reat Very Low LLM MM

Capability Frequency Event reat VL L M HVH (TCap) Vulnerability

Vulnerability Te Loss Event Frequency (LEF) component of the FAIR RISK (VUL) model could be used as an assessment factor for threats reat ID reat Name reat Category Control Provide type information of the asset reat Community Strength that is the attack target for the threat assessment reat Score (CS) Asset Class(id, Name, ...) Confidentiality Impact Integrity Impact Availability Impact Risk Score Loss Magnitude Primary Loss (LM) Event Frequency (PLEF) Primary Loss (PL) Te Loss Magnitude (LM) components of the FAIR model are used Primary Loss to measure the magnitude of damage to the asset. Magnitude (PLM) Secondary Loss (SL)

Secondary Loss Event Frequency (SLEF)

Secondary Loss Magnitude (SLM)

Figure 3: Using the FAIR model for threat assessment.

(TCap) and Control Strength (CS). TCap means the level of or legitimate and illegal gains against cyber-attacks, such as expected threat agent power that could have a negative impact Anonymous, APT18 (Wekby), APT19 (Codoso), and APT28 on the asset. Since all the threats of TCap are not created the (Tsar). Tese organizations receive informal support, so if same, the threat agent does not perform the same function in the sponsor organization becomes a state or government by one threat community. Also, although the TCap may be high cyber threat TTP in the future, the threat community may for an attack target in which a threat agent is set, it may be be changed to Nation-States. Individuals consist of outsiders, incompetent for other objects. CS refers to the strength with insiders, etc. and perform cyber-attacks based on personal which an asset possesses resistance against threats, measured beliefs or retaliation. A typical example is Edward Snowden. against the threat agent or threat community. If the CS is set Trough the classifcation of TCom, threat profling stores to a small number of controls, the probability value of each and manages attack patterns and characteristics of malicious control can be calculated independently. attackers with motive, primary intent, sponsorship, preferred Te elements that measure these threats can create a general target characteristics, preferred targets, capability, profling list considering the assets and threat agents that can personal risk tolerance, and concern for collateral damage. be the targets of the attack at the end. Te attackers are called the threat community (TCom), which is divided into Nation- 4. Limitation of Mobile Malware Detection States, Groups, and Individuals according to the “NIST SP 800-30: Risk Management Guide for Information Technology Research and development for the detection of malicious Systems”. Nation-States refers to threat agents who conduct behavior by attackers are one of the important technologies cyber-attacks by government or government support. Typical in the feld of security. As the usability of mobile device examples are APT37 (North Korea), APT32 (Vietnam), and increases in IoT environment, threatening behavior is chang- APT33 (Iran). Group refers to threat agents for political ideals ing not only damage to one device, but also damage to the Wireless Communications and Mobile Computing 7 whole environment of the society. From this point of view, AndroidMarket.Teyusedlogisticregression(LR),SVM, most malicious behavior detection studies tend to focus on and RF to construct a small, efcient classifer that could automation technology such as training set confguration and detect malware applications early in the sandbox [45]. W.Niu detection algorithm, such as machine learning [37]. et al. proposed a method to detect the advanced persistent Z. Aung and W. Zaw proposed a semisupervised algo- threat (APT) malicious code command and control (C and rithm to detect malware in Android. Tis research extracted C) domain with high accuracy by analyzing the mobile DNS permission features from Android’s apk fles and clustered log. Te study scored the domain through Alexa rankings and fles suspected of malicious activity. In addition, this research VirusTotal and identifed the C and C domain of malware classifed clusters of malware using three methods: Decision using the Global Abnormal Forest (GAF) algorithm. As a Tree Algorithm, Random Forests (RF), Classifcation and result, they confrmed that the proposed method achieves Regression Tree (CART). Tey found that for 500 sample more than 99% accuracy compared with the local outlier Android applications, the RF algorithm showed a high factor (LOF), �-NN, and iForest algorithm [46]. accuracy of 91.8% [38]. D. J. Wu et al. proposed a static Most studies are focused on improving the efectiveness analysis-based mechanism to detect Android malware. Te or accuracy of malware detection. However, in recent years, proposed model clusters the information related to per- the focus of CTI on security has focused on preventing missions and intents in the manifest fle by setting them accidents. Detection of malicious behavior of an attacker is as features. Finally, this research applied the classifcation also an important technology, but whether or not the detected algorithm to detect Android malware. Tey experimented result can express a criterion to respond to a threat is one of with Expectation Maximization (EM), clustering method of the important challenges. Recently, MITRE has emphasized K-means algorithm, k-nearest neighbor (�-NN) and Na¨ıve that it is the task of the current security feld to recognize Bayes classifcation algorithm and found that the algorithm threats while presenting the method of Treat Assessment combining K-means and KNN is 97.87% [39]. N. Khan et and Remediation Analysis (TARA) [34]. In addition, Fireeye, al. proposed an efcient method for detecting malicious a global security frm, publishes intelligence reports quarterly Java scripts in web applications. In this research, feature and claims that measurement and evaluation of cyber threats subset uses the wrapper method to reduce dimensions and are essential [47]. supervised learning algorithms to achieve high accuracy. Tey applied support vector machine (SVM), Na¨ıve Bayes, decision tree, �-NN, and RF algorithms and found that the 5. Threat Assessment Procedure from the �-NN algorithm achieves about 98.3% accuracy [40]. Perspective of SA H. S. Ham et al. conducted research to detect Android malware from the viewpoint of a secondary impact such In terms of the situational awareness model, threat measure- as privacy infringement and information leakage caused by ment tends to depend on the human cognitive judgment. combination of IoT and smartphone. Tey used a linear Tis study measures the LEF factors of the FAIR model as SVM algorithm to detect Android malware. As a result, this a probability distribution to minimize qualitative judgments study showed that the linear SVM algorithm yielded 95.7% and provide objective indicators to decision-makers. Tese higher than other machine learning classifcation algorithms probability distributions aim at securing objectivity of threat (average precision = 68.7%) [41]. J. Sahs and L. Khan detected assessment by detecting actual malware data. Te proposed Android malware using a one-class SVM algorithm. Tey SA model consists of three levels. Te frst level is Malware used the permissions of the application as a feature, resulting Awareness, which detects the attacker’s malicious behavior in about 50% true negative (TN) and about 90% true positive using supervised learning. Te second level is Treat Aware- (TP) for 91 malicious Android applications out of 2081 [42]. ness, which maps the features of detected malicious activity M. G. Schultz et al. studied the malware detection of Android to the LEF factors of the FAIR model and then evaluates malwareasafeatureoftheDLLandtherawhexadecimal the threat. Tis level also clusters the threats to the number representation of the system call, string, and binary generated of classes required by the decision maker. Te fnal level is by the program. Tis study compared the accuracy of the Decision-Making Awareness, which provides decision-makers signature method, RIPPER, Na¨ıve Bayes, and Multi-Na¨ıve with an optimal level of threat (see Figure 4) [48]. Bayesalgorithm.Asaresult,theMulti-Na¨ıve Bayes algorithm showed a higher accuracy of 97.76% than the other algorithms [43]. 5.1. Malware Awareness. As the expansion of the IoT envi- K. Riad and L. Ke detected malicious applications in the ronment increases the usability of mobile devices, cyber- Google Play Store. Tis research proposes a RoughDroid attacks on smartphones are increasing. Particularly, this is algorithm, a foppy analysis technology that can discover done by a malicious application of attack on the mobile Android malware applications directly on smartphones. Tey environment, and most attacks are an unknown attack. extracted the features of the XML manifest fle and Dex fle of However, existing rule-based algorithms have limitations the Android application as features and obtained the accuracy in actively detecting new threats (e.g., unknown malicious of 95.6% [44]. I. Martin et al. analyzed indirect functions and applications). Te Malware Awareness Level aims to detect metadata to identify patterns in Android malware applica- threats using machine learning. Te features of the high tions. Te research focused on malware detection, such as detection result (e.g., ROC, True Positive, Accuracy, etc.) application developers and certifcate issuers published on the obtained through the classifcation of the threat are used as 8 Wireless Communications and Mobile Computing

LEVEL 1 High Accuracy

Select features with high accuracy True Positive ↑ False Positive ↓ Determine classifcation algorithms with high accuracy ROC Curve ≈ 1 LEVEL 1 LEVEL 1 LEVEL 1 Android Malware Feature Selection Classification Application Algorithm Permission KNN Timestamp Cosine KNN Cost of Attack Cubic KNN Certificate Weighted KNN Select features Linear SVM for threat detection Quadratic SVM Cubic SVM Provide algorithms suitable for threat detection Logistic Regression

Contact LEVEL 3 reat Event Provide a semantic threat class Frequency (TEF) LEVEL 2 reat Score Very High Set features with high accuracy Action HH CCC High as input values Loss Event M HH CC Moderate of LEF factors of FAIR model Frequency MM HH C Low (LEF) L MMHH TCap Very Low LLM MM

Treat Assessment Frequency Event reat VL L M H VH Vulnerability LEVEL 2 Vulnerability (VUL) Optimize threat class CS LEVEL 3

Figure 4: Situational awareness model for measuring threat using Android malware detection.

the probability distribution of the FAIR model’s LEF factors Te EM algorithm is a process of iteratively fxing one of �� for threat measurement. and ���, and the convergence value is calculated through it. Te TC value is the minimum value for m. Terefore, if the � � 5.2. Treat Awareness. Te probability distribution of fea- function TC is diferentiated to �, � is used as a value to tures that resulted in high detection results at the Malware cluster the threat class. Tis can be expressed as [50] Awareness Level becomes the input value of the LEF factors. ∑ � � Tis study matches the detected results to TEF and VUL � = � �� � � ∑ � (3) (TCAP, CS). Tere is a limit to the difculty in clearly � �� distinguishing the threat class from the LEF measured by the joint probability of TEF and VUL. To solve this problem, this 5.3. Decision-Making Awareness. Even if the threat class research uses the K-means clustering algorithm. Based on the is clustered through the K-means algorithm at the Treat matched results, the decision maker clusters � features into � Awareness Level, there may be a limit to the semantic deci- threat classes with � dimensions. For example, in the case sion. To overcome these limitations, this level is optimized of fve ratings from “Very High” to “Very Low”, � is 5.Te by clustering threat class using Gaussian Mixture Model probability value of the features matched to TEF and VUL (GMM). In this paper, we assume that the distributions is ��.Inthiscase,��� =1if the � th �� corresponds to of the threat class are combined into the number of � � (� = 1, 2, 3, ⋅ ⋅ ⋅ , �) th threat cluster, and 0 is otherwise. Te Gaussian distributions. Te GMM can statistically deduce the distortion measure function for clustering the threat class is characteristics of these � Gaussian probability distributions. TC.Tiscanbeexpressedas[49] Te value of � is the number of threat classes set at Treat � � 2 Awareness Level. Te GMM performance measure is a log {1, if �=argmin ��� −��� likelihood function that estimates the parameter maximizing ��� = { � (1) �� 0, the probability of the function at the Treat Awareness { Otherwise Level. Te parameter �� of the log likelihood function, which � � optimizes the threat at the Decision-Making Awareness level, � � 2 � Σ �� = ∑∑��� ��� −��� (2) is the mean �,covariance � of the Gaussian distributions �=1�=1 and the probability ��. Tis is equivalent to

To minimize the threat class function TC value, we use the �(��,�,�,Σ)=∑ ∑�(�� |�,Σ ) ln ln � � � (4) iterative method EM (Expectation Maximization) algorithm. � � Wireless Communications and Mobile Computing 9

Since the GMM optimization algorithm is also difcult to measurement. If this space is used, it is efective, for high- estimate by jointly updating, this level uses the EM algorithm, dimensional calculation because it has the advantage of which is an alternative update method. Tis Decision-Making expressing more than four dimensions. Weighted k-NN is Awareness level minimizes the log likelihood function to a method of clustering by assigning the distance weight as provide an optimized threat class for decision-makers. It a reciprocal of the square. Tis method has been used for fxes ��,calculates��, Σ�,orfxes��, Σ� and computes ��. learning data near distance for new input data has more Equations to optimize the threat class are 5, 6, and 7. � is a infuence on the decision than neighboring data far away. parameter that makes optimization calculations easier with [54]. latent variables [51, 52]. Support Vector Machine (SVM). It is a method of classifying ∑� �(��� =1|��)��� �� = (5) through a set of hyperplanes. Tis algorithm classifes the ∑� �(�� =1|��) hyperplane into data that makes the classifer error small. � Tat is, the categorized application data calculates the hyper- ∑� �(��� =1|��)(��� −��)(��� −��) plane with the greatest distance from the nearest malicious Σ� = (6) ∑� �(�� =1|��) application data. To calculate this, the algorithm classifes the data by defning a kernel function �(�, �).Tesumof ∑� �(��� =1|��) �(�, �) represents the degree of proximity of the training �� = (7) � data � and its corresponding data point �� and measures the relative proximity of these points. SVM uses a linear (1st 6. Result of Threat Assessment for Android order kernel function) as a technique to fnd the hyperplane Malware Application that maximizes the margin while classifying the data well. However, since it is difcult to classify data completely To maximize the availability of IoT devices, companies devel- through any straight line, various kernel functions should be oping IoT devices are encouraging applications to be installed used for accurate classifcation. In this study, kernel functions on user’s smartphones to help connect their IoT devices. were modifed from frst to third order, and kernel functions However, when a users’ smartphone is exposed to a malicious were transformed into Gaussian functions [55]. application, the personal information stored on the user’s smartphone and the privacy information collected through Logistic Regression. It refers to the use of the relationship the IoT device may be leaked to the attacker. Terefore, between dependent and independent variables as a concrete when detecting the malicious behavior of the smartphone function for future prediction models. Tis algorithm is a in the center of the multiple link system, it is important to classifcation technique in which the dependent variable is the process of assessment by CTI as well as the process of categorical data and the result of the data is classifed into producing it. Tis section measures the threat of a malicious a specifc class when the input data is given. Tis study application in the Android environment by applying the classifed malicious application features into categorical data threat assessment procedure of the SA model perspective [56]. proposed above. Te Malware Awareness of this study derives its results through the Confusion Matrix, which evaluates the per- 6.1. Detection of Android Malicious Application. Te data of formance of machine learning algorithms. Te Confusion this study refer to the malicious application data obtained Matrix is represented by true positive (TP), false negative by static analysis from J. Jang et al. Tis data consists of (FN), false positive (FP), and true negative (TN). TP is a mea- APIs related to malicious behavior of Android applications, sure of normal activity of an application as normal behavior. and a list of system commands. Te total number of data is FN is a measure of normal behavior as malicious behavior. FP 2000 (1500 normal applications, 500 malicious applications) is a measure of malicious behavior as normal behavior. TN [53]. Based on this data, this study applied classifcation is an indicator of malicious behavior as malicious behavior. algorithms (e.g., KNN, SVM, Logistic Regression, etc.) with Trough these indicators, true positive rate (TPR), positive predictive value (PPV), false negative rate (FNR), accuracy high accuracy during related studies. Te Confusion Matrix �1 was used for the measured results. (ACC) and -score can be measured (see from (8) to (12)) [57]. �-Nearest Neighbor (�-NN). It is a nonparametric method used for classifcation or regression. Tis algorithm consists �� � = :����������V ���� (8) of the nearest training data in the feature space with . TPR �� + �� In �-NN, objects are classifed by majority vote of objects assigned to the most common items among nearest neighbors �� PPV = :��������� (9) of �. Tis study utilized various k-NN algorithms. Te basic �� + �� k-NN algorithm utilizes the Euclidean distance measure. �� Cosine k-NN is a method of using cosine similarity to cluster = :�������� FNR �� + �� (10) the similarity between vectors measured using the cosine of the angle between two vectors of inner space. Cubic k- �� + �� = (11) NN is a clustering method using Minkowski space distance ACC ��+��+��+�� 10 Wireless Communications and Mobile Computing

1.00

0.80

0.60

0.40

0.20

0.00 KNN Cosine Cubic Weighted Linear Quadratic Cubic Gauss Logistic KNN KNN KNN SVM SVM SVM SVM Regression TPR Accuracy PPV F1 Score FNR Figure 5: Detection results of Android malware using machine learning.

Table 1: Detection results using k-NN, SVM, and Logistic Regression Algorithms.

Weighted Linear Quadratic Cubic Measure �-NN Cosine �-NN Cubic �-NN Gauss SVM LR �-NN SVM SVM SVM TPR 0.876 0.907 0.839 0.917 0.990 0.980 0.980 0.853 0.990 PPV 0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 FNR 0.124 0.093 0.161 0.083 0.010 0.020 0.020 0.147 0.010 ACC 0.925 0.940 0.900 0.950 0.990 0.985 0.985 0.910 0.990 F1 0.930 0.942 0.908 0.952 0.990 0.985 0.985 0.917 0.990

2�� �1 − = : these two factors is expressed as the frequency of malicious Score 2�� + �� + �� (12) application’s threats in the Android environment. However, �������� ���� �� ��������� ��� �������V��� the frequency of these threats is not an indication of the success of the attack. Previously, the features used for Te Android malicious application is categorized into 136 detection at the Malware Awareness Level were identifed features corresponding to permission and intent, resulting in as “permission” and “intent”. Te features of “permission” high accuracy (see Table 1; Figure 5). In particular, the algo- have features that allow access to most data, such as camera, rithms that show the highest outcome among the algorithms location, call history, etc. in the Android environment. used are SVM series, such as linear SVM, Quadratic SVM, Typical examples are android.permission.GET ACCOUNTS, Cubic SVM, and logistic regression. Te features (such as android.permission.BLUETOOTH ADMIN, android.permis- “permission” and “intent”) that show high detection results sion.READ SMS,etc.Inthecaseof“intent”,mostoftheless are used in Treat Awareness as factors of threat in FAIR relevant features (android.intent.action.MAIN, android.in- model. tent.action.BOOT COMPLETED, android.intent.action.DA- TA SMS RECEIVED, etc.) that access malicious application 6.2. Treat Assessment of Android Malicious Application. were excluded from the threat assessment process. Tere is a Detection of malicious Android applications can achieve research result that features of “permission” act as a features accurate results using previously studied machine learning of the malicious applications as mentioned in Limitation of algorithms. However, in the situational awareness (or deci- Mobile Malware Detection. Based on the detection results, the sion maker) viewpoint, there is a limit in choosing the TEF is set to the probability distribution of the features of the countermeasure through the detection of the threat. In other “permission” series and refected in the threat assessment. words, decision-makers must consider the cost of ignoring or responding to threats to assets they hold. To maximize Vulnerability (VUL) of Malicious Application. VUL is caused the efective use of machine learning, the threat assessment by the diference between the power of threat (TCap, Treat method proposed in this study utilized the factors of LEF of Capability) and the defending ability (CS, Control Strength) the FAIR model. of the target asset. Te threat in this process is whether the application installed in the Android environment has Treat Event Frequency (TEF) of Malicious Application.Te access to the device’s data. Terefore, TCap can be confgured TEF of the FAIR model consists of the factors of Contact with the probability of “permission” features included in and Action. Contact indicates the possibility of an attack, the Android malicious application. In other words, the usually measured in frequency. Action means that the attack- more access privileges a malicious application has, the more ing action works in a real environment, and it is mainly robust it is. Tese threat capabilities have various malicious expressed by probability. In other words, the combination of “permissions” that an attacker could illegally acquire various Wireless Communications and Mobile Computing 11

Table 2: Utilizing “Zerodium Payouts for Mobiles” for Index of Control Strength (CS).

Range of Price Android Vulnerabilities Price Measure Up to $500,000 Email App RCE+LPE 9.5 SMS/MMS RCE+LPE 9.18 Signal RCE+LPE 8.22 Viber RCE+LPE 7.58 Up to $1,500,00 Chrome RCE+LPE 6.94 Documents RCE+LPE 6.62 Media Files RCE+LPE 6.3 Baseband RCE+LPE 5.98 Up to $100,000 LPE to Kernel 5.34 Wif RCE+LPE 4.7 Up to $50,000 Chrome UXSS/SOP 4.38 LPE to Root 4.04 RCE via MitM 3.72

pieces of information. CS is the defensive ability of assets. 7. Summary and Discussion Tis is the same as the ability to fnd vulnerabilities in the Android environment and to test them against an exploit Te expansion of the IoT environment is making connec- kit. Tese indices can be set based on the price traded in tivity between cyberspace and real space stronger. Also, for Android malware. “Zerodium” is a representative company efcient control of the IoT environment, many companies that deals with mobile malware. Trough “Zerodium Payouts are concentrating on developing mobile devices such as for Mobiles”, this study sets up CS for the Android OS (see smartphones (hardware development). However, the security Table 2) [58]. of the application installed on the smartphone is weak. Tis For the threat assessment, this study assumed that each results in a higher attack rate of malicious attackers. Te probability distribution of TEF and VUL for the Android malicious application is being updated on the market (e.g., Malicious application was generated under independent con- Google Play Store) without being verifed by the adminis- ditions. Trough this, not only the TEF and VUL factors, but trator. Many studies have focused on the detection of these also the threat class (LEF) can be measured. In the case of such malicious applications, and their accuracy and efciency are a combination, it is difcult to intuitively judge the current approaching commercialization. However, there is a limit situation from the viewpoint of the decision maker. For this to utilize the detected result of a malicious application for reason, the above-mentioned method adds an optimization decision-making from the manager’s point of view. Also, step through GMM. Tat is, at the Treat Awareness Level, existing risk assessment studies are concentrated on owning the decision maker can classify three (Low, Moderate, and assets, so there are limitations that simplify the threat, and High), fve (Very Low, Low, Moderate, High, and Very High), there are few studies evaluating threats in connection with or overall seven classes by changing the � value against the research on threat detection through machine learning. Tis threat, and visualization is possible through the optimization study proposes a method to extend and assess threat detection process through GMM. Te advantage of subdividing this using machine learning for applications installed in the threat class is the ability to determine the time and cost Android OS. Te proposed scheme is Malware Awareness of investing in countermeasures to counter the threat (see (Level 1) aimed at detecting malicious behavior for Android Figure 6). application, Treat Awareness (Level 2) for rating it, and A representative example from the point of view of threat Decision-Making Awareness (Level 3) for optimizing threat assessment is TARA. In this case, the basic strategy is to class. measure threats and identify sophisticated countermeasures. Te reasons for approaching from the viewpoint of SA are In particular, this method assumes that each countermeasure also related to CTI which is a recent issue. Te availability of is independent when selecting a countermeasure. Tis has CTI is an essential element of threat assessment. Te TARA the limitation that the selected countermeasures can reduce developed by MITER also emphasizes the ability to identify the efciency of the other countermeasures. However, if the countermeasures through threat assessment. In addition, elements of the threat are identifed and probabilistic, such as cyber-SA framework research from the “Cybaware” project the approach presented in this study, a conditional approach has resulted in an asset, confguration, impact, threat, and to the countermeasure of the threat is possible. In addition, visualization as key areas of research [59]. In particular, the when combining with recent threat detection technology threat area has identifed and evaluated the types of attackers using machine learning, deep learning, artifcial intelligence, (TTP, Tactics/Techniques/Procedures) and objectives and etc., it is possible to contribute to the efciency of decision- developed countermeasures as research results. Terefore, making by securing the real time of threat evaluation. the proposed approach can contribute to threat detection, 12 Wireless Communications and Mobile Computing

High Very Low High 0.15 Low 0.15 0.1 0.1 Very High 0.05 0.05 0 0.4 0 0.4 Degree of Treat Degree of Treat 0.3 0.3 −0.05 Moderate 0.2 −0.05 Low 0.2 0 0 Moderate 0.2 0.1 0.2 0.1 0.4 0.4 0 0.6 0 0.6 0.8 Vulnerability 0.8 Vulnerability 1 −0.1 1 −0.1 (VUL) (VUL) Treat Event Freq. (TEF) Treat Event Freq. (TEF) (a) Number of threat class: 3 (b) Number of threat class: 5

0.15 0.15 0.1 0.1 0.05 0.05

0 0.4 0 0.4 Degree of Treat Degree of Treat 0.3 0.3 −0.05 0.2 −0.05 0.2 0 0 0.2 0.1 0.2 0.1 0.4 0.4 0 0.6 0 0.6 0.8 Vulnerability 0.8 Vulnerability 1 −0.1 1 −0.1 (VUL) (VUL) Treat Event Freq. (TEF) Treat Event Freq. (TEF) (c) Number of threat class: 7 (d) Number of threat class: 10

Figure 6: Distribution of threats according to clustering of threat class.

production, measurement, and evaluation of CTI in the [5] P.R. Krishna, “Role of Internet of Tings (IoT) in Entrepreneur- security feld. ship,” 2017. [6] W. C. Ashmore, “Impact of alleged Russian cyber attacks,” Data Availability ARMY Command and General Staf Coll Fort Leavenworth Ks School of Advanced Military Studies,2019. Te data used to support the fndings of this study are [7] P. B. Pathak and Y. M. Nanded, “A dangerous trend of cyber- available from the authors upon request. crime: ransomware growing challenge,” International Journal of Advanced Research in Computer Engineering & Technology Conflicts of Interest (IJARCET),vol.5,pp.371–373,2016. [8]M.Baezner,“Cyberdisruptionandcybercrime:Democratic Te authors declare that there are no conficts of interest Peoples Republic of Korea,” ETH Zurich,vol.9,2018. regarding the publication of this paper. [9]A.Shabtai,L.Tenenboim-Chekina,D.Mimran,L.Rokach,B. Shapira, and Y. Elovici, “Mobile malware detection through Acknowledgments analysis of deviations in application network behavior,” Com- puters & Security,vol.43,pp.1–18,2014. Tis work was supported by Defense Acquisition Program [10]X.Yu,Z.Tian,J.Qiu,andF.Jiang,“Adataleakageprevention Administration and Agency for Defense Development under method based on the reduction of confdential and context Contract UD160066BD. terms for smart mobile devices,” Wireless Communications and Mobile Computing,vol.2018,ArticleID5823439,11pages,2018. References [11] F. Liu, C. Wang, A. Pico et al., “Measuring the insecurity of mobile deep links of Android,” in Proceedings of the26th [1] J. Rivera, Gartner Identifes Te Top 10 Strategic Technology USENIX Security Symposium (USENIX Security ’17),pp.953– Trends for 2014,Gartner,Inc,2016. 969, 2017. [2] K. Panetta, Inc. Retrieved at October,Gartner,Inc.,2018. [12] N. Sombatruang, Y. Kadobayashi, M. A. Sasse, M. Baddeley, [3] N. Hunn, “Te Market for Hearable Devices 2016-2020,” Tech- and D. Miyamoto, “Te continued risks of unsecured public nical Report, 2016. wi-f and why users keep using it: evidence from japan,” in [4] O. Gadyatskaya, F. Massacci, and Y. Zhauniarovich, “Security Proceedings of the 2018 16th Annual Conference on Privacy, in the Firefox OS and Tizen mobile platforms,” Te Computer Security and Trust (PST), pp. 1–11, IEEE, Belfast, Northern Journal,vol.47,no.6,pp.57–63,2014. Ireland, August 2018. Wireless Communications and Mobile Computing 13

[13] M. La Polla, F. Martinelli, and D. Sgandurra, “A survey on [31]G.P.TaddaandJ.S.Salerno,“Overviewofcybersituation security for mobile devices,” IEEE Communications Surveys & awareness,” in Advances in Information Security,vol.46,pp.15– Tutorials,vol.15,no.1,pp.446–471,2013. 35, Springer, 2010. [14] N. Virvilis, N. Tsalis, A. Mylonas, and D. Gritzalis, “Mobile [32]N.Evancich,Z.Lu,J.Li,Y.Cheng,J.Tuttle,andP.Xie, devices: A phisher’s paradise,” in Proceedings of the 2014 “Network-Wide Awareness,” in Cyber Defense and Situational 11th International Conference on Security and Cryptography Awareness,pp.63–91,Springer,2014. (SECRYPT-2014), pp. 1–9, IEEE, Austria, August 2014. [33]P.Barford,M.Dacier,T.G.Dietterichetal.,“CyberSA: [15] K. L. Chiew, K. S. C. Yong, and C. L. Tan, “Asurvey of phishing Situational awareness for cyber defense,” in Cyber Situational attacks: Teir types, vectors and technical approaches,” Expert Awareness,vol.46,pp.3–13,Springer,2010. SystemswithApplications,vol.106,pp.1–20,2018. [34] C. Atapour, I. Agrafotis, and S. Creese, “Modeling Advanced [16] S. Na and T. Kwon, “RIK: A virtual keyboard resilient to spyware Persistent Treats to enhance anomaly detection techniques,” in smartphones,” in Proceedings of the 2014 IEEE International Journal of Wireless Mobile Networks, Ubiquitous Computing, and Conference on Consumer Electronics, ICCE 2014, pp. 21-22, Dependable Applications,vol.9,no.4,pp.71–102,2018. January 2014. [35] J. Wynn, J. Whitmore, G. Upton et al., “Treat assessment reme- [17] J. Cho, G. Cho, S. Hyun et al., “Open sesame! design and imple- diation analysis (TARA): Methodology description version 1.0,” mentation of backdoor to secretly unlock android devices,” Mitre Corp Bedford Ma, Article ID MTR110176, 2011. Journal of Internet Services and Information Security (JISIS),vol. [36] J. Jones, “An introduction to factor analysis of information risk 7,no.4,pp.35–44,2017. (fair),” Norwich Journal of Information Assurance,vol.2,no.1, [18] Joint Chief of Stafs, Joint publication 3-12 (R) Cyberspace 2006. Operations, pp. 2-4, 2013. [37] I. Kotenko, I. Saenko, and A. Branitskiy, “Applying big data [19] D. Clark, “Characterizing Cyberspace: Past, Present And processing and machine learning methods for mobile internet Future,” MIT CSAIL,vol.1,pp.1–4,2010. of things security monitoring,” Journal of Internet Services and [20] R. Philip, Enabling Distributed Security in Cyberspace,Depart- Information Security (JISIS),vol.8,no.3,pp.54–63,2018. ment of Homeland Security, 2011. [38] Z. Aung and W. Zaw, “Permission-based android malware [21] M. Burhan, R. A. Rehman, B. Khan, and B.-S. Kim, “IoT detection,” International Journal of Scientifc & Technology elements, layered architectures and security issues: A compre- Research,vol.2,no.3,pp.228–234,2013. hensive survey,” Sensors,vol.18,no.9,2018. [39] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, [22] N. Koshizuka and K. Sakamura, “Ubiquitous ID: standards “DroidMat: android malware detection through manifest and for ubiquitous computing and the internet of things,” IEEE API calls tracing,”in Proceedings of the 7th Asia Joint Conference Pervasive Computing,vol.9,no.4,pp.98–101,2010. on Information Security (AsiaJCIS ’12),pp.62–69,August2012. [23] I. Andrea, C. Chrysostomou, and G. Hadjichristof, “Internet of [40] N. Khan, J. Abdullah, and A. S. Khan, “Towards vulnerability Tings: Security vulnerabilities and challenges,” in Proceedings prevention model for web browser using interceptor approach,” of the 20th IEEE Symposium on Computers and Communication, in Proceedings of the 9th International Conference on IT in Asia ISCC 2015, pp. 180–187, July 2015. (CITA ’15),pp.1–5,August2015. [24] G. Aloi, G. Caliciuri, G. Fortino et al., “Enabling IoT inter- [41] H. S. Ham, H. H. Kim, M. S. Kim et al., “Linear SVM-based operability through opportunistic smartphone-based mobile android malware detection for reliable IoT services,” Journal gateways,” Journal of Network and Computer Applications,vol. of Applied Mathematics, vol. 2014, Article ID 594501, 10 pages, 81, pp. 74–84, 2017. 2014. [25] A. D. Schmidt, H. G. Schmidt, L. Batyuk et al., “Smartphone [42] J. Sahs and L. Khan, “A machine learning approach to android malware evolution revisited: Android next target?” in Proceed- malware detection,” in Proceedings of the 2012 European Intelli- ings of the 4th IEEE International Conference on Malicious and gence and Security Informatics Conference, EISIC ’12, pp. 141–147, Unwanted Sofware, pp. 1–7, 2009. August 2012. [26] M. Gigli and S. Koo, “Internet of Tings: Services and Applica- [43] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data tions Categorization,” Advances in Internet of Tings,vol.1,no. mining methods for detection of new malicious executables,” 2,pp.27–31,2011. in Proceedings of the IEEE Symposium on Security and Privacy, [27] P. Barnaghi, W. Wang, C. Henson, and K. Taylor, “Semantics for pp. 38–49, May 2001. the internet of things: early progress and back to the future,” [44] K. Riad and L. Ke, “RoughDroid: Operative scheme for func- International Journal on Semantic Web and Information Systems, tional android malware detection,” Security and Communica- vol.8,no.1,pp.1–21,2012. tion Networks, vol. 2018, Article ID 8087303, 10 pages, 2018. [28] M. R. Endsley, “Toward a theory of situation awareness in [45] I. Mart´ın,J.A.Hern´andez, A. Mu˜noz et al., “Android mal- dynamic systems,” Human Factors: Te Journal of the Human ware characterization using metadata and machine learning Factors and Ergonomics Society,vol.37,no.1,pp.32–64,1995. techniques,” Security and Communication Networks,vol.2018, [29] T. Pahi, M. Leitner, and F. Skopik, “Analysis and assessment Article ID 5749481, 11 pages, 2018. of situational awareness models for national cyber security [46] W. Niu, X. Zhang, G. Yang et al., “Identifying APT malware centers,” in Proceedings of the 3rd International Conference on Domain based on mobile DNS logging,” Mathematical Problems Information Systems Security and Privacy, ICISSP 2017,pp.334– in Engineering,vol.2017,ArticleID4916953,9pages,2017. 345, February 2017. [47]L.Qiang,Y.Zeming,L.Baoxu,J.Zhengwei,andY.Jian, [30] J. Okolica, J. T. McDonald, and G. L. Peterson, “Developing sys- “Framework of Cyber Attack Attribution Based on Treat tems for cyber situational awareness,” 2nd Cyberspace Research Intelligence,” in Interoperability, Safety and Security in IoT,pp. Workshop, pp. 46–56, 2009. 92–103, Springer, 2016. 14 Wireless Communications and Mobile Computing

[48] M. Park, J. Seo, J. Han et al., “Situational awareness framework for threat intelligence measurement of android malware,” Jour- nal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications,vol.9,no.3,pp.25–38,2018. [49] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters,vol.31,no.8,pp.651–666,2010. [50] J. Wu, “Cluster Analysis and K-means Clustering: an introduc- tion,” in Advances in K-means Clustering,pp.1–16,Springer, 2012. [51] D. Reynolds, “Gaussian mixture models,” Encyclopedia of bio- metrics, pp. 827–832, 2015. [52] I. D. Dinov, in Expectation Maximization and Mixture Modeling Tutorial, pp. 5–11, 2008. [53]J.-W.Jang,H.Kang,J.Woo,A.Mohaisen,andH.K.Kim, “Andro-AutoPsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation,vol.14,pp.17–35,2015. [54] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” Te American Statistician,vol.46, no.3,pp.175–185,1992. [55] A. J. Smola and B. Sch¨olkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004. [56] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, “An introduction to logistic regression analysis and reporting,” Te Journal of Educational Research,vol.96,no.1,pp.3–14,2002. [57] D. M. Powers, Evaluation: from Precision, Recall And F-Measure to ROC, Informedness, Markedness and Correlation, 2011. [58] J. Meakins, “A zero-sum game: the zero-day market in 2018,” Journal of Cyber Policy,pp.1–12,2018. [59] R. A. Kemmerer, Cybaware: A Cyber Awareness Framework for Attack Analysis, Prediction, and Visualization, 2011. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 4064201, 12 pages https://doi.org/10.1155/2019/4064201

Research Article Radio Environment Map Construction by Kriging Algorithm Based on Mobile Crowd Sensing

Zhifeng Han , Jianxin Liao ,QiQi , Haifeng Sun, and Jingyu Wang

Institute of Network Technology, Beijing University of Posts and Telecommunications, 100088, China

Correspondence should be addressed to Zhifeng Han; [email protected]

Received 25 October 2018; Revised 28 December 2018; Accepted 13 January 2019; Published 3 February 2019

Academic Editor: Ilsun You

Copyright © 2019 Zhifeng Han et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the IoT era, 5G will enable various IoT services such as broadband access everywhere, high user and devices mobility, and connectivity of massive number of devices. Radio environment map (REM) can be applied to improve the utilization of radio resources for the access control of IoT devices by allocating them reasonable wireless spectrum resources. However, the primary problem of constructing REM is how to collect the large scale of data. Mobile crowd sensing (MCS), leveraging the smart devices carried by ordinary people to collect information, is an efective solution for collecting the radio environment information for building the REM. In this paper, we build a REM collecting prototype system based on MCS to collect the data required by the radio environment information. However, limited by the budget of the platform, it is hard to recruit enough participants to join the sensing task to collect the radio environment information. Tis will make the radio environment information of the sensing area incomplete, which cannot describe the radio information accuracy. Considering that the Kriging algorithm has been widely used in geostatistics principle for spatial interpolation for Kriging giving the best unbiased estimate with minimized variance, we utilize the Kriging interpolation algorithm to infer complete radio environment information from collected sample radio environment information data. Te interpolation performance is analyzed based on the collected sample radio environment information data. We demonstrate experiments to analyze the Kriging interpolation algorithm interpolation results and error and compared them with the nearest neighbor (NN) and the inverse distance weighting (IDW) interpolation algorithms. Experiment results show that the Kriging algorithm can be applied to infer radio environment information data based on the collected sample data and the Kriging interpolation has the least interpolation error.

1. Introduction However, the primary problem of constructing REM is how to collect large scale of data. Currently, most of the REMs In the IoT networks, 5G technology is characterized by are aimed at small scale and applied to specifc applications. higher bit rates with more than 10 Gigabits per second as And the universal methods to build a REM are by deploying well as by more capacity and very low latency, and it will sensors in a certain environment to collect the sensing data. leverage novel technological concepts to meet the “anywhere However, the REM is applied to dozens of diferent kinds of and anytime” requirements of IoT devices. With the rapid networks and applications, which makes the networks and development of IoT devices amount, the demand for wireless applications have to collect data separately [2, 3]. Besides, the spectrum resources is increasing. In order to dynamically same data can hardly be shared and reused among diferent plan the spectrum resources to improve the utilization of applications, resulting in duplication of data collection and radio resources to provide well control of IoT devices’ access a waste of resources. Terefore, it is of great signifcance control, we can build radio environment map (REM) to to construct a large scale and universal REM, which can collect and understand the radio information. REM can ofer integrate data sources of radio environment and avoid the multidomain environmental information, such as geographi- cost of the reconstructing database [4]. Mobile crowd sensing cal features, available services, spectral regulations, locations (MCS) is an efective solution to solve this problem, which is and activities of radios, relevant policies, and experiences [1]. a novel emerging paradigm that leverages the smart devices 2 Wireless Communications and Mobile Computing carried by ordinary people to collect information and has Te rest of the paper is organized as follows. In Section 2, facilitated many sensing applications, such as environment the related works are introduced. Section 3 outlines the archi- monitoring, trafc detection, social interaction, and public tecture of the REM based on MCS. In Section 4, the Kriging information sharing [5]. MCS can be applied to collect interpolation algorithm is introduced. Te simulation results radio environment information in the sensing area. In order are illustrated in Section 5. Section 6 presents the conclusion. to characterize environmental information comprehensively, recruiting adequate ordinary users with smart devices to 2. Related Works participate in radio environment information collection is needed. Compared with the traditional data collection tech- Building REM needs a large number of sensors and kinds of nologies, MCS collects the environment information by built- radio environment information, which is a great challenge. in sensing modules in the mobile terminals, and it has the At present, data collection methods for REM can mainly be properties of mobility, the ubiquity of nodes, the powerful categorized into three types. First is integrating or accessing storing, and computing ability [6, 7]. the related information directly from existing databases, esti- Wireless network signals are all electromagnetic waves, mating radio propagation characteristics by sofware tools, whose transmission and attenuation are complex process. and leveraging cognitive radios devices or networks to sense Terefore, in this paper, we only analyze the transmission data. Gathering data from the existing database is a relatively and attenuation processes of electromagnetic waves in space convenient way, while the data updating time depends on entropy in ideal conditions. Under ideal conditions, the prop- the updating period of the underlying database. Moreover, agation process is free from any obstruction and without any the historical information is not stored in the underlying multipath propagation. Ten the propagation model of space database. Riihijarvi¨ et al. take vantage external datasets to electromagnetic waves is a free space propagation model. build REM, but the update cycle of the external datasets is According to the pattern of wireless electromagnetic wave very long which makes datasets unable to meet the real- transmission in free space, spatial interpolation algorithm time requirement of REM [8]. Constructing REM in this way can be applied to restore the uncollected radio environment makes it difcult to satisfy the upper-layer applications with information data of the sensing area. Te Kriging inter- the requirement for real-time and historical information. polation algorithm has been widely used in geostatistics Second, the way to characterize and estimate the properties principle for spatial interpolation but is not broadly used in of radio transmission based on sofware is to calculate the wireless network area. Kriging spatial interpolation algorithm signal attenuation by modeling so that we can better plan estimates unknown point data and not only considers the the radio environment [9, 10]. Te model in [11] clearly relative positions of estimated points and known sample gives a solution to the signal difraction problem caused by points, but also considers the relative positional relationship the occlusion, but this requires an accurate vector model between all sample points. In this paper, we proposed to use of all three-dimensional structures, with limited data and Kriging interpolation algorithm to infer the uncollected radio resolution in most experimental environments. It cannot be environment information based on the collected sample data. applied to applications that require high accuracy. Te above- In this paper, we proposed to apply the MCS to collect mentioned estimation method usually provides limited data, the radio information. Furthermore, to address the problem bad accuracy of the data. Tird, the method based on wireless of the incomplete radio environment information caused device or external network mainly uses the information by the inadequate sensing data, we proposed to apply the sensing ability of heterogeneous spectrum sensor network to Kriging interpolation algorithm to infer the uncollected radio collect data [12, 13]. In terms of data collection, MCS refers to environment information with the collected sensing data. the sensing paradigm in which mobile users with sensing and Our contributions are as follows: computing devices are tasked to collect and contribute data in order to enable various applications [14]. It combines people- centric sensing and crowdsourcing so that a great number of (i) We propose a REM prototype system based on MCS, ordinary users with smart devices can cooperate with each where the ubiquitous, massive, and high dimension other to form a sensing network and deliver the sensing REM-related data can be sensed and collected by the tasks [5]. Ten participants can upload the sensing data to terminals carried by mobile users. the MCS platform. Te development of MCS has resulted in various novel sensing applications. Some typical examples (ii) We propose to apply Kriging interpolation algorithm include the air quality inspection application Common Sense to infer the uncollected radio environment informa- for air quality monitoring [5] by the University of California tion caused by the target area being not covered by the Berkeley, the Creek Watch application to evaluate city water participants. resources [15] by IBM, and the Nericell system [16] by Microsof to monitor road and trafc condition implemented (iii) We set up experiments to collect the sample of the by piggybacking on smartphones that users carry with them radio environment information and infer the missing in normal course. MCS has attracted much attention from radio information of the target area. Te results researchers due to advantages such as ubiquitous sensor show that the Kriging interpolation algorithm can nodes, good participant mobility, low maintenance cost, and infer the missing radio information and has the least rich sensing data types. Hence, MCS can be applied to collect interpolation error. the radio environment information. Wireless Communications and Mobile Computing 3

MCS can be applied to collecting large scale data due last, the visualization layer shows the REM relating results in to the properties of mobility, the ubiquity of sensing nodes. the forms of the feld strength map, heat map, and some other However, limited by the budget, there are no enough par- maps. ticipants recruited to join the data collection, which makes Our proposed architecture involves various functional the radio environment information incomplete. To build blocks, communicating via well-specifed interfaces. To the REM, complete radio information is needed. A lot of establish a complete radio environment map, the funda- works have been done about inferring the missing data mental problem is the collection of a large number of data according to the sample data in many research felds. Talvitie with complex types and data processing and visualization. et al. investigate spatial interpolation and extrapolation algo- Our system consists of fve diferent function modules: data rithms for construction of fngerprint databases [17]. Lacking sensing, data collection, data processing, data analysis, and knowledgeaboutthebeaconlocations,measurementatan visualization; each of them has its own function. unknown point is interpolated based on actual measurements Data sensing module is operated by the MCS network, in the surrounding. Tere are several interpolation algo- which is organized by mobile terminals carried by mobile rithms considered in [17], which include linear interpolation users. When a mobile user receives a data sensing task, it based on Delaunay triangulation, the nearest neighbor (NN), will determine whether the user is involved in the task. If and the inverse distance weighting (IDW) to name a few. Te so, it will collect the required data by the sensing module results show that location accuracy is enhanced by utilizing embedded in the terminal. Moreover, it also uploads data constructed databases comparing to the incomplete database. to the web server by diferent types of network accessing Grimoud et al. use an iterative process to obtain the REM technologies like Wi-Fi/3G/4G. Our system includes per- based on Kriging interpolation to reduce the measurement ception of user-uploaded data and calls, mobile phone map data required [18]. Umbert et al. apply Kriging and a modifed API, real-time construction of heat map, and signal strength version of the Inverse Distance Weighted (IDW) algorithm map. Users can use wireless detection real-time view of to build a REM of an outdoor TV spectrum resources [19]. the environment in which the radio spectrum resources are Hence, there is a spatiotemporal correlation between radio used. environment data, and the Kriging interpolation algorithm Data collection mainly includes area partition, incentive can be applied to infer the missing radio environment infor- mechanism, nodes selection, task distribution, data storage, mation data. Here, we use Kriging interpolation algorithm and data distribution. Te area partition is designed to to infer the missing radio environment information data identify whether a sensing task refers to a geographical according to the collected sample data. location or is based on some social relationships. In our system, we divided it into regional division and business 3. The REM Based on MCS Architecture division. Te incentive mechanism is used to reduce the cost of the platform as well as attracting enough sensing users. In this section we will introduce the REM based on MCS. Furthermore, node selection mechanism needs to select the First, we present the system architecture and discuss the appropriate nodefor the data sensing and also needs to assign system components of the radio environment information the sensing nodes to the corresponding sensing tasks if there data collection platform. Second, we introduce the data is more than one task. collection process used to collect the radio environment Data processing module mainly includes two modules: information data. data preprocessing (fltering and cleaning) and data fusion, which is implemented by the MapReduce workfow. Te data 3.1. Radio Environment Information Collection processing fow is as follows. Firstly, the Avro in the data System Architecture fusion module compresses various types of formats of the data and merges massive small fles into large fles to improve 3.1.1. Radio Environment Information Collection Platform. the efciency of MapReduce. Secondly, as the raw data is Figure 1 shows the overview of our system based on MCS. varying in data types, the data cleaning and fltering can As shown in Figure 1, from bottom to upper layer, the play an important role to remove the noise and interference system includes data sensing layer, data collection layer, such as error data. Tirdly, these data are processed by data processing layer, data analysis layer and visualization sever cluster, and the processing results are stored in data layer. In the data sensing layer, a large number of mobile center. terminals constitute the mobile crowd sensing network, and Data analysis is responsible for the statistical analysis and they play the role of data sensing by running our data calculation afer the data preprocessing. In order to exhibit collecting APP named wireless detect. Te mobile terminals the radio environment on the map, it needs to perform upload the sensing data to our cloud servers via Wi-Fi/3G/4G analysis and calculation to get the related parameters such networks. Te data collection layer is mainly responsible for as the channel occupation, frequency band occupancy, and receiving data, node selection, task allocation, and making background noise intensity. incentive mechanism to recruit enough interested nodes Te visualization module is responsible for the REM- to participate in the sensing tasks. Te data preprocessing related data parameters exhibition. We designed the visual- includes arranging the data format and data fusion. Te data ization for the REM properties. Te system can show the analysis layer is responsible for the statistical analysis and Wi-Fi signal coverage, cellular signal coverage heat map, calculation of the radio environment relevant parameters. At and Wi-Fi channel occupation ratio map. Te visual REM 4 Wireless Communications and Mobile Computing

Visualization

Termodynamic Field Strength Map Spectrum Occupancy Map Diagram

Data Analysis

Fusion Filtering Cleaning Data Processing Hadoop

HDFS Data Data Data HBase

Data Collection

Data Sensing

Figure 1: Radio information collection architecture based on MCS.

makes it easy to identify the radio environment of the target In our REM task assignment stage, the system is responsi- area. ble for recruiting and selecting participants for the MCS task. Correspondingly, this stage includes participants recruiting, 3.2. Radio Environment Information Collection Process. In participants’ selection, and incentive mechanism. We choose [20], the author proposed 4W1H model in mobile sensing the well suited participants to join the sensing task to collect anddividedtheMCSlifecycleintofourphases,whichis the radio environment information, and reward them for the shown in Figure 2: task creation, task assignment, individual high quality sensing data. Te purpose of participants recruit- task execution, and crowd data integration according to the ing is to encourage enough people to join the sensing task and MCS life cycle. Next, we will discuss the following key design get more radio environment data. However, limited by the issues: REM task creation, REM task assignment, participants budget of the platform or the human mobility, only part of recruiting, and participants’ selection. the participants can join the radio environment information Te task creation specifes the sensing timing and cov- sensing task. Ten the radio environment information data is erage area for the REM. In our system, the web server incomplete; we will talk about the solution later. releases the sensing tasks to the users who are interested in In sensing task execution, participants conduct sensing the data collection task. REM supports long spatiotemporal tasks and upload the sensed data to the MCS platform. information for the upper-layer applications, so the sensing Te participants receive the sensing tasks and then collect time is continuous. the radio environment data. Te selected participants are Wireless Communications and Mobile Computing 5

Task creation Task assignment Individual task execution Crowd data integration Individual task MCS (user id, task id, task 1 region id, start time, end time, Data stream (activity, image, voice location,...) MCS task sensor types, ...) Individual task app requirement Sense Compute Upload Cloud server (when, what, where) Query Individual task app MCS ofine task 2 Data analysis Individual task app visualization

Individual task app

MCS task 3

Figure 2: Te four-stage life cycle of the mobile crowd sensing process [20]. distributed in the target places collecting data. Afer radio Table 1: Description of parameters in the equations. environment data collection the participants upload the data Symbols Descriptions to the MCS platform server by cellular networks (3G/4G) or �� Random variable WLAN. ��,� Diference between two neighbor points During the data integration, the main issue is to achieve �[��,�] Mathematical expectation the MCS task that is to process and analyze the raw data �(ℎ) Variogram function received from mobile terminals and visualize the required �(ℎ) Number of pairs of observations results eventually. ℎ Te separation distance ��,� Variogram value � A set representing the area of interest 4. Missing Data Inference by Kriging �, � Fitting parameters with a strict constraint �� � 4.1. Related Defnition. In this section we will introduce how � RSS value received from the th beacon to infer the missing radio environment information data by Kriging interpolation algorithm. Te whole process is distribution of {��} at times �1 +�,⋅⋅⋅,�� +�.Ten,{��} is shown in Figure 3, which can be divided into two steps. said to be strictly (or strongly) stationary if, for all �,forall�, First, we analyze the distribution of sample points distributed and for all �1,⋅⋅⋅��, ��(�� +�,⋅⋅⋅ ,�� +�)=��(�� ,⋅⋅⋅�� ). in the sensing area and propose a variogram model to 1 � 1 ��� refect the spatial structure characteristics and distribution 4.2. Problem Formulation characteristics of the variables. Second, we use Kriging algorithm to calculate the missing data according to the (1) Analysis of the distribution of sample points: collected sample radio environment information data. Tird, Given the sample radio environment information some basic concepts and related defnitions are introduced, � data �� of the area �⊂�, we need to use the and all the parameters are listed in Table 1. variogram function to analyze the distribution of sample data in the sensing area. If the sample data Defnition 1 (variogram). In spatial statistics the theoretical only depends on the distance ℎ between the sample variogram 2�(�1,�2) is a function describing the degree of data points, we can use Kriging interpolation to infer spatial dependence of a spatial random feld or stochastic � more data. process �(�). Given an area of interest �⊂�,themeanof RSS value at a location �� is considered as a random variable (2) Kriging interpolation process: (RV) ��. Ten, the mean of RSS values over the area can be Given the sample radio environment information � represented by a random feld (RF), which is a collection of data �� of the area �⊂�, we need to infer the spatial RVs, {�� |�� ∈�}. complete radio environment information data �� to build the REM. Defnition 2 (stationary process). Formally, let {��} be a � {� ,⋅⋅⋅ ,� } stochastic process and let � �1+� ��+� represent the 4.3. Te Variogram. First, some basic concepts and defni- cumulative distribution function of the unconditional (i.e., tions of the Kriging interpolation algorithm are introduced with no reference to any particular starting value) joint as follows: 6 Wireless Communications and Mobile Computing

Whole Kriging interpolation process

Output No Stability ? end Sample radio Information data Input Kriging Output Variogram Interpolated interpolating Input data

Input

Figure 3: Te whole process of inferring more radio information data by Kriging interpolation algorithm.

Te variogram is ofen applied in the statistical process ℎ, which means the value is relative to the relative position of geolocation-related information. variogram can describe and does not depend on the absolute position. According to the structural change and distribution of variables in the the above two conditions, it can be concluded that the target geospatial space. Assume that the value of sample data point � regionalization variable is strictly second-order stationary in in the sensing area is �(�) and the sample data value of point the target area. However, since it is difcult to satisfy a strict �+ℎis �(� + ℎ). Ten half of the variance of the diference second-order stationary state in real life, the condition that between the values at the two points is defned as the variation satisfes the strict second-order stationary state is weakened of �(�) at position �. Te function can be expressed as to obtain an intrinsic assumption, also called an intrinsic [�(�) − �(� + 1 assumption. Similarly, when the increment � (�, ℎ) = {[� (�) −�(�+ℎ)]2} ℎ)] of the target regionalization variable �(�) satisfes the 2 var following two conditions, it is said to satisfy the intrinsic 1 assumption: = (� [� (�) −�(�+ℎ)]2 (1) 2 (i) First, the sample radio environment information data 2 − {� [� (�) −�(�+ℎ)]} ) have to satisfy the rules, which are listed as follows.

where �(�, ℎ) is the variogram and �(�) and �(� + ℎ) are � [� (�) −�(�+ℎ)] =0 (4) sample attribute values of the variables at points � and �+ℎ in the target area, respectively. ℎ is the distance between � �+ℎ �{[�(�) − �(� + ℎ)]2} (ii) Second, the variance function of the target region- and ,and is the mathematical [�(�) − �(� + ℎ)] expectation. In the variogram function, when the increment alized increment exists and is �(�) of the variable [�(�)−�(�+ℎ)] of the target area satisfes stable; the variance function of the increment does � the following two conditions, it is said that �(�) satisfes the not depend on ,then second-order stationarity. var [� (�) −�(�+ℎ)] (i) First, the sample radio environment information data 2 2 have to satisfy the rules, which are listed as follows: =�[� (�) −�(�+ℎ)] − {� [� (�) −�(�+ℎ)]} (5) � [� (�)] =�[� (�+ℎ)] =�, ∀�,∀ℎ (2) =�[� (�) −�(�+ℎ)]2 ,∀�,∀ℎ. means the target regionalization variable does not have obvi- ous characteristics in terms of space and fuctuates around m. When the variation of the target area satisfes the weak second-order stationary or intrinsic assumption, due to (ii) Second, in the entire target area, the covariance �[�(�)−�(�+ℎ)]=0, the half-diference function can be function of �(�) exists and is stable; namely, expressed as follows. cov [� (�) ,�(�+ℎ)] =�[� (�) � (�+ℎ)] 1 2 (�, ℎ) = � [� (�) −�(�+ℎ)] (6) −�[� (�)] � [� (�+ℎ)] 2 (3) =�[� (�) � (�+ℎ)] −�2 At this time, the increment [�(�) − �(� + ℎ)] of �(�) is only related to the distance between two points. Te above =�(ℎ) . variogram function is a theoretical variogram function. In actual operation, multiple sample data needs to be divided Te attribute value �(� + ℎ) in the target area has no into multiple pairs for calculation, like {�(��), �(�� + ℎ)}(� = relationship with the position point � and is only related to 1,2,...,�(ℎ)),where�(ℎ) is the number of pairs of points in Wireless Communications and Mobile Computing 7

∗ the sample data divided by ℎ. Te extremes of the variogram �[� (�0)] = � [� (�0)] . (10) can be calculated in the following way. Ten, �(ℎ) ∗ 1 2 � (ℎ) = ∑ [� (� )−�(� +ℎ)] (7) � 2� (ℎ) � � �=1 ∑�� =1. (11) �=1 In (7), ℎ is the distance between the sample point and the point to be estimated, and �(ℎ) is the number Ten under the condition that �(�) is second-order of samples used to calculate the variogram of the sample stationary, the calculation process of the estimated variance between (��,�� +ℎ). Afer the above steps, the analysis of can be performed by the following method. the distribution characteristics of sample points in the target 2 ∗ 2 ∗ 2 area has been completed. However, in order to estimate the �� =�[� (�0)−�(�0)] −{�[� (�0)−�(�0)]} unknown value of the target area variable, the ftting of the � � � convenience function point of the actual sample is also called (12) = ∑∑������,� −2∑����,0 +�0,0 the theoretical variogram model. Te theoretical model of the �=1�=1 �=1 variogram is to abstract the experimental variogram and then use it to calculate the Kriging interpolation. In order to minimize the variance of unbiased estimates, ∗ � Te empirical variogram contains values at a limited min{var[� (�0)−�(�0)] − 2� ∑ (�� −1)},andthenwecan ℎ �=1 number of . To estimate the measurements at unknown get the equations for the weighting coefcients �� in (8). locations, access to the value of h between the scattered points in the empirical variogram is required. Hence, a � ∑� (� ,� )+�= (� ,�) mathematical model is selected to be ftted in the empirical �cov � � cov 0 � �=1 variogram. Tis model is frequently chosen from spherical (13) model, exponential model, Gaussian model, power model, � and linear model. We choose the sample radio environment ∑�� =1, �=1,2,⋅⋅⋅,� data and input them to the Matlab. Ten we can use the ftting �=1 function to fnd a mathematical expression. Te equations are the Kriging equations. In addition, (8) 4.4. Kriging Interpolating. Once the variogram is obtained, is written in matrix form: values at unknown locations can be estimated based on [�][�] = [�] (14) known data points. Mathematically, this problem can be regarded as a spatial interpolation problem. Assuming that where the target area to be studied is A, the variable in the target area is {�(�) ∈ �},where� represents a position in the target area. �1 �(�) � (�= 1,2,⋅⋅⋅ ,�) [ ] Te sample value of in the target area � [ � ] �(� )(� = 1, 2, ⋅ ⋅ ⋅ , �) �(� ) � [ 2 ] is � .Tenthevalue 0 at the point 0 [ . ] � [�] = [ . ] to be estimated is the weighted sum of the known point [ ] sampling values: [ ] [ �� ] � [ −� ] � (�0) = ∑��� (��) (8) �=1 �01 � (� = 1,2,⋅⋅⋅ ,�) [ ] where � is the weight coefcient of [� ] [ 02] the known sample point. Due to the fact that �(�) satisfes [ ] [ . ] the second-order stationary assumption when analyzing the [�] = [ . ] [ ] (15) distribution of sample points in the target area, then [ ] [�0�] (i) there is a mathematical expectation for variable �(�), 1 and the expected value is a constant, �[�(�)] = �; [ ]

(ii) there is a covariance function for the variable A; that �11 �12 ⋅⋅⋅ �1� 1 is, the value of the point to be estimated in the target [ ] [� � ⋅⋅⋅ � 1] area is only related to the distance B between the [ 21 22 2� ] [ ] positions of the known sample points; then, [ . . . .] [�] = [ . . d . .] . [ ] {� (�) ,�(�+ℎ)} =�[� (�) ⋅�(�+ℎ)] −�2 [ ] cov [� � ⋅⋅⋅ � 1] (9) �1 �2 �� =�(ℎ) . [ 11⋅⋅⋅10]

According to the unbiased requirements of the interpola- Afer solving the weight coefcients � of the above tion available, equations, you can use (8) to calculate and calculate the 8 Wireless Communications and Mobile Computing predicted values with the valuation points. Since the Kriging Table 2: Experimental equipment parameter confguration. interpolation algorithm has a minimum estimation error based on known samples and is considered according to the AP MAC channel Transmission power (dBm) distribution of the attribute values of the target region, the AP1 48:7A:DA:B9:21:A0 11 20dBm data of more known sample points within the target region AP2 48:7A:DA:B9:12:60 1 20dBm can be used for estimation, and the estimated values are closer AP3 50:DA:00:9D:B3:E0 6 20dBm to the true value. We apply the Kriging interpolation algorithm to infer more radio environment information data according to the 5.2. Interpolation Performance Evaluation. In this section, sample collected information. We use the sample data to experimental settings are described in detail. We will intro- estimate the value of sensing area. Our Kriging weights are duce the experiment environment frst, and the baseline derived through minimizing the estimator error variance; methods used in the experiments are presented. Te exper- that is, imental data is also introduced, and experimental settings ∗ and evaluation metrics are also proposed to evaluate the min var (�� − ��) ��∈� (16) performance of our method.

× under the unbiasedness constraint, given by the following. 5.2.1. Experimental Settings. Te experimental area is 150m 150m as the sensing area to collect the radio environment ∗ information and infer the missing radio environment infor- �[� − ��]=0 (17) � mation data. In order to simulate the wireless network environment of the target area, three wireless network Te mathematical expectation of the sample radio envi- access points (APs) and radio network controllers (ACs) are ronment information is zero. Assuming the intrinsic station- deployed to form a WLAN in the target area to cover the arity and utilizing Lagrange multiplier optimization algo- target sensing area. Te AP and AC are H3C WA4320-ACN rithm to minimize the estimator error variance (16) under the � and H3C WX3010E. In our WLAN network, the frequency unbiasedness constraint (17), the Kriging weights � in (8) can band of the electromagnetic wave transmitted by the AP is be calculated as 2.4 GHz band. In order to better cover the target area and reduce mutual interference between APs, the three APs use � � ⋅⋅⋅ � 1 −1 � 1 1,1 1,� 1,� the 1, 6, and 11 channels of the 2.4G band respectively, and the . . . . . transmission power of AP electromagnetic waves is 20 dBm. . . d . . . ( )=( ) ( ) (18) All parameters are listed in Table 2. �� ��,1 ⋅⋅⋅ ��,� 1 ��,� In order to simulate a small number of users collecting radio environment information in the target area, we mesh � 1 ⋅⋅⋅ 1 0 1 the target area. We divide the target area into subareas, each of which has a size of 5� × 5�, and then the participants where ��,� is the radio environment information variogram collect the WLAN signal in diferent subareas using mobile value between the ith and jth neighbor data points, ��,� is the devices to collect Received Signal Strength Indication (RSSI), radio environment variogram value between the ith neighbor as shown in Figure 7. Te device used by the participants data point and the interpolation point. to collect the RSSI value is a Lenovo smart phone (Lenovo A3910e70), and the Wi-Fi analyzer is used to obtain the received signal strength value of the wireless network in the 5. Simulation Evaluation subarea. In order to prove that using the Kriging spatial In this section, we show the prototype REM system based on interpolation algorithm to infer the wireless network envi- MCS and demonstrate the simulation results of the inference ronment data in our REM platform has higher accuracy, of the missing radio environment information data according control experiments are set up. In the control experiments, to the sample data. the missing wireless network environment in the target area is restored by Nearest Neighbor (NN) and Inverse Distance 5.1. Implementation of the Prototype System. As is shown Weighting (IDW) according to the sample data of the wireless network environment data collected by the participants. in Figure 4, the radio environment information is collected and displayed in the web portal of the platform. As we can see several Wi-Fi properties can be seen on the banner. 5.2.2. Baseline Methods. To verify the high accuracy of the Te properties collected by participants are as follows: SSID, Kriging spatial interpolation algorithm, we used the nearest BSSID, frequency, and the Wi-Fi signal strength level. Te neighbor (NN) interpolation algorithm, inverse distance information belongs to the Wi-Fi signal sources sensed by the weighted (IDW) interpolation algorithm, and Kriging inter- nearby participants. Te density of the red nodes represents polation algorithm to predict the restoration goal under the the density of the Wi-Fi signal sources. As we can see in same data volume of sample data. Figure 8 the density of the Wi-Fi signal source is not uniform. (a) NN [21, 22]: Nearest neighbor interpolation is a Tisresultisasexpected. simple method of multivariate interpolation in one or more Wireless Communications and Mobile Computing 9

200

180

160

140

120

100 r(h) 80

60

Figure 4: Te web portal of REM based on MCS. 40

20

0 dimensions. For a given set of points in space, a Voronoi 05 10 1520253035404550 diagram is a decomposition of space into cells, one for each h given point, so that anywhere in space, the closest given 1 r∗ ℎ) = ∑N(ℎ) Z x  −Z(x +ℎ))2 pointisinsidethecell.Tisisequivalenttonearestneighbor ( 2N(ℎ) i=1 i i interpolation, by assigning the function value at the given 1.0857 point to all the points inside the cell. r(ℎ) = 0.25 × ℎ (b) IDW [23]: Inverse distance weighting (IDW) is a Figure 5: Variation function discrete data points and ftting curves. type of deterministic method for multivariate interpolation with a known scattered set of points. Te assigned values to unknown points are calculated with a weighted average of the function, it is suggested that a power model is selected; that values available at the known points. is,

5.2.3. Experimental Data. In order to obtain the discrete data {0, ℎ = 0 � (ℎ) = { (19) points calculated from the experimental variogram based �ℎ�,ℎ≥0 on the WLAN sample data collected in the target area, a { theoretical variogram model and a power function model where � and � are the ftting parameters with a strict were selected according to the distribution of the discrete constraint that 0 <�<2. data points. To verify the efect of diferent numbers of As indicated in the fgure, the power function is well sample data on inference of the radio environment data of the ftted, so we choose the power function as the variogram. It entire target area, the target area is divided into 315 subareas, can be obtained by ftting with Matlab ftting function, as is and a scale factor of the number of sample data and total shown in Figure 5. Afer the Matlab ftting function, we can data is set. During the experiment, the WLAN emission 1.68 get the variogram being �(ℎ) = 0.25 × ℎ .Ten,wecan electromagnetic wave signal propagates in the free space and calculate the variogram between all known points. Te value causes attenuation. Terefore, the smaller the distance from of sample RSSI points is related to the distance. the electromagnetic wave transmission position, the greater the RSSI value that can be obtained. We choose �=0.1 and �=0.3to indicate the percentage (10% and 30%) of 5.2.4. Evaluation Metrics. First, the signal propagation is subareas being covered by the participants of the target area, simulated over the interest area. Meanwhile, the interpolation respectively; by drawing the target area WLAN signal RSSI error of the Kriging interpolation algorithm is compared with heat map, we can compare the accuracy of our algorithm nearest neighbor (NN) interpolation algorithm and inverse under diferent coverage situations. According to the sample distance weighted (IDW) interpolation algorithm. data collected from the sample, the heat map of the RSSI is plotted using the Kriging spatial interpolation algorithm. 5.3. Prediction Performance Analysis. In this section, the Figure 5 presents the empirical variogram and ftting performances of proposed method are evaluated. First, we result of a beacon in radio environment information. We show the results of user latent interest distribution. Ten, the input the sample information data to Matlab. Te ftted curve impact of interest number and the proportion of training set demonstrates the spatial correlation model of data and is used on link prediction can be verifed. to estimate the information data at a sensing location. As shown, the value of empirical variogram, which is the scatter 5.3.1. Signal Propagation. When �=0.1,10%ofallsub- plot in Figure 7, increases with h. It infers that there is an areas of the target area are covered by the participants to obvious trend (general spatial variation of the mean value) collect WLAN RSSI values, and then we use all the WLAN of RSS distribution in the area. Compared with more widely data restored by the Kriging spatial interpolation algorithm used ftting functions, e.g., the spherical and exponential according to 10% of the sample data of all data in the target 10 Wireless Communications and Mobile Computing

− −   − −  −  −

 −  −

 −  − width (m) width − (m) width − RSSI (dBm) RSSI   (dBm) RSSI − −   − −  −  −                     length (m) length (m) � Figure 6: When �=0.1, using the Kriging interpolation algorithm Figure 7: When =0.3, using the Kriging interpolation algorithm restores the target area WLAN signal RSSI heat map. restores the target area WLAN signal RSSI heat map.

11 area, as shown in Figure 6. Te lighter color of the heat map 10 indicates that the signal power of the WLAN received by 9 smart phones at the location is greater, whereas the darker 8 the heat map color, the smaller the WLAN power received by 7 phonesatthelocation.InFigure6,itcanbeseenthatthecolor in some areas is not smooth enough, and the RSSI data value 6 fuctuates greatly and its continuity is poor. Figure 7 shows 5 that 30% of all subareas of the target area are covered by the 4 perceived user and collect valid sensory data. Ten 30% of 3 all the data in the target area are restored using the Kriging (dBm) Error Interpolation interpolation algorithm to restore the data of all WLANs 2 in the target area. Te color in the heat map is excessive. 1 Comparing the two fgures, we found that the WLAN RSSI 0 value of the target area restored by spatial interpolation using 0.05 0.1 0.15 0.2 0.25 0.3 30% of all data using the target area is more accurate than the Sample Data Ratio p WLAN received power of the restored 10% of the target area using all the data. Nearest Neighbor Interpolation Inverse Distance Weighted Interpolation Kriging Spatial Interpolation

5.3.2. Interpolation Algorithm Error Comparison. To verify Figure 8: Te interpolation error comparison of NN interpolation the accuracy of the Kriging spatial interpolation algorithm, algorithm, IDW interpolation algorithm, and Kriging interpolation we used the nearest neighbor interpolation (NN) algorithm, algorithm when the sample data proportion is diferent. inverse distance weighted (IDW) interpolation algorithm, and Kriging interpolation algorithm to predict the restoration goal under the same data volume of sample data. error using diferent interpolation algorithms, which can be Figure 8 shows the error comparison of the NN interpo- expressed as follows: lation algorithm, IDW interpolation algorithm, and Kriging � interpolation algorithm when the sample data occupies all 1 � ∗� �= ∑ ��� −� � , �=1,2,...,� (20) the data in the target area. To verify the accuracy of diferent 5� �=1 interpolation algorithms, we selected fve subareas in the target area as comparison regions interpolated using diferent where � denotes the error value afer one of the interpolation interpolation algorithms, and the positions of the fve sub- algorithms uses multiple interpolations. � is the number of areas are shown as the positions of the fve blue stars in the interpolation experiments using this interpolation algorithm. target area, shown in Figure 8. Interpolation error calculation �� indicates that this point uses some interpolation algorithm process is to choose to select each of the fve subareas to to obtain the estimated value based on the sample point data ∗ use three diferent interpolation algorithms to calculate the near the location. � is the measured data in the target area. estimated value of the point based on the nearby sample Figure 8 shows the error of diferent interpolation algo- data, and then take the absolute value of the estimated value rithms when the sample data occupies diferent proportions and the measured value of the point. Ten the experiment of the overall data. When the sample data occupies 0.05 of is repeated several times to get the average value and the the total data, the interpolation error of the three diference Wireless Communications and Mobile Computing 11

 budget, only some of the participants can join the sensing task to collect the radio environment information, which  leads to incomplete radio environment information. To solve the problem, the Kriging algorithm is proposed to infer the  missing radio environment information data with collected sample radio environment information data. Te perfor-  mance is compared to the NN and IDW algorithms over diferent levels of sparsity. Te simulation result shows that  Kriging interpolation algorithm can infer the missing radio environment information data and generates more accurate  radio environment information data than the NN and IDW

Interpolation Error) ( dBm algorithm. In the future, some data estimating and processing  methods [24–26] can be used to add the sensing data for constructing the radio environment map.  . . . . . . Sample Data Ratio p Data Availability

Nearest Neighbor Interpolation Te data used to support the fndings of this study are Inverse Distance Weighted Interpolation included within the article. Kriging Spatial Interpolation Figure 9: Te interpolation error comparison of NN interpolation Conflicts of Interest algorithm, IDW interpolation algorithm, and Kriging interpolation algorithm when the sample data proportion is diferent. Te authors declare that they have no conficts of interest.

Acknowledgments algorithms is relatively large. Te interpolation errors of the nearest neighbor interpolation, Kriging spatial interpolation Tis work was jointly supported by (1) National Natural and inverse distance weighting are 10dBm, 7dBm, and 6dBm, Science Foundation of China (No. 61771068, 61671079) and respectively. With the increase of the proportion of sample (2) Beijing Municipal Natural Science Foundation (No. data, the errors of the three interpolation algorithms are 4182041). reduced. Among them, the error of Kriging spatial interpo- lation algorithm decreases sharply. Ten the error of three interpolation algorithms tends to be stable, and it can be References clearly seen that when the nearest neighbor interpolation [1]P.Singh,M.Kumar,andA.Das,“Efectivefrequencyplanning algorithm is used, interpolation has the greatest error. More- to achieve improved KPI’s, TCH and SDCCH drops for a real over, the IDW interpolation algorithm and Kriging spatial GSM cellular network,” in Proceedings of the 2014 International interpolation algorithm have smaller error when restoring the Conference on Signal Propagation and Computer Technology WLAN electromagnetic wave environment of the target area (ICSPCT), pp. 673–679, India, July 2014. according to the sample data. [2]Z.Hou,Y.Zhou,L.Tian,J.Shi,Y.Li,andB.Vucetic, Figure 9 shows the values of the radio environment data “Radio environment map-aided doppler shif estimation in LTE obtained by interpolating sample data from the nearest fve railway,” IEEE Transactions on Vehicular Technology,vol.66,no. regions in a subregion of the target region using the nearest 5, pp. 4462–4467, 2017. neighbor interpolation algorithm, inverse distance weighted [3] K. Ichikawa and T. Fujii, “Radio environment map construction interpolation algorithm, and Kriging interpolation algo- using Hidden Markov Model in multiple primary user environ- rithm. It can be seen from the fgure that as the proportion ment,” in Proceedings of the 2017 International Conference on of sample data increases; that is, the number of data samples Computing, Networking and Communications, ICNC 2017,pp. increases; the errors obtained by using the three interpolation 272–276, January 2017. algorithms and the real data frst decrease. Moreover, the NN [4] K. Katagiri, K. Sato, and T. Fujii, “Crowdsourcing-Assisted interpolation algorithm has the largest error variation when Radio Environment Maps for V2V Communication Systems,” the sample data is less than 0.2, the interpolation error of the in Proceedings of the 2017 IEEE 86th Vehicular Technology IDW interpolation algorithm decreases as the sample data Conference (VTC-Fall),pp.1–5,September2017. increases. [5]R.K.Ganti,F.Ye,andH.Lei,“Mobilecrowdsensing:current state and future challenges,” IEEE Communications Magazine, 6. Conclusion vol. 49, no. 11, pp. 32–39, 2011. [6]B.Guo,Z.Wang,Z.Yuetal.,“Mobilecrowdsensingand In this paper, we frst introduced MCS to collect data for computing: the review of an emerging human-powered sensing REM construction and proposed a system architecture to paradigm,” ACM Computing Surveys,vol.48,no.1,article7, collect the radio environment information. Limited by the 2015. 12 Wireless Communications and Mobile Computing

[7] H. Ma, D. Zhao, and P. Yuan, “Opportunities in mobile crowd [22] R. G. Keys, “Cubic convolution interpolation for digital image sensing,” IEEE Communications Magazine,vol.52,no.8,pp.29– processing,” IEEE Transactions on Signal Processing,vol.29,no. 35, 2014. 6, pp. 1153–1160, 1981. [8] J. Riihijarvi, J. Nasreddine, and P. Mahonen, “Demonstrating [23] D. Shepard, “A two-dimensional interpolation function for radio environment map construction from massive data sets,” irregularly-spaced data,” in Proceedings of the ACM 23rd in Proceedings of the 2012 IEEE International Symposium on National Conference (ACM ’68),pp.517–524,1968. Dynamic Spectrum Access Networks, DYSPAN 2012, pp. 266-267, [24]Z.Ma,J.Xie,H.Lietal.,“Teroleofdataanalysisinthe October 2012. development of intelligent energy networks,” IEEE Network,vol. [9] S. Ulaganathan, D. Deschrijver, M. Pakparvar et al., “Building 31,no.5,pp.88–95,2017. accurate radio environment maps from multi-fdelity spectrum [25] S. P. Bharati, F. Cen, A. Sharda, and G. Wang, “RES-Q: sensing data,” Wireless Networks,vol.22,no.8,pp.2551–2562, Robust Outlier Detection Algorithm for Fundamental Matrix 2016. Estimation,” IEEE Access,vol.6,pp.48664–48674,2018. [10] S. Ureten, A. Yongacoglu, and E. Petriu, “A comparison of [26] Z. Ma, J.-H. Xue, A. Leijon, Z.-H. Tan, Z. Yang, and J. Guo, interference cartography generation techniques in cognitive “Decorrelation of neutral vector variables: theory and appli- radio networks,” in Proceedings of the ICC 2012 - 2012 IEEE cations,” IEEE Transactions on Neural Networks and Learning International Conference on Communications,pp.1879–1883, Systems,vol.29,no.1,pp.129–143,2018. June 2012. [11] J. Liang, M. Liu, and X. Kui, “A survey of coverage problems in wireless sensor networks,” Sensors & Transducers,vol.163,no.1, pp. 240–246, 2014. [12] Z. Wei, Q. Zhang, Z. Feng, W. Li, and T. A. Gulliver, “On the construction of Radio Environment Maps for Cognitive Radio Networks,” in Proceedings of the IEEE Wireless Communications and Networking Conference, WCNC 2013,vol.1,pp.4504–4509, April 2013. [13]V.Atanasovski,J.VanDeBeek,A.Dejongheetal.,“Construct- ing radio environment maps with heterogeneous spectrum sensors,”in Proceedings of the IEEE International Symposium on Dynamic Spectrum Access Networks, DySPAN 2011,vol.3,pp. 660-661, May 2011. [14]M.Srivastava,T.Abdelzaher,andB.Szymanski,“Human- centric sensing,” Philosophical Transactions of the Royal Society A: Mathematical, Physical & Engineering Sciences,vol.370,no. 1958, pp. 176–197, 2012. [15] A.A.Khan,M.H.Rehmani,andA.Rachedi,“Cognitive-Radio- Based Internet of Tings: Applications, Architectures, Spec- trum Related Functionalities, and Future Research Directions,” IEEE Wireless Communications Magazine,vol.24,no.3,pp.17– 25, 2017. [16] A. Yaqot and P. A. Hoeher, “Efcient Resource Allocation in Cognitive Networks,” IEEE Transactions on Vehicular Technol- ogy,vol.66,no.7,pp.6349–6361,2017. [17] J. Talvitie, M. Renfors, and E. S. Lohan, “Distance-based inter- polation and extrapolation methods for RSS-based localization with indoor wireless signals,” IEEE Transactions on Vehicular Technology,vol.64,no.4,pp.1340–1353,2015. [18] S. Grimoud, B. Sayrac, S. Ben Jemaa, and E. Moulines, “An algorithm for fast REM construction,” in Proceedings of the 2011 6th International ICST Conference on Cognitive Radio Oriented Wireless Networks and Communications, CROWNCOM 2011, pp. 251–255, June 2011. [19] A. Umbert, F. Casadevall, and E. G. Rodriguez, “An outdoor TV band Radio Environment Map for a Manhattan like layout,” in Proceedings of the 13th International Symposium on Wireless Communication Systems, ISWCS 2016,pp.399–403,September 2016. [20] D. Zhang, L. Wang, H. Xiong, and B. Guo, “4W1H in mobile crowd sensing,” IEEE Communications Magazine,vol.52,no.8, pp. 42–48, 2014. [21] R. Olivier and C. Hanqiang, “Nearest Neighbor Value Interpo- lation,” International Journal of Advanced Computer Science and Applications,vol.3,no.4,2012. Hindawi Wireless Communications and Mobile Computing Volume 2019, Article ID 5650245, 13 pages https://doi.org/10.1155/2019/5650245

Research Article Enhanced Android App-Repackaging Attack on In-Vehicle Network

Yousik Lee,1 Samuel Woo,2 Jungho Lee,3 Yunkeun Song,1 Heeseok Moon,4 and Dong Hoon Lee 5

1 ESCRYPT, 4F, ABN Tower, 331 Pangyo-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13488, Republic of Korea 2Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Republic of Korea �ℎ 3Korea Information Certifcate Authority Inc., 5 Fl C-dong, PDC, 242, Pangyo-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13487, Republic of Korea 4KATECH, 74 Yongjeong-ri, Pungse-myeon, Dongnam-gu, Cheonan, Chungcheongnam-do, Republic of Korea 5Korea University, Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea

Correspondence should be addressed to Dong Hoon Lee; [email protected]

Received 18 October 2018; Accepted 27 December 2018; Published 3 February 2019

Guest Editor: Karl Andersson

Copyright © 2019 Yousik Lee et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Te convergence of automobiles and ICT (information and communication technology) has become a new paradigm for the development of next-generation vehicles. In particular, connected cars represent the most in-demand automobile-ICT convergence technology. With the development of 5G technology, communication between vehicle and external device using autonomous driving and Internet of things (IoT) technology has been remarkably developed. Control of vehicles using smart phones has become a routine feature, and over 200 Android apps are in use. However, Android apps are easy to tamper by repackaging and allowing hackers to attack vehicles with using this vulnerability, which can lead to life-critical accidents. In this study, we analyze the vulnerabilities of connected car environments when connecting with IoT technologies and demonstrate the possibility of cyberattack by performing attack experiments using real cars and repackaging for commercial apps. Furthermore, we propose a realistic security technology as a countermeasure to attain safety against cyberattacks. To evaluate the safety of the proposed method, a security module is developed and a performance evaluation is conducted on an actual vehicle.

1. Introduction technology. [4] 5G will be the backbone of IoT,overcoming all space-time constraints and providing a complete connection Various electronic control units (ECUs) are being mounted with all the user-centric “things”. [5] Te existing V2X in the latest vehicles to enhance the safety and comfort of the technology is evolving into C-V2X using LTE or 5G, and vehi- driver and passengers, and this automobile-ICT (information cles are connected to external devices via IoT technologies. and communication technology) convergence has become Vehicles, which have for decades been typical machines, are a new paradigm for the development of next-generation evolving into large IT systems. As a result, IT experts can now vehicles [1, 2]. As the development of automotive electronics participate in the development of vehicles and the general is accelerating, the proportion of electric parts in automobile public can understand vehicles more easily. production has exceeded 40% and is expected to exceed 50% However, with the development of automotive electron- by 2020. Along with the increase in automotive electronic ics, vehicles have become a new target for hackers [6–8]. Lee parts, communications between vehicles and outside data et al. published the results of a cyberattack experiment that services are also increasing. It is expected that around 75% of analyzed the vulnerabilities of an ELM327-based CAN-to- all vehicles worldwide will be connected cars using wireless Bluetooth device and apps for vehicles [9]. Furthermore, as networks by 2020 [3]. Connected cars, one of the leading use shown in the hacking of GM OnStar by Samy Kamkar and case of IoT, are growing rapidly with the development of 5G the study on the vulnerabilities of in-vehicle smartphones by 2 Wireless Communications and Mobile Computing

Mikhail Kuzin and Victor Chebyshev, vulnerable smartphone Table 1: PID codes and parameters. apps are emerging as a new attack surface for vehicles [10, 11]. In the present study, we automate app analysis tasks that PID Code Description Units Formula ∗ we performed manually in [9] and analyze the vulnerabilities 04 Calculated engine load value % A 100/255 of 213 apps registered in the Google Play store. Based on 0A Fuel pressure kPa 3∗A the results of this analysis, we carry out attacks to control 0C Engine RPM kPa (256∗A+B)/4 actual vehicles afer repackaging the 10 most popular apps 0D Vehicle speed Km/h A among the commercial apps for vehicles. Finally, we propose a security technology to protect against such attacks and develop an access control device based on a whitelist for Te connected car environment consists of the following communication service security for ELM327 devices. Tese components [12, 13]: are universal access control devices applicable to all vehicles if CANIDsarepossibletobeconfguredbycarmanufacturers. (i) Te vehicle in which ECUs have been installed and an Tis paper organized as follows. Background describes in-vehicle network has been confgured the overview of basic concepts and components. Next, (ii) Te portal for providing various services to the Proposed Attack Model introduces how the attack model vehicle we proposed is constructed in both laboratory and real vehicle environment. Countermeasure describes the security (iii) Te communication link for connecting the vehicle measure against the attack we conducted. Finally, Conclusion with the portal presents our conclusions. In such an environment, vehicles have multiple ECUs installed in them, which are connected to an in-vehicle 2. Background network such as a CAN BUS system. Te communication link is constructed using a wireless 2.1. CAN (Controller Area Network). To support efcient communication device such as a telematics ECU or an communication among ECUs, BOSCH developed the Con- ELM327. Te portal is divided into web-based services and troller Area Network (CAN) in the early 1980s. CAN is smartphone application-based services. a sender ID-based broadcast communication technology Trough experiments, this study proves that vehicles are that supports the bus network topology. CAN drastically exposed to serious threats if connected car environments are decreased the complexity and length of in-vehicle commu- constructed without solving the vulnerabilities of in-vehicle nication lines by resolving the drawbacks of point-to-point networks; a security mechanism is then proposed to solve this communication. In the data transmission process between problem. Te attack model that we propose has been designed ECUs using CAN, the sending ECU includes its unique ID based on the connected car environment shown in Figure 1. in the CAN data frame before sending the data. Te receiving ECUs can selectively receive the data afer checking the ID of the sending ECU included in the broadcast data frame. 2.3. OBD Protocol. OBD (on-board diagnostics) is used to CAN bus systems are classifed into the following two types diagnose the vehicle state. Te frst OBD was produced to depending on the length of the ID in the data frame: control the exhaust gases of vehicles [14]. As the development of automotive electronics accelerates, more ECUs have been (i) Standard CAN 2.0A (11-bit ID) installed and the OBD has developed to diagnose them and check the fault list. Later, OBD was established as an (ii) Extended CAN 2.0B (29-bit ID) international standard (ISO 15765-4), and it has become possible to confrm the vehicle state with an automotive Te data frame is composed of SOF (start of frame), Arbitra- diagnosis device through standardized terminals [15]. In tion ID (arbitration), Control, Data, CRC (cyclic redundancy 1996, the US government mandated the installation of OBD- check), ACK (acknowledge), and EOF (end of frame) felds. II in all vehicles sold in the US to control the environmental Te standard CAN 2.0A format uses an 11-bit ID as the problems of exhaust gases. arbitration bit, and the extended CAN 2.0B format uses a 29- Te checking of vehicle status using OBD-II operates bit ID as the arbitration bit. via the request/response method and the OBD PID (OBD Because CAN was designed only for closed network Parameter ID) which is used in checking process defned in environments, it does not provide basic information protec- the SAE J1989 standard [16]. Te structure of the CAN ID tion features (e.g., confdentiality, authentication, and access and data frame for communication defned in this standard is control). Recently, as connected car services were commer- shown in Figure 2. Te PID codes and parameters are defned cialized, the use of FOTA (Firmware Over-Te-Air) spread, in Table 1 [17, 18]. in-vehicle networks, and external networks has become Te diagnostic information of vehicles obtained through interconnected. As a result, CAN has been exposed to the OBD-II can be easily acquired not only through the dedicated same cyberattacks experienced in general IT environments. diagnostic devices used in car repair shops, but also through smartphone applications and ELM327 modules purchased 2.2. Connected Car. Te connected car environment implies from afermarket suppliers. Terefore, malicious users can that vehicles are always connected to an external network. control vehicles by manipulating the diagnostic information Wireless Communications and Mobile Computing 3

e Vehicle e Communication Link Aer Marker Wireless Pairing ECU OBD2 to Bluetooth Mobile IN-Vehicle or Wi-Fi DeDevicevice ECU CAN BUS ECU Component Network

Mobile Application

Mobile Radio Communication Network

e Portal

Figure 1: Connected car environment composed of ELM327 devices and mobile applications.

• Request PID Frame ID Field Data Field

Data 0x7DF Mode PID Code Not Used Length

11bit 1Byte 1Byte 1Byte 5Byte

• Response PID Frame ID Field Data Field

Parameter 0x7E8 Data Mode PID Code Not Used 0x7E9 Length A B C D Parameter Parameter Parameter Parameter

11bit 1Byte 1Byte 1Byte 1Byte 1Byte 1Byte 1Byte 1Byte

Figure 2: Structure of CAN ID and data frame for communication. of vehicles, and security measures are required for the to identify the threats in each attack phase by analyzing diagnostic devices and related protocols. the cyberattack process, and to increase the resilience of organizations by defeating and neutralizing the purpose, intention, and activities of attackers. Te Cyber Kill Chain 2.4. ELM327. ELM327 is a microcontroller for car repair consists of seven steps: reconnaissance, weaponization, deliv- used to diagnose the vehicle state. If a wireless communica- ery, exploitation, installation, command and control, and tion module such as Bluetooth is added to ELM327,the driver actions on objectives. Te characteristics of each phase are can check the vehicle state in real time by connecting his/her described in Table 2 [20]. smartphone with ELM327. Te process of checking the vehicle state using a smartphone and ELM327 is illustrated in Figure 3 3. Proposed Attack Model 2.5. Cyber Kill Chain. Lockheed Martin proposed a cyberat- Te attack model we propose is more realistic than those tack process that is generally applicable to all cyberattacks of existing studies and has been designed on the basis of [19]. Tey also proposed the Cyber Kill Chain method two threats. Te frst threat is that smartphone apps for 4 Wireless Communications and Mobile Computing

Smart phone Vehicle Application

Step 1 Pairing Bluetooth Connecting

With ELM327 ECU

Step 2 AT Command ELM327 Module Default Setting ECU

Loop …. Step 3 PID Request Vehicle Data

Request(PID) ELM327

Step 4 ECU PID Response Vehicle Data Request(PID) ECU Step 5

Show Information ECU

Figure 3: Process of checking vehicle status using smartphone and ELM327.

Table 2: Cyber Kill Chain phases and characteristics. 3.2. Target Vehicle. Te target vehicle is assumed to utilize the connected car environment shown in Figure 1. Te target Phase Action vehicle uses CAN to facilitate communication among the Reconnaissance Identify the target ECUs.However,becausetheCANbussystemdoesnot Weaponization Prepare the operation guarantee authentication for communicated data frames, a Delivery Launch the operation CAN data frame may be used in a retransmission attack Exploitation Gain access to victim [21, 22]. Installation Establish beachhead at the victim Command and Control Remotely control the implants 3.3. Victim Behavior. Te driver of the target vehicle (victim) Actions on Objectives Achieve the mission’s goal downloads a vehicle diagnosis app from the app market onto his/her smartphone and receives various services while driving his/her vehicle. Te vehicle diagnosis app that the vehicles can be easily forged/repackaged and redistributed. driver (victim) downloaded is assumed to be a malicious app Te second threat is that attackers can attack wirelessly distributed through the app market by the attacker. Te driver using repackaged malicious smartphone apps at any time. (victim) using the malicious app cannot recognize the attack Te proposed attack model is composed of an attacker, a behaviors of the malicious app (injecting a data frame that target vehicle, and a victim. Te characteristics of every can control a specifc ECU). entity are gathered to form one attack model. We analyzed the proposed attack model using the cyber intrusion kill 3.4. Attack Model. Te driver of the target vehicle (victim) chain. who has downloaded the vehicle diagnosis app can receive various services while driving his/her vehicle. Te attacker 3.1. Attacker Ability. Using a vehicle diagnostic device, the can infict cyberattacks to many unspecifed vehicles by attacker can acquire a CAN data frame that can drive a abusing this service situation. Te connected car environ- specifc ECU mounted in the target vehicle. Furthermore, ment in Figure 1 represents an environment in which a the attacker can download a vehicle application that is wireless communication module (ELM327) is installed in distributed/sold in the app market and repackage it in a the OBD-II terminal. Trough this terminal, the vehicle can desired form. Te attacker can then inject a desired CAN data connect to the in-vehicle CAN, and the smartphone app frame into the in-vehicle CAN using a malicious app. Te is always interconnected with the in-vehicle CAN during malicious app repackaged by the attacker can be distributed vehicleoperationviatheparingoftheusersmartphoneand through a third-party market. the ELM327 module. Te attack model process is divided into Wireless Communications and Mobile Computing 5

Table 3: Analysis of proposed attack model using Cyber Kill Chain. Table 4: AT commands for controlling vehicles.

(i) Analysis of vehicle control packet (using AT Command Description Group a diagnostic device) Reconnaissance Z Reset All General (ii) Analysis of an Android app for vehicles SH xyz Set Header to xyz OBD (iii) Analysis of ELM327 protocol AL Allow Long(>7byte) message ODB Weaponization (i) Repacking the Android app for vehicles AR Automatic Receive OBD Delivery (i) Distribution to third-party app market R0/R1∗ Responses Of/On∗ OBD Exploitation (i) User downloads and installs the app on SP x Set Protocol to h OBD Installation his/her smartphone CAF0/CAF1∗ CAN Automatic Formatting Of/On∗ CAN (i) Analysis of vehicle operation state Command & Control through the in-vehicle network packets (i) Forced control of the automotive E/E Actions on Objectives (1) Preparation phase 1: communication protocol analy- system, causing a trafc accident sis of the CAN-to-Bluetooth module (ELM327 mod- ule). an attack preparation phase and an attack performance phase. (2) Preparation phase 2: vulnerability analysis of the Te entire process including the attack preparation phase and vehicular application and production of the malicious attack performance phase is as follows: vehicular application. (1) Attacker analyzes the communication vulnerabilities (3) Actual attack phase 1: LABCAR-based attack experi- of a wireless communication module for vehicles ment. (based on ELM327). (2) Attacker downloads an app for vehicles from the app (4) Actual attack phase 2: real car-based attack experi- market, using a wireless communication module for ment. vehicles based on ELM327. Finally, we performed a risk assessment for all apps related to (3) Attacker generates malicious code using the analysis ELM327 sold on the Google Store. In order to risk assessment, results from step (1). we manufactured automated analysis tools. (4) Attacker inserts malicious code in the downloaded app for vehicles and repackages it. 4.1. Preparation Phase (Analysis). In this section, the vulner- (5) Attacker distributes the repacked malicious app abilities of the ELM327 module and the vehicle diagnosis through the app markets (including third-party mar- app are analyzed. In addition, all the apps sold/distributed in kets). the Android app market are examined and the risk of app- (6) Victim installs the ELM327-based wireless commu- repackaging attacks is analyzed. nication module in the OBD-II terminal of his/her vehicle. (A) Analysis of Communication Protocol between Vehicle Diagnosis App and ELM327. As described in Section 2.4, (7) Victim downloads the app for vehicles to his/her the ELM327 module is mounted in the OBD-II port of smartphone from the app market (or a third-party the vehicle, and delivers the vehicle state information to market) and installs it. It is assumed that the down- the vehicle diagnosis app according to the request-response loaded app is the malicious app distributed by the method. Te ELM327 module generally uses a fxed CAN ID attacker. when sending a request message. However, in a special case, (8)Victimrunsthemaliciousappanddriveshis/her the CAN ID and data of the request message can be changed vehicle. using the AT command provided by ELM Electronics. Te (9) As soon as the vehicle starts operating, the malicious AT commands that can be used to control vehicles are listed app carries out a malicious act. in Table 4. However, even if the CAN ID and data are changed, the Te attack model consists of nine steps in total as outlined vehicle recognizes any data as normal data if it follows the above. Table 3 shows an analysis of the proposed attack model communication protocols of AT commands and ELM327. In using the Cyber Kill Chain. other words, if the CAN data desired by the user is sent to ELM327 as shown in Figure 4, the same information is 4. Attack Experiment received in the vehicle. Every vehicle diagnosis app based on ELM327 acquires In this chapter, we prove the possibility of forced control of the vehicle state information using the OBD PID. Terefore, vehicles using an app repackaged by an attacker. Our attack the OBD PID code can be easily found by decompiling and experiment consists of two phases: preparation phases and analyzing the vehicle diagnosis app. In the next section, actual attack phases. Te preparation phase and the actual we discuss the method of fnding vulnerabilities using the attack step are each divided into two detailed steps: AT command and OBD PID information in a commercial 6 Wireless Communications and Mobile Computing

CAN ID Change to 0x222h

Insert CAN DATA [0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88]

CAN ID Change to 0x700h

Insert CAN DATA [0x12,0x34,0x45]

CAN ID Change to 0x710h

Insert CAN DATA [0x12,0x34,0x45]

Figure 4: Message transmission from outside the vehicle to ELM327. diagnosis app, and the method of repackaging it into a attacked with the desired vehicle control command malicious app. and repackages it. (ii) Delivery of attack command in the specifc condition (B) Analyzing and Repackaging Vehicle Diagnosis App.For of the vehicle. vulnerable vehicle diagnosis apps, the strings of the ELM327 ATcommandandOBDPIDareexposedinplaintext.Ifthe Te vehicle diagnosis app can check the vehicle status using attacker has decompiled the app, he/she can easily fnd the the OBD PID. Te attacker can insert a desired vehicle control AT command and OBD PID by string search only. In this command by using the repackaging method with the OBD ID chapter, the method of searching attack points and inserting in the above-mentioned analysis process. attack codes using the ELM327 AT command and OBD PID is In the next section, the forgery risk associated with the AT described from the standpoint of an attacker, and the process command and OBD PID of existing vehicle diagnosis apps of producing a malicious app using this method is explained. distributed in the Android app market is analyzed. Figure 3 shows the process in which the vehicle diagnosis app operates. Te vehicle diagnosis app uses AT command 4.2. Actual Attack Experiment. Tissectiondescribesthe and OBD PID afer being connected with ELM327. Tis attack experiment, which utilizes an actual vehicle and allows to analyze the vehicle diagnosis app easily. the results from the preparation phase. Te actual attack When we decompile a target app for attack and search experiment in this section corresponds to steps (4) throw the AT command and OBD PID, they are searched. Tis (7) of the seven steps of the Cyber Kill Chain. Te attack searched string is then replaced with the vehicle control string experiment environment is outlined in Table 5. desired by the attacker, and a malicious app is produced by repacking the app. Figure 5 shows how a desired vehicle (A) LABCAR. A LABCAR-based attack experiment was control command can be inserted by simply modifying the performed frst to verify the efectiveness of the attack tools searched AT command or the OBD PID. produced in the preparation phase. LABCAR is confgured Te vehicle diagnosis app checks the vehicle status using as shown in Figure 8, and each component has the functions the OBD PID. If the attacker can fnd the receiving location described below. information in the decompiled vehicle diagnosis app, that If the attack is successful in the above-mentioned LAB- location can be specifed as the attack point. Figure 6 shows CAR environment, it is checked through the operation the results of fnding the attack point and acquiring the of the CANoe monitoring tool and the instrument panel. vehicle information. If the attacker can insert the attack code Tis LABCAR-based attack experiment confrmed that it is for controlling the vehicle based on this information, he/she possible to inject a malicious data frame into an in-vehicle can arbitrarily control the vehicle. Figure 7 shows that the CAN through an app-repackaging attack. vehicle control command has been inserted at the desired point by the attacker. (B) Actual Vehicle. We also performed an attack experiment Trough the above analysis process, two attack tools can using an actual vehicle. Te results of this experiment were be produced. Te attack tool types to be produced and the recorded in a video, which was then uploaded to the web [23]. method of producing the attack tools are as follows: Te attack experiment was carried out using the attack tools produced in the app analysis step. Tere were three attacks in total, which are described below. (i) Delivery of attack command at the time the app is started when the vehicle diagnosis app is started; it (i) Door Unlock. Afer the driver leaves the vehicle in initializes ELM327 using the AT command. Tus, the which ELM327 is installed, the attacker accesses the attacker replaces the AT command in the app to be ELM327 through a smartphone and unlocks the door Wireless Communications and Mobile Computing 7

CAN ID Change to 0x060h

Insert CAN DATA [ 0x11,0x22,0x33,0x44,0x55,0x66,0x77]

Figure 5: Message inserting CAN data by modifying AT command or OBD PID.

Vehicle PID Parameter Data

PID Parameter Data Log

Speed PID(0x0D)

Speed Date

Figure 6: Message attack location of the analyzed vehicle diagnosis app and the acquisition of vehicle information.

Table 5: Cyber AT commands for controlling vehicles.

Experiment tools Model Name Functions & Features Note (i) Target Vehicle Vehicle Omitted Actual vehicle (ii) Analyze CAN Data & Monitoring Target (i) Testing Simulation ECU Instrument panel Omitted LABCAR actual vehicle (ii) Target ECU (i) Simulation CAN BUS CAN BUS CAN CASE XL LABCAR (ii) Connecting ECU and CANoe ELM327 ELM327 (Version 1.5) (i) Diagnostic Device LABCAR actual vehicle Smartphone Samsung Galaxy A3 (OS Version 5.1.1) (i) Target Smartphone Device LABCAR actual vehicle Smali & BakSmali Smali & BakSmali (Version 2.1.0) (i) APK Assemble and Disassemble Tool LABCAR actual vehicle Signapk Signapk (i) APK Signing Tool LABCAR actual vehicle CANoe CANoe (Version 7.1) (i) CAN Monitoring Tool & Capture Tool LABCAR actual vehicle (i) Target Application Diagnostic Application Omitted LABCAR actual vehicle (ii) Deployed Application 8 Wireless Communications and Mobile Computing

Target Vehicle PID Parameter Data

Malicious Function

….

….

Figure 7: Vehicle diagnosis app in which the vehicle control command has been inserted at the desired point.

CANoe Bus Cluster

CANoe

Smart ELM327 Phone

Figure 8: Operation LabCar simulation environment tool.

lock. Once the door lock is unlocked, the attacker started it, the Engine Of command was delivered and can steal items from inside the vehicle and commit the target vehicle was stopped. various other malicious acts. (iii) Engine Of (Vehicle’s Status). In this experiment, the repackaged app delivered the Engine Of command (ii) Engine Of (Application Start). In this experiment, in a specifc state of the vehicle. It was confrmed that the Engine Of command set by the attacker is deliv- the engine stopped at the vehicle speed specifed by ered when the repackaged app is started. Te target the attacker. vehicle was empty for safety and this experiment was performed with a safety device. It was confrmed Trough the above three experiments, it was found that that when the attacker installed a malicious app and various apps sold/distributed in the Google Play store were Wireless Communications and Mobile Computing 9

Start

Iteration K=all APK files in the folder

Change extension from APK to ZIP

Unzip and 7 zip get DEX file

Decompile and BakSmali get SMALI files

Iteration K=all SMALI files possible to attack AT command YES exist?

NO impossible to attack

Reporting

End

Figure 9: Operation process of the automatic vulnerability analysis. exposed to the app-repackaging attack. Tere are primarily two reasons that the app-repackaging attack is possible. First, it is easy to analyze applications in terms of sofware. Second, the attack command can be delivered to the internal network through the OBD-II port of the vehicle in terms of hardware. In the next chapter, we will discuss efective countermea- sures against the above-mentioned attack.

4.3. Analysis of the Risk of Commercial Vehicle Diagnosis Apps. In this section, the forgery risk associated with the ATcommandandOBDPIDofappsisanalyzed.Teriskis Figure 10: Running screen of the automatic vulnerability analysis. analyzed by checking the existence of the AT command in the target app. In this study, the risk analysis was conducted for 213 apps related to ELM327 among the vehicle-related apps in theGooglePlaystore(asofJune2017). apps can be automated because it is a mechanical, repetitive To check the existence of the AT command, the DEX task [24]. In this study, an automatic vulnerability analysis (Dalvik Executable) fle must be found frst, which is a byte tool was developed for this task [25]. Figure 9 shows the code-level executable fle that exists in the APK (Android process used by the automatic vulnerability analysis tool, and Application Package) fle. Ten the smali fle is extracted Figure 10 shows the running screen. by decompiling the DEX fle, and the existence of the Te results of analyzing the apps using the automatic AT command is determined. Te searching and analyzing vulnerability analysis tool are outlined in Table 6. Te number process for checking for AT commands in several hundred of apps that have a plain text AT command that can be 10 Wireless Communications and Mobile Computing

OBD2 Port OBD2 Port

Firewall Unit ELM327 OBD2 Port Module Unit

ELM327 Module

Firewall ELM327 Unit Module

Firewall Unit Firewall Unit Manufactured Concept Prototype Firewall Unit (a) (b) (c) Figure 11: Access control system production process, (a) concept design, (b) prototype, and (c) actual manufacturing tool.

Table 6: App analysis results using automatic vulnerability analysis (A) Whitelist-Based Firewall. Most external devices such as tool. ELM327 access the in-vehicle CAN through the OBD-II port. Possible to attack Impossible to Car manufacturers must establish an access control policy Number of apps (%) attack (%) for CAN data frames that are sent from external devices connected to the OBD-II port. In order to construct a safe Total 213 166 (78) 47 (22) in-vehicle CAN communication environment, a frewall is Paid apps 66 58 (88) 8 (12) required to verify the ID of a CAN data frame (which comes Free apps 147 108 (73) 39 (27) in through the OBD-II port) and selectively send it to the subnetwork. Car manufacturers can perform access control for CAN used for attack is 166, which accounts for approximately 78% data frames that fow in through the OBD-II port during of the 213 apps in total. In other words, most of the apps vehicle operation by generating a whitelist. Te whitelist- in the Google Play store can be used for attacks through based frewall can be designed/developed as follows: app-repackaging. Furthermore, both free and paid apps are exposed to attacks. (1) Car manufacturer defnes a CAN ID that can be Te results of analyzing the apps using the automatic fowed into the in-vehicle CAN through the OBD-II vulnerability analysis tool are outlined in Table 5. Te number port during vehicle operation. of apps that have a plain text AT command that can be (2) Car manufacturer develops a whitelist-based frewall used for attack is 166, which accounts for approximately 78% using the CAN ID defned in step (1) and installs it in of the 213 apps in total. In other words, most of the apps the OBD-II port. in the Google Play store can be used for attacks through app-repackaging. Furthermore, both free and paid apps are (3) Every external device can send data frames to the in- exposed to attacks. vehicle CAN only through the frewall installed at the OBD-II port. 5. Countermeasure (4) Te frewall checks the data frames sent from external devices and drops abnormal data frames. In this chapter, we propose a security mechanism to protect We designed and developed a whitelist-based frewall as against app-repackaging attacks. We designed a security shown in Figure 11. Tis frewall is installed between the mechanism considering the constraints of in-vehicle CANs OBD-II portand external device (ELM327 module) as shown and verifed it. Te proposed security mechanism is largely in Figure 11(a). Afer producing the prototype shown in divided into two methods: Figure 11(b), we connected it to an actual vehicle and per- (i) Malicious data frame blocking method using a formed an access control experiment. Te prototype module whitelist-based access control system was developed using an F28335-based microcontroller unit (MCU). All CAN data frames sent by ELM327 to the F28335- (ii) Code analysis defense method using app obfuscation based MCU were checked frst before being delivered to the Wireless Communications and Mobile Computing 11

Plain App

Found AT Command Obfuscated App

Figure 12: Result of running the automatic vulnerability analysis tool for obfuscated app.

OBD-II port. Terefore, all data frames that do not have the reverse engineering, and vehicle diagnostic device analysis predefned CAN ID could be dropped. were obtained. Using the acquired CAN data frames, wired We developed the frewall device as a fnished product and wireless attack models were proposed and actual vehicle with assistance from the Korea Automotive Technology hacking experiments were conducted. Among the proposed Institute. Te frewall device that we developed is shown attack models, the short-range wireless attack and the long- in Figure 11(c). Tis frewall is equipped with an MCU range wireless attack showed very threatening results. It based on an Infneon XC2265N (16-bit) microcontroller, has been proven that a short-range wireless attack can and the whitelist can be easily updated. Tis access control hack vehicles in a Bluetooth pairing between a smartphone policy for CAN data frames must be determined by the car infectedwithmaliciouscodeandthehands-freefunction manufacturer. of vehicles. Furthermore, it has been experimentally proven that long-range wireless attacks can be performed using (B) App Obfuscation.Teapp-repackagingattackwaspos- the vulnerabilities of the communication protocol and the sible because readability was provided for easy analysis of telematics devices installed in vehicles [7]. the internal app code, and the addition and repackaging of M. Wolf, T. Hoppe, and others proved through an the app contents were possible. Te app-repackaging attack is experiment based on the CANoe simulator that a message carried out largely in two steps. retransmission attack is possible in an in-vehicle CAN. To In the frst step, the AT command of the target app is solve this problem, they proposed a message authentication found. In the second step, the app is repackaged by inserting method based on certifcate-based ECU authentication and malicious code. From the viewpoint of the Cyber Kill Chain, symmetric keys. However, it is impossible to apply their pro- if only one of the attack steps is disabled, the attack cannot be posed security mechanisms to an actual vehicle environment performed. because the computing power of ECUs and the payloads of Practically, the most efective method for preventing an CAN data were not considered [27, 28]. app-repackaging attack is to prevent the AT command from D. K. Nilsson et al. proposed the DDA (Delayed Data being found. Authentication) method, which considered the computing Te obfuscation technique that is mainly applied in Java capacity of ECUs and the limited CAN message structure. Te is described in Table 7 [26]. DDA method uses the message authentication code (MAC) When class encryption is not utilized, it is difcult to to prevent message retransmission attacks. Tey proposed defend against attacks that fnd the AT command through a method of using the CRC feld, while pointing out the string encryption, renaming, control fow, and API hiding. shortage of areas available in the CAN message structure. Figure 12 shows the analysis result obtained using the Furthermore, they proposed a DDA-based method of authen- automatic vulnerability analysis tool developed in this study, tication in which four messages are grouped. However, the in a case where analyzing source code inside the app is made CRC feld in CAN cannot be used dynamically. In addition, difcult by applying the class encryption method, which real-time data processing must be guaranteed in vehicles. encrypts the DEX fle and dynamically loads it when running It is impossible to apply the DDA method to the vehicle the app. environment because it generates a delay of at least 80 ms in In order to defend against the attacks performed in this authentication [29]. study, it is recommended to apply class encryption, which Woo et al. proposed a data frame authentication method makes the AT command unsearchable by encrypting the DEX using 32-bit truncated MAC. In consideration of the lim- fle that includes the original source code. ited characteristics of the CAN data frame, they proposed amethodofusingtheCRCandExtendedIDfeldsfor 6. Related Studies message authentication. In addition, they proposed a method of updating the session key to strengthen the safety of K. Kosher et al. frst reported ECU forced control attacks the 32-bit truncated MAC. Tis method ensures real-time using actual vehicles. CAN data frames that can control data processing without generating additional data frames. vehicles by force using CAN message fuzzing, ECU frmware However, it is impossible to apply their method to an actual 12 Wireless Communications and Mobile Computing

Table 7: App analysis results using automatic vulnerability analysis tool.

Obfuscation Description Te used string is replaced with an encrypted string, and a decryption method is added to the class fle and the string encryption encrypted string is decrypted during runtime. Te classes, felds, and methods are renamed with meaningless names to make it difcult to analyze the renaming decompiled source code. Te positions of commands in the code area of the class fle are changed or trash commands are inserted to control fow make it difcult to analyze fow during decompiling. API hiding Sensitive libraries are used or the method calling is hidden. class encryption A specifc class fle is encrypted and stored, and the dynamically decrypted code is run during runtime.

vehicle environment because the CAN protocol itself must Conflicts of Interest be modifed in order to use the CRC feld for sending the message authentication code [6]. Te authors declare that they have no conficts of interest.

Acknowledgments 7. Conclusion Tis research was supported by the Korea Ministry of Land, In this study, we have shown that vehicles can be eas- Infrastructure and Transport. It was also supported by the ily attacked by analyzing ELM327s, which are commercial Korea Agency for Infrastructure Technology Advancement products that send vehicle information to smartphones and (Project no. 17TLRP-B117133-02). examining Android apps in the app market (Google Play store) that show vehicle information to owners afer receiving the information. As countermeasures to such attacks, app References obfuscation and a whitelist-based frewall were proposed. For [1] A. Saad and U. Weinmann, “Automotive Sofware Engineering the attack method used in this study, commercial products and Concepts,” GI Jahrestagung,vol.34,pp.318-319,2003. that are now sold or distributed were used. Because anyone [2] E. Nickel, “IBM Automotive Sofware Foundry,” in Proc. Conf. can repackage Android apps if they have abilities to develop Comput.Sci.inAutomot.Ind., Frankfurt University, Frankfurt, andanalyzethem,thishaslargerippleefectsinterms Germany, 2003. of reality and risk. Te more the 5G is expanded, the [3] J. Greenough, “Te Connected Car Report: Forecasts, more the connection between the vehicle and the external Competing Technologies, and Leading Manufacturers,” Busi- device is increased; thus, the risk of cybersecurity will be ness Insider, 2016, http://www.businessinsider.com/connect- increased. Terefore, multistep, multifaceted security mea- ed-car-forecasts-top-manufacturers-leading-car-makers-2015-3. sures are required to attain security against realistic threats. [4]S.Pandi,F.H.P.Fitzek,C.Lehmannetal.,“Jointdesignof Inthecaseofthefrewallproposedinthisstudy,aseparate communication and control for connected cars in 5G commu- device that operates based on a whitelist is attached to nication systems,” in Proceedings of the 2016 IEEE Globecom an afermarket device. However, car makers must install Workshops, GC Wkshps 2016, December 2016. the security feature in the OBD-II interface, because we [5]J.Ni,X.Lin,andX.S.Shen,“EfcientandSecureService- cannot realistically depend on an external device to counter oriented Authentication Supporting Network Slicing for 5G- threats to the vehicle itself. Furthermore, from the Cyber enabled IoT,” IEEE Journal on Selected Areas in Communica- Kill Chain perspective, the security measures proposed in tions,vol.36,no.3,pp.644–657,2018. this study are countermeasures in terms of “reconnaissance” [6] S. Woo, H. J. Jo, and D. H. Lee, “A Practical Wireless Attack on and “command and control.” Te security measures in the the Connected Car and Security Protocol for In-Vehicle CAN,” command and control step are still efective because using IEEE Transactions on Intelligent Transportation Systems,vol.16, Wi-Fi instead of Bluetooth for the communication between no.2,pp.993–1006,2015. theOBD-IIdeviceandsmartphoneapps,usinganother [7] K. Koscher, A. Czeskis, F. Roesner et al., “Experimental Security dongle instead of ELM327, or using iOS instead of Android Analysis of a Modern Automobile,” in Proceedings of the is in the “reconnaissance,” “identifcation,” or “vulnerability” 2010 IEEE Symposium on Security and Privacy, pp. 447–462, steps. In the “Actions on Objectives” step, the attacks can be Oakland, CA, USA, May 2010. detected and defended using IDS. Research on this topic is [8] C. Miller and C. Valasek, “Demo: Adventures in Automotive lef for future studies. Networks and Control Units,” in Proceedings of the DEFCON 2013,2013. [9] J. H. Lee, S. Woo, S. Y. Lee, and D. H. Lee, “APractical Attack on Data Availability In-Vehicle Network Using Repackaging Android Applications,” Journal of the Korea Institute of Information Security and No data were used to support this study. Cryptology,vol.26,no.3,pp.679–691,2016. Wireless Communications and Mobile Computing 13

[10]S.Karmar,“DriveItLikeYouHackedIt:NewAttacksandTools [28] T. Hoppe and J. Dittman, “Snifng/replay attacks on can buses: to Wirelessly Steal Cars,”in Proceedings of the DEFCON 23,2015. A simulated attack on the electric window lif classifed using an [11] M. Kuzin and V. Chebyshev, “Hey Android, Where is My Car?” adapted cert taxonomy,”in Conf. Embedded Syst. Security,pp.1– in Proceedings of the RSA Conference 2017,2017. 6, 2007. [12] P. Kleberger, T. Olovsson, and E. Jonsson, “Security aspects of [29] D. K. Nilsson, U. E. Larson, and E. Jonsson, “Efcient In- the in-vehicle network in the connected car,” in Proceedings of Vehicle Delayed Data Authentication Based on Compound the 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 528–533, Message Authentication Codes,”in Proceedings of the 2008 IEEE Baden-Baden, Germany, June 2011. 68th Vehicular Technology Conference (VTC 2008-Fall),pp.1–5, [13] D. K. Nilsson, U. E. Larson, and E. Jonsson, “Creating a Secure Calgary, Canada, September 2008. Infrastructure for Wireless Diagnostics and Sofware Updates in Vehicles,” in Proc. Conf. Comput. Saf., Reliab., and Secur.,pp. 207–220, Newcastle upon Tyne, UK, 2008. [14] A. Tahat, A. Said, F. Jaouni, and W. Qadamani, “Android- based universal vehicle diagnostic and tracking system,” in Proceedings of the 2012 IEEE 16th International Symposium on Consumer Electronics (ISCE 2012), pp. 137–143, Harrisburg, Pa, USA, June 2012. [15] Te International Organization for Standardization (ISO), “Te International Organization for Standardization (ISO),” Te International Organization for Standardization standard 15765- 4:2016, 2016. [16] Society of Automotive Engineers (SAE) International, “Rec- ommended Service Procedure for the Containment of Cfc-12,” Society of Automotive Engineers standard J1989, 2011. [17] S. Chen, J. Pan, and K. Lu, “Driving Behavior Analysis Based on Vehicle OBD Information and AdaBoost Algorithms,” in Proceedings of the in International MultiConference of Engineers and Computer Scientists (IMECS) 2015 Vol I, pp. 102–106, 2015. [18]C.Furmanczyk,D.Nufer,B.Sandonaetal.,Integrating ODB- II, Android, and Google App Engine to Decrease Emissions and Improve Driving Habits, http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.469.359&rep=rep1&type=pdf. [19] Lockheed Martin Corporation, “Seven Ways to Apply the Cyber Kill Chain5 with a Treat Intelligence Platform,” white paper, 2015. [20] Lockheed Martin Corporation, “Gaining the Advantage: Apply- ing Cyber Kill Chain5 Methodology to Network Defense,” Guide, 2015. [21] T. Hoppe and J. Dittman, “Snifng/Replay Attacks on CAN Buses: A Simulated Attack on the Electric Window Lif Clas- sifed Using an Adapted CERT Taxonomy,” in Proc. Conf. on Embed. Syst,2007. [22] T. Hoppe, S. Kiltz, and J. Dittmann, “Security threats to automotive CAN networks—Practical examples and selected short-term countermeasures,” Reliability Engineering & System Safety,vol.96,no.1,pp.11–25,2011. [23] 2017, https://sites.google.com/team-aegis.com/webpage/pub- lish data. [24] F. Palmieri, U. Fiore, and A. Castiglione, “Automatic security assessment for next generation wireless mobile networks,” Mobile Information Systems – Emerging Wireless and Mobile Technologies, vol. 7, no. 3, Article ID 404328, pp. 217–239, 2011. [25] GitHub repository, 2017,https://github.com/song4jang/ODB-S- app repackaging-. [26] Y. Piao, J. Jung, and J. H. Yi, “Structural and functional analyses of proguard obfuscation tool,” Te Journal of Korean Institute of Communications and Information Sciences,vol.38,no.8,pp. 654–662, 2013. [27] M. Wolf, A. Weimerskirch, and T. Wollinger, “State of the art: Embedding security in vehicles,” EURASIP Journal on Embedded Systems,vol.2017,2007. Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 7943586, 13 pages https://doi.org/10.1155/2018/7943586

Research Article Automatically Traceback RDP-Based Targeted Ransomware Attacks

ZiHan Wang ,1 ChaoGe Liu ,1 Jing Qiu ,2 ZhiHong Tian ,2 Xiang Cui,2 and Shen Su2

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 2Cyberspace Institute of Advanced Technology Guangzhou University, Guangzhou, China

Correspondence should be addressed to ChaoGe Liu; [email protected], Jing Qiu; [email protected], and ZhiHong Tian; [email protected]

Received 13 July 2018; Revised 24 October 2018; Accepted 22 November 2018; Published 6 December 2018

Guest Editor: Vishal Sharma

Copyright © 2018 ZiHan Wang et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

While various ransomware defense systems have been proposed to deal with traditional randomly-spread ransomware attacks (based on their unique high-noisy behaviors at hosts and on networks), none of them considered ransomware attacks precisely aiming at specifc hosts, e.g., using the common Remote Desktop Protocol (RDP). To address this problem, we propose a systematic method to fght such specifcally targeted ransomware by trapping attackers via a network deception environment and then using traceback techniques to identify attack sources. In particular, we developed various monitors in the proposed deception environment to gather traceable clues about attackers, and we further design an analysis system that automatically extracts and analyze the collected clues. Our evaluations show that the proposed method can trap the adversary in the deception environment and signifcantly improve the efciency of clue analysis. Furthermore, it also helps us trace back RDP-based ransomware attackers and ransomware makers in the practical applications.

1. Introduction Because traditional ransomware was typically spread randomly without specifc targets via network scanning or Ransomware was frst emerged in late 1980s [1, 2] and has host probing, they can be easily detected by monitoring of resurfaced since 2013 [3]. Recently, several wide-spread ran- the abnormal behaviors in host activities such as fle system somware attacks have caused signifcant damages on a large operations and network trafc [1, 3, 8]. Recently, more and number of user systems and businesses on the Internet. more ransomware attacks aimed at specifc targets. Kaspersky Symantec reported a 250% increase in new crypto ran- Security Bulletin indicated that targeted attacks have become somware families between 2013 and 2014 [2]. In May 2017, one of the main propagation methods for several widespread WannaCry spread across more than 150 countries and ransomwarefamiliesin2017[9,10].Forinstance,anattacker 200,000 computers in just a few days, and severely disrupted using Crysis ransomware frst logs in a victim’s host and many businesses and personal systems [4, 5]. In addition, spreads itself via a brute force attack on the common Remote specifcally targeted ransomware like Crysis disrupted many Desktop Protocol (RDP). Such a targeted ransomware attack small and large enterprises across the globe; e.g., Trend Micro usually has a clear command-and-control structure and observed that the Crysis family specifcally targeted busi- aimed at resource exploitation and resource thef on these nesses in Australia and New Zealand in September 2016. Te targets, while generating fairly limited noisy on hosts and number of such targeted ransomware attacks was doubled in networkswhichishardtodetect. January 2017, compared with in late 2016 [6]. What is more, Existing ransomware defense methods (designed for the lack of focus on security has lef IoT (Internet of Tings) dealing with randomly-spread attacks) usually protect a host devices vulnerable, which has been the target of 10% of all by blocking the spreading of ransomware attacks (in nearly ransomware attacks. Researcher predicts IoT ransomware real-time) based on the signatures generated by ransomware attacks being likely to increase to around 25% to 30% of all detection solutions. However, because of the diferent char- ransomware cases [7]. acteristic of targeted ransomware attacks with less notable 2 Wireless Communications and Mobile Computing patterns, these traditional blocking-based defense systems method. In Section 4, we present the implementation of our become much less efective for these targeted attacks. deception environment prototype. In Section 5, we describe To address this issue, we propose to utilize advanced thedetailsoftheclueanalysissystem.InSection6,wediscuss defense schemes to protect important hosts under targeted the evaluation setup and results. We conclude this paper in ransomware attacks. In this paper, we utilize the cyber decep- Section 7. tion technology to help us protect critical systems through attack guidance, by drawing attackers away from these 2. Background and Related Work protected systems. While the cyber deception technology helps us protect important targets (such as in dealing with 2.1. Related Work on Ransomware Defense. Ransomware is the Advanced Persistent Treat (APT) [11, 12]), it cannot a type of malware which manipulates an user system to help us traceback attack sources. To address this issue, we extort money. It operates in many diferent ways, e.g., simply further design specifc techniques to traceback RDP-based locking a user’s desktop or encrypting an entire fle system. ransomware attacks and identify the original attack sources Recent rampant ransomware attacks have called for efective as the main deterrence of ransomware attackers. ransomware defense solutions. In the studies that tackle Our deception environment simulates an actual user sys- ransomware counteraction, several solutions are proposed to tem in three layers with multiple monitors to observe various confront this attack [14, 15]. key system operations, related to login, network communica- Some of these solutions are proposed to deal with all types tion, clipboard, process, shared folder, and fle system. It col- of ransomware [1, 16–20]. For example, Kharraz presented a lectstraceablecluesandhelpsusdetecttheRDPransomware dynamic analysis system called UNVEIL. Te system analyzes attack. Because traditional traceback methods usually require and detects ransomware attacks by modeling ransomware security experts to manually analyze a large amount of behaviors. It focuses on the observation of three elements, collected clues, it is difcult for make them to achieve fast namely, I/O data bufer entropy, access patterns and fle responses. Terefore, we develop an automatic analysis system system activities [1]. Moreover, some others are type-specifc to work on traceable clues by taking advantage of natural solutions that deal with only one type, such as crypto- language processing and machine learning techniques. ransomware [21–25]. For example, Scaife presented an early- To evaluate our system, we invite 122 volunteers in a sim- warning detection system that alerts users during suspicious ulated RDP-based ransomware attack. Te proposed system fle activities [21]. Utilizing a set of behavior indicators, the was able to capture traceable clues through the proposed detection system can halt a process that appears to tamper deception environment. It can also automatically analyze with a large amount of user data. Furthermore, it is claimed the clues efectively. Te convergence rate of the analysis that the system can stop a ransomware execution with a system reaches about 98%. Moreover, we demonstrated that median loss of only 10 fles. Similarly, some studies tackle the it helps us traceback RDP-based attack sources in practical detection of specifc ransomware families only [26–28]. For applications. example, Maltester is a family-specifc technique proposed In summary, this paper makes the following contribu- by Cabaj to detect Cryptowall infections [27]. It employs tions: dynamic analysis along with technology to analyze the network behavior and detect the infection chain. (i) We propose a systematic method to deter RDP-based Tese solutions can be categorized into prevention and ransomware by identifying attackers, which traps ran- detection. However, these two kinds of countermeasures have somware attackers via a cyberdeception environment the following disadvantages. First, many prevention measures and uses an automatic analysis system to obtain require many services to be disabled, which is likely to afect traceable clues and identify attack sources. service functionality. For example, Prakash suggested several (ii) We build a deception environment to trap RDP- prevention measures including disabled macros in ofce based ransomware attacker, by simulating an user documents, and restricted access permissions on “Temp” environment in three layers: a network layer, a host and “Appdata” folders [29]. Secondly, the detecting system layer and a fle system layer. Te environment helps is ofen difcult to conceal itself and perform its functions us discover attacker behaviors and collects attacker- when against ransomware attacks that precisely aiming at related information. specifc hosts, e.g., using the common Remote Desktop (iii) We develop an automatic analysis system with natu- Protocol (RDP). Finally, while these countermeasures can ral language processing and machine learning tech- be used to detect or block specifc ransomware attacks, they niques to automatically recognize efective clues for cannot fundamentally inhibit the spread of ransomware. tracing back ransomware attack sources. But traceability technology can fundamentally inhibit the ransomware spread by traceback to attack sources. (iv) We designed two practical experiments to test RDP- based ransomware attacks and ransomware makers, 2.2. RDP-Based Ransomware Attacks. Traditional ransom- and demonstrated the feasibility of the proposed ware randomly spreads across the Internet, in executable system. fles, development kits, macro fles, and other malicious Te remainder of the paper is structured as follows. In programs on a large scale with various dissemination meth- Section 2, we briefy present background and related work. ods, including phishing emails, puddle attacks, vulnerability In Section 3, we describe the methodology of our systematic attacks, server intrusion, and supply chain pollution. Tey Wireless Communications and Mobile Computing 3

①② ③④⑤ Ensnare to the Ransom deception Monitor -ware YES Extraction Analysis Result environment Detect

No

Figure 1: Data collection and analysis process of the whole prototype. use diferent ways to trick a victim to launch such programs. for APT attacks, ransomware attacks, intranet attacks and Among these dissemination method, phishing emails is the other threats defense. A Gartner report in 2015 [30] pointed most widely used. However, according to Kaspersky's 2017 out the market prospect of deception-based security defense ransomware report, the number of targeted ransomware technology and predicted that 10% of organizations will use attacks based on RDP is growing rapidly. deception tools (or tactics) to counter cyber-attacks in 2018. Recently, more and more ransomware criminals have Compared with the traditional passive defense approach, spread ransomware using RDP services and then installed cyber deception technology is an active defense approach ransomware manually. Tese attackers use a brute-force and can be applied to all stages of network attacks. We can method to acquire usernames and passwords on a target use this technology to trap the RDP-based attacker, detect machine with an active RDP service [9]. For instance, one targeted attacks, and deter ransomware attackers by precisely of the typical families, Crysis, a copycat of Locky, not only identifying them. aims at common businesses but also targets healthcare service providers [9]. Crysis gains access to admin level privileges Trap RDP-Based Ransomware Attackers. A targeted ran- by stealing passwords and credentials. In addition, during somware attack generally has three steps: detection, infltra- an RDP session, the attacker uses both clipboard and shared tion, and execution [31]. However, traditional security solu- folders to upload fles to a remote host. And then, attackers tions are unable to cope with the internal translation phase. In can installed ransomware manually. addition, traditional honeypot technology (ofen used to fght network attacks) generally does not focus on tracing back to 2.3. Cyber Deception Technology. Because many critical sys- attackers. However, cyber deception technology can deceive tems are known and always on, it is difcult to protect them the attacker into a surveillance environment and consume his from potential network attacks. time and energy with bait information. (i) Attackers can use zero-day vulnerabilities, highly Detect RDP-Based Ransomware Attacks.Oncetheattacker antagonistic malicious code, or other resources to obtains the correct username and password combination, he break the defense system. usually returns multiple times within a short period to try and (ii) Because humans are always the weakest link in infect the compromised host [6]. In one particular case, Crysis defense systems, attackers can use social engineering was deployed six times on an endpoint within a span of 10 to identify system weaknesses and penetrate the minutes. As a result, by monitoring in the cyber deception defense. environment, we can detect RDP-based ransomware attacks in time and determine the attacker’s behavior through the (iii) Attackers can repeatedly explore the potential vulner- environment monitor. abilities on a target system to identify its weaknesses. However, when an attacker aims at a specifc target, Deter the Ransomware Attacker. Deterring ransomware e.g., exploiting its RDP service, traditional passive defense attackers can be approached in two ways. First, if an attacker methods cannot be usually less efective. Terefore, we need realizes he is entrapped, it becomes a deterrent. Second, if to use advanced active solutions to deal with such attacks with the attacker is exposed to the deception environment and less observable features, such as cyber deception. remains within the perspective of the defense surveillance, Te earlier use of cyber deception technology is honey- the monitor can collect the attacker’s traceable clues that are pot. Honeypot detects attacks by deploying a series of systems accidentally released by the attacker (e.g., IP address, path, or resources in the service network that do not have real nickname, strings). Te exposure of these clues, hidden from business. When a trap is accessed, it represents an attack. attackers, can be a powerful deterrent to other attackers. Honeypot system generally waiting attacks passively, and does not have the role of misleading and confusing attack- 3. Methodology ers. What’s more, the honeypot system does not have real business, and does not have high interactive characteristics, In this section, we describe our method of tracing back RDP- which may easy to be identifed by attackers. Compared with based ransomware attackers. Figure 1 summarizes the data the traditional honeypot system, the cyber deception system collection and analysis process of the entire prototype. First, can be deployed more conveniently, the cyber deception we implement a deception environment to trap attackers. Sec- environment is more real, and can be linked with existing ond, we monitor RDP-based ransomware attacks and collect defense products. It can provide more efective solutions information when they occur. Tird, we extract efective clues 4 Wireless Communications and Mobile Computing from the monitor information. Fourth, we use automatic Te File Layer Monitor.Bymonitoringtheflelayer,wecan analysis to screen a large number of clues for tracing back identify ransomware attacks by fle changes. Furthermore, it the attacker. Finally, we will generate a report to traceback can gather local traceable information by monitoring fles the RDP-based ransomware attacker. We refer readers to in the shared folder. For instance, as a shared folder on the Sections 4 and 5 for the detailed implementation of this proto- remote PC is always used to transfer ransomware from the type. attacker machine during the RDP session. In addition, for a more convenient and quick attack, the attacker ofen mounts 3.1. Deception Environment. Generally, the ransomware the entire local disk to the remote computer. As a result, attack execution stage has two steps: login and spread [31]. through the monitor of the shared folder, we can detect the To build a deception environment is nontrivial in practice newly-added shared folders in real time and capture a large because it must make the ransomware attacker believe that it amount of path information automatically. belongs to a real user and the user data is worthy to attack. Because advanced attackers always exploit static features 3.3. Clue Extraction. Trough environmental monitors, it based on certain analysis systems before they launch attacks can gather a lot of information lef by attackers, such as [32], an intuitive approach to address such reconnaissance login information, communication information, clipboard attacks is to build the user environment in such a way that content, folder path, and portable execution (PE) fle Many the user data is valid, real, and nondeterministic. In addition, traceable clues can be extracted here, including but not the environment serves as an “enticing target” to encourage restricted to IP address, keyboard layout, compile path, and ransomware attackers. We elaborate on how to generates fle path. In order to analyze these clues quickly, we divided an artifcial, realistic, and enticing user environment for the them into two categories: string clues and path clues. Tese RDP-based ransomware in Section 4. clues are then submitted to the automatic analysis system. We TeRDP-basedattackerscommonlyuploadmalicious will elaborate the types of clues that the proposed system can programs in the following ways before spread ransomware: extract in Section 5.1. (1) Te attacker downloads malicious programs on the Internet; (2) programs are transferred through FTP, SCP, or 3.4. Automatic Analysis. According to our investigation, cur- other transport protocols; (3) programs are uploaded through rent traceback tools mostly analyze clues manually. However, the clipboard; (4) programs are uploaded through a shared we usually have to deal with a large amount clues with folder. Te clipboard and shared folders are most commonly no semantic correlation. Because such manual traceback used to transfer programs by RDP-based ransomware attacks analysis usually takes a lot time and eforts, we propose an because they are simple and convenient. However, both are automatic analysis system and we will elaborate on how to easy to monitor by our proposed system. analyze clues automatically in Section 5.2.

3.2. Environment Monitor. In order to avoid the attacker's 4. Implementation of the observation and collect more attacker’s information, a shared Deception Environment folder and clipboard on the remote PC are always used to transfer ransomware programs from the attacker machine As the Windows platform is the main target of ransomware, afer the attacker logs in to the environment. Tis paper we choose Windows as the proof of concept implementation. proposes three monitor layers: the network layer, the host In this section, we describe the implementation details of a layer and the fle layer. We elaborate on how to confgure the Windows-based deception environment prototype. It elabo- monitor system for the deception environment in Section 4. rates on how the deception environment traps ransomware attackers, how the monitor detects the RDP-based attack and Te Network Layer Monitor. Te network layer monitor collects traceable clues. Te entire system implementation detects a remote connection and collects information includ- processisshowninFigure2. ing the remote IP addresses, remote ports, status codes of ports, keyboard layout and so on. When the RDP-based 4.1. At the Network Layer attacker logs in to the host, the monitor can obtain informa- tion and detect the attack without the attacker's knowledge. 4.1.1. Te Login Monitor. Te login monitor is used to detect attacks in real time and collect the attacker’s login informa- Te Host Layer Monitor. We propose to detect changes such tion. On the Windows platform, Win32 is an environment as processes and clipboards, by monitoring the host layer. subsystem that provides an API for operating system services Te host layer monitor can gather information about the and functions to control all user inputs and outputs. Te attacker's behavior and their use of these system applications login monitor relies on Windows APIs to gain access to the in the deception environment. For instance, as the clipboard system and run with privileges to access their own areas is in the system-level heap space, any application in the system of memory. It uses Winsock 2.0 to get access to networks has access to it. Te RDP-based ransomware always takes and uses protocols other than the TCP/IP suite. Te login advantage of the clipboard to interact between applications. monitor takes network requests and sends those requests to Moreover,itmightgettheclueslefbytheattackersusingthe the Winsock 2.0 SPI (Service Provider Interface) by calling clipboard locally, as the Windows system shares the clipboard the main Winsock 2.0 fle Ws2 32.dll. It provides access to by default during the RDP session. transport service providers and namespace providers. Te Wireless Communications and Mobile Computing 5

2. Environment Monitor 3. Clue Extraction

4. Automated Analysis Login Monitor Login Clues

Communication Users and Programs 1. Construct Deception Remote Host Clues Analysis 5. Result Network Layer Monitor Environment Network Address Clipboard Monitor Clipboard Clues Cyber Deception Analysis Auxiliary Host Layer Environment Traceable Language Process Monitor Compile Clues Clues Identifcation File Layer Shared Folder Monitor Path Clues Traceable Strings Identifcation File Monitor Ransomware detect

Figure 2: RDP-based ransomware attack traceback system process.

IP Helper API makes it possible to get and modify network of the Clipboard Viewer Chain. Afer each Clipboard Viewer confguration settings for the localhost. It consists of the window responds to and processes the message, it must send DLL fle iphlpapi.dll and includes functions that can retrieve the message to the next window according to the handle information about the protocols, such as TCP and UDP [33]. of the next window in its saved linked list. Te clipboard As a result, the login monitor can directly access the data monitor can obtain the clipboard's new contents by using bufers involved in transmission control protocols, routing the “GetClipboardData” Windows API through the window. tables, network interfaces, and network protocol statistics. When a subsequent copy or cut operation are executed, the data in the clipboard are rewritten. As a result, the 4.1.2. Te Communication Monitor. Te communication clipboard monitor guarantees real-time listening and writes monitor is responsible for captures network trafc. By locat- real-time information to the log fle. Te log fle is updated ing the TCP packets in the network trafc which contain whenever the clipboard monitor receives a clipboard change the interactive confguration information of a RDP login, the notice. When the log fle is updated, to prevent it from being RDP connection information can be obtained as the attacker’s detected by the attacker or being encrypted by ransomware, personal information. Te main packet characteristics: (1) the monitor sends it to a secure host and completely erases it Te packet’s name is ofen ClientData. (2) Te packet location from the environment. is usually afer the TCP three-way handshake. (3) Te data packet position in the front. (4) Te amount of data is 4.2.3. Te Process Monitor. To run a program on a Windows signifcantly larger. system, a new process must be created. Te monitor gets the PE fle run by the attacker by monitoring the environment 4.2. At Host Layer process. It frst records the state of processes commonly used in the deception environments before a RDP-based 4.2.1. Te Deception Host. According to our observation, ransomware attack. Afer the login monitor detects such an the main methods used by attackers to login a remote host attack, it monitors the system for newly created processes are: (1) weak password direct login; (2) add access account in the environment through the Windows API. “CreateTool- login through vulnerability. As a result, we deliberately set help32Snapshot” takes status snapshots of all processes in real the administrator privileges of the deception environment time, which includes the process identifer (PID). When a as weak passwords and leave common vulnerabilities in the suspicious process has started, the monitor recognizes it by environment, such as EternalBlue,toattractattackers.To the PID and looks up the process's running path with the guide attackers to upload ransomware using only clipboard help of “GetModuleFileNameEx”. Finally, the monitor fnds and shared folders, we block external trafc transfers from thesuspiciousprogram'sPEflethroughthepathandcopies outside the environment and close common transfer ports it to the secure host. (e.g., port 20, 21, 80, and 443). 4.3. At the File Layer 4.2.2. Te Clipboard Monitor. Te clipboard monitor can obtain clues in real time by monitoring the clipboard’s 4.3.1. Deception Files. Deception fles are constructed with changes. It uses Clipboard Viewer to listen to message two goals. First, we need to make the attacker believe that the changes in the clipboard without afecting its contents. Te deception environment is a real user's host. Second, we need Clipboard Viewer is a mechanism that can get and display to make the attacker believe that there are resources in the the contents of the clipboard. As Windows applications are environment that are worthy of attacking. message-driven, the key to the monitor is responding to and To simulate a realistic environment, we deploy large processing clipboard change messages. When the content numbers of diferent types of fles on it, for example, images, changes, the monitor triggers the WM DRAWCLIPBOARD audio fles, database fles, and documents that can be accessed message and sends the changed message to the frst window in a Windows session. Based on Amin Kharraz’s research [1], 6 Wireless Communications and Mobile Computing we created four fle categories that ransomware always tries noticed by the attacker, thus revealing some unexpected to fnd and encrypt: documents (∗.txt, ∗.doc(x), ∗.ppt(x), traceable clues. Te monitor can access a list of paths to the ∗.xls(x), ∗.pdf and ∗.py), keys and licenses (∗.key, ∗.pem, attacker shared folders. ∗.crt, and ∗.cer), fle archives (∗.zip, ∗.rar), and media As we originally observed, shared folders using Remote (∗.jp(e)g, ∗.mp3, and ∗.avi). We obtain these fles in three Desktop ofen have a path in the remote host with the prefx ways.First,wecreatefleswithvalidheadersandcontent “\\tsclient\”. When the monitor traverses the storage to this using standard libraries (e.g., python-docx, python-pptx, prefx, it uses “FindFirstFile”tofndthefrstfle.Itthenuses pdfit, and OpenSSL). Second, using Google search syntax “FindNextFile” to fnd the next fle with the returned handle. and crawler technology, we download a large number of When the resulting handle is in a folder format, it continues to fles on the Internet. Tird, we collect a number of non- traverse all fles under that folder. Initially, the monitor tries to confdential documents from the hosts of 20 volunteers to get the full fle names and fle contents by traversing the new emulate actual user environments. When we assign user fles shared folder. However, during the actual experimentation, for the deception environment, the path length is generated it is found that as the number of fles in storage grows, the randomly. Each folder may have a set of subfolders randomly. monitor takes far more time and resources to get all the fle For each folder, a subset of extensions is randomly selected. contents than just the fle paths. All traverses are more likely Furthermore, each directory name is generated based on to alert the attacker. Terefore, the monitor only obtains the meaningful words. Consequently, we generate paths and fle paths that the attacker shares on its host with the help extensions for user fles, giving them variable fle depth and of “GetFileName”. Moreover, in order to prevent encryption meaningful content. by the ransomware, the mounted disk monitor will directly To make the simulated environment more valuable, we transfer the acquired shared fle path list to another secure deploy bait information on it, such as database, false code host. comments, digital certifcates, administrator password, SSH keys, VPN keys, browser history passwords, ARP records, and DNS records. When the bait information is obtained by an 5. Clue Extraction and Analysis attacker, it may trick it to attack the deception environment. Trough the deception environment, we trap the ransomware 4.3.2. Te File Monitor. Te fle monitor can detect the attacker and collect a lot of information that may contain ransomware by monitoring fle type changes and fle entropy many traceable clues. However, such traceable clues are changes,whichmethodisproposedin2016[21].Tetype ofen not visually observable and are complex in nature. In of data stored in a fle can describe the order and position addition, many of the above clues contain information that is of specifc byte values unique to a fle type. Since fles gen- not helpful in tracing back ransomware attackers. Terefore, erally retain their fle types and formatting in the deception in order to assist in the analysis of the monitor information environment, the bulk modifcation of such fles should be and extract the efective main traceable clues, in this section considered suspicious. When the monitor sees this type of we propose how to extract clues and how to analyze traceable changes, we can infer that a ransomware attack has occurred. clues using an automatic approach afer extraction. Entropy can express the randomness of each character in a string. Te higher the entropy value, the stronger the 5.1. Clue Extraction. We mainly obtain kinds of clues from randomness. Te Shannon entropy of an array of bytes can the extraction, including remote login information (IP be computed as the sum: addresses), network trafc, clipboard contents (pictures and 255 texts), shared folder information (path strings), and ran- 1 somware samples (compile time and compile paths). Te �=∑�� log (1) � 2 � �=0 �� shared folder path clues can be obtained directly from the monitor. However, clipboard clues, compile clues, and remote � =�/���������� � for �� � and �,thenumberofinstancesofbyte host clues are ofen not visually observable and are complex value � in the array. As the entropy value is represented by a in nature. As a result, the extraction module mainly focuses number from 0 to 8, the entropy value of 8 represents the byte on the extraction of remote host clues, clipboard string clues, array composition of its completely uniform distribution. and compilation clues. Since the probability of each byte occurring in the encrypted ciphertext is basically the same, the entropy value will be close Remote Host Clue Extraction. Te IP address, port numbers, to the upper limit. Because the ransomware always encrypts and folder path clues can be directly obtained from the login a large number of fles, when we detect that a fle change to information and folder paths. TCP packets that interact with a high entropy value fle in a short period of time and also the confguration information in a network communication change the fle type, we assume that the fle is subject to a PCAP package are usually named “ClientData”. We extract ransomware attack. theclientnamefeldfromthe“ClientData”packettoobtain the attacker’s hostname. In addition, the KeyboardLayout 4.3.3. Te Shared Folder Monitor. By traversing the disk feld indicates the default keyboard layout for an attack- storage in real time, the shared folder monitor discovers the ing host; e.g., the Chinese Simplifed layout number is updates of the shared folders in real time. It obtains the 0x0004, and the American English keyboard layout number contents of the attacker's fles locally, which are ofen not is 0x0409. Te remote user's idiomatic language (the mother Wireless Communications and Mobile Computing 7 tongue) can be found by the keyboard layout to infer the number is always located in the “\QQ\QQfle\”folder(e.g., attacker's nationality. D:\QQ\QQfle\86∗∗∗∗∗∗086\FileRecv\...). In this way, the analysis system can quickly and accurately Clipboard Clue Extraction. Te main fle formats available get host names, email accounts, program names, social to the clipboard monitor are Windows Bitmap, GDI fle, sofware numbers, and other traceable clues that are carried ANSI characters, Unicode characters, and WAV audio data. in the attacker's fle paths. However, such user information We mainly aim at extracting the traceable clues of character for less of the overall clues is acquired by chance. Terefore, types. It extracts character clues from the clipboard in various we will conduct further analysis on this basis. formats by judging the GetClipboardData API’s “DataType” value. 5.2.2. Account Analysis. Trough the analysis of APT1 and some other attribution reports, we fnd that the mapping Compilation Clue Extraction.ForallWindowsRDP-based between the attacker and the physical world identity can be ransomware samples that we examined, we empirically better obtained by analyzing the account number lef by the observed that the most commonly used formats for these ∗ ∗ attacker. Tis information includes but is not limited to the samples are the PE fle, especially .exe and .dll; some PE location of the IP address, the spelling and registration of the fles have compilation information in the fle, and this infor- domain name, the URL corresponding IP address, and the mation does not change with the migration of the programs. domain name of the mailbox account. Because it is difcult As a result, it is a good way to obtain the creator’s information. to identify this information efectively in a large number A PE fle mainly consists of fve major components: DOS of strings and path clues, the analysis system automatically MZ header, DOS stub, PE header, section headers, and identifes the IP address, domain name, URL, and mailbox section content. Each component contains a great deal of account by regular matching. Ten, with the help of threat information. Tere is very little information that we can use intelligence and big data technology, more relevant clues are to identify the creator, and some identifcation information obtained. needs to be extracted from the content of each section. In this paper, there are many clues in the PE fles that can 5.2.3. Language Identifcation. User languages ofen help to be extracted to trace back the attacker: fle name, PE fle determine an attacker's idiomatic language, but because of type, compiler version, compilation path, compile time, last a large number of languages in diferent countries and modifed time, last open time, IP address, URL, domain the high similarity of some languages, we use automatic name, language, string, wide character, and so on. We extract analysis systems to identify the language of clues. We test most of the clues with PEView [34]. However, since it cannot the accuracy of two language identifcation toolkits using directly obtain the compilation path, we use the pefle [35] entire path information in four diferent languages. We have tool to extract paths by locating in the PE structure. found “langid.py” toolkit to be overall more accurate than Since the extraction clues include diferent encoding “langdetect”toolkit.Tecomparisonresultsareshownin formats, to facilitate observation and the unifed mode of Figure 3. Te langid.py is a language identifcation toolkit subsequent analysis, the extraction system completely con- developed by Lui and Baldwin at the University of Melbourne verts the data encoding obtained into the UTF-8 format and [36]. It combines a naive Bayes classifer with cross-domain saved in the SQLite database. Before submitted to the analysis feature selection to provide domain-independent language system, the extracted clued are divided into two categories: identifcation. string clues and path clues. 5.2.4. Traceable Strings Identifcation. When traceable strings 5.2. Automatic Analysis. At this point, the clues extracted are needed, traditional string analysis methods usually use from this system mainly include string clues and path clues. Named Entity Recognition (NER). However, the clues to Te number of string clues is large, mixed with a large number be analyzed mainly include strings and path. Strings before of unidentifable strings. Because the number of path clues and afer the path separator have few semantic correlations. is also very large with no semantic correlation, it is difcult What is more, the string between the path separators and to identify traceable clues manually. So we focus on how to the remaining strings to be analyzed are mostly semantically automate the identifcation of traceable clues for path clues. unrelated due to their limited length. So this paper proposes an algorithm, which can quickly and automatically analyze 5.2.1. Users and Programs Analysis. In the analysis of the the traceable strings in strings and path. path clues, we frst propose to obtain the attacker clues by In order to flter out meaningful traceable clues related to identifying the features of the context-related segmentation the attacker’s identity, the path clues and string clues are split on the same path. For instance, each user has a separate into strings and identifed by common words and gibberish user folder, and it is located in the “Users” folder under in the following steps. Figure 4 shows the automatic traceable Drive C. As a result, the system can obtain the user name clues identifcation system process. at the attacking host by obtaining the folder name under the “Users” folder (e.g., C:\Users\Dell\...). Te “Program fles” Make Stop Words. Te system splits the path by the path folder usually contains the name of the sofware program delimiter, as these separated path strings that are common installed on the machine (e.g., C:\Program Files\Microsof to multiple computers have no identifying efect. So, we take Visual Studio 11.0\...). What is more, the QQ account out the fle string names that are common to 20 normal user 8 Wireless Communications and Mobile Computing

Common Words Removal. Afer segmentation, the system 117 116 107

103 obtains a lot of repeatable strings. But most of them are 99

88 commonly used words and does not help in identifying 73

66 ransomware attackers. As a result, we flter out the common words by comparing each string with a dictionary. If the 44 string can be matched in a dictionary, the system marks the 27 12

11 common word property of the word as false, otherwise, as 5 1 1 0 true. ENGLISH CHINESE RUSSIAN KOREAN Gibberish Detect. Based on our analysis and observation, we langid toolkit correct recognition found that there is a large amount of garbled information langdetect toolkit correct recognition in path clues. Most of these strings are generated randomly, langid toolkit incorrect recognition and people can recognize this randomness manually, but it langdetect toolkit incorrect recognition is difcult for a program to recognize it automatically. As a result, we propose detecting the gibberish by training the Figure 3: Accuracy of Two Language Identifcation Tools: In tests Markov chain model with English texts. For each string X, using the same known language path clues, the accuracy of the langid there is a probability that the i character in X is toolkit for English, Chinese, Russian and Korean was 88%, 100%, 99%, 99.0% respectively, while the accuracy of the langdet toolkit P {��+1 =�|�1 =�1,�2 =�2,...,�� =��} was73%,0.85%,40%,and4.60%,respectively. (2) =�{��+1 =�|�� =��}

String Path Each letter is related to the upper and lower two letters. Clues Clues Tis relationship can be expressed by 2-gram. For example, ... Path Split with regard to the string “Ransomware”: Ra, an, ns, e[space]. Te analysis system can record how ofen the characters appear next to each other, through the collection of gibberish and some commonly used normal phrases or vocabulary. It Traceable Strings Stop normalizes the counts, afer reading through the training data Identifcation Words and then measures the probability of generating the string Sentence based on the digest by multiplying the probability of the pair Tokenize Word Tokenize of adjacent characters in the string [38]. Te training data can help statistics corresponding to gibberish and normal strings, Alphabetic Case respectively, the average transfer probability. Tis probability Characteristics Sign Split then measures the amount of possibilities assigned to this string according to the data as observed by the model. When Strings we test the string “Ransom,” it computes

Common words � � � � � � � � � � detect Gibberish detect ���� [ ������ ]=����[� ][ � ]×����[� ][ � ] (3) � � �� Result ×⋅⋅⋅×����[ � ][ ]

Figure 4: Process of automatic identifying of traceable clues. If the input string is gibberish, it will pass through some pairs with very low counts in the training phase and hence have low probability. Te system then looks at the amount of probability per character for a few known normal strings computers as stop words. Ten, it removes the stop words and a few examples of known gibberish and then picks a from the strings afer each split. threshold between the gibberish’s most possibilities and the normal string’s least possibilities: Segmentation. Afer removing the stop words, we separate sentences and words for each segment of the path with �ℎ���ℎ��� tokenize(e.g.,NLTK,NER)[37].However,asthesegments of the path are not semantically related, the segmentation (��� (����������) + ��� (��������ℎ�����)) (4) = is less efective. Based on our extensive research on path 2 information, users ofen prefer to use symbols such as � � underscores to split fle names. In addition, they also like to When we analyze the string ‘X’, if ����[ � ]>�ℎ���ℎ���, use capitalization in multiword strings to facilitate reading. the analysis system will view the string as normal string. If � � As a result, the system uses the common identifer in the ����[ � ]≤�ℎ���ℎ���, the analysis system will view the document and alphabetic case characteristic to segment the string as ‘False.’Ten, the system removes the strings with the strings again. ‘False’ type as gibberish. Wireless Communications and Mobile Computing 9

Table 1: Traceable strings recognized result [13]. Analysis Data Change Figure

Common Words Gibberish Final 50 String Recognized Recognized Recognized Wh∗∗∗∗ter True True True 40 gandcrab True True True kate z True True True 30 vwrtjty True False False program False True False 20 Data quantity Data

Identifcation Result. Afer the above analysis, the automated 10 analysis system outputs the strings for which both the commonwordtagandthegibberishtagareTrue,whichis the list of traceable clues [13]. Table 1 is reproduced from Z. H. 0 source stopword tokenize gibberish result Wang et al. (2018) [under the Creative Commons Attribution Analysis Process License/]. When the string is a normal word, such as “program”, the volunteer1 volunteer7 common words recognition will make it “False.” Gibberish volunteer2 volunteer8 recognition can recognize the word only with non-randomly volunteer3 volunteer9 generated identifer strings. Moreover, it cannot identify 4 10 the words that do not conform to the spelling patterns of volunteer volunteer common words (e.g., vwrtjty). Te string is considered to volunteer5 volunteer11 have auxiliary traceability only if both of the analysis values volunteer6 volunteer12 are true. We rely extensively on string inversion to verify the Figure 5: Te data volume changes in 12 volunteers' traceable clues accuracy of the system. It is found that most of the traceable through the analysis system. Te screening rate can reach 98.34% strings on the volunteer storage can be recognized. Table 1 [13]. shows the recognition results of several typical strings.

6. Evaluation 6.2. Traceback Attackers. Next, we carried out a detailed In this paper, we evaluate the proposed method with two analysis of the information lef over from one of the volun- experiments. Te goal of the frst experiment is to demon- teers' attacks. By understanding the data from the automatic strate that the proposed method can help trace back to the analysis of an attacker, we can infer that the attacker's RDP-based ransomware attackers and capture their private real identity may have the following characteristics: (1) Te information. Te goal of the second experiment is to demon- attacker is likely to come from China. (2) Te attacker may strate that the method can also automatically recognize clues be a security and related personnel. (3) Te attacker may in ransomware samples and help trace back to ransomware work in the Chinese Academy of Sciences and have relations makers. with a large security manufacturer in China. Te conclusion is shown in Figure 6. It shows the results from the analysis 6.1. RDP-Based Ransomware Attacker system in fve sections.

6.1.1. Clue Capture and Analysis. We used Windows 7 system Username. Trough automatic analysis, a total of three user virtual machines to deploy the deception environment and names from the attacker's host were extracted from a large assess the efectiveness of it. While Windows 7 is not required, number of clues, in which, by using the “Dell” user, it can be it was chosen because of the wide range of applications assumed that the attacker was using the Dell host. and because it is one of the main targets of ransomware. We invited 122 professional volunteers to help with the Programs. Te automatic analysis system identifed several experiment and provided them with an experimental feet of typical sofware programs installed in the attacker's host. 12 virtual hosts. Most of the volunteers’ login information, For instance, “QQ” is a widely used real-time social tool clipboard content, shared folder path, and uploaded PE fle in China; “AliPay” is an online payment tool developed were able to be successfully captured by the monitor system. by Alibaba and widely used in China; “Sogou input” is We choose 12 computers’ path the information collected an input method sofware for Chinese developed by China from volunteers and record the rate of convergence afer Sogou Company; MeiTu is a photo beautifcation sofware each step of the analysis. Figure 5 shows how the number of developed by a Chinese company and widely used in China; traceable clues remains as each step of the data is processed In addition, “Sinfor” and “360” are both well-known security by the analysis system [13]. Figure 5 is reproduced from Z. H. manufacturers in China and have a large number of safety- Wang et al. (2018) [under the Creative Commons Attribution related products; “Visual Studio” is a common development License/public domain]. sofware. It is found that much Chinese-made sofware widely 10 Wireless Communications and Mobile Computing

Admin DELL Default MeiTu Visual ∗∗∗ Studio A te User Sogou am Name Input Wh∗∗∗ freebuf ter 360 ...

∗∗t Traceable Programs Strings Sinfor ... QQ ∗∗e AliPay Keyboard Layout visuma Account ntrag 372∗∗∗ 582 ∗∗∗ CH- 353 SIMPLI 208 FIED ∗∗ ∗∗ @ ∗∗@∗∗ e.ac.cn t.ac.cn

Figure 6: Analyze an attacker using automatic analysis system to display traceable information about the attacker in fve aspects: host user name, program sofware, keyboard layout, account information, and traceable string.

used in China was installed on the attackers' hosts, as well as some development and security products that were installed less ofen by ordinary users.

Keyboard Layout. Trough network communication analysis, we obtain the attacker's keyboard layout is Simplifed Chi- nese, from which it can be inferred that the attacker's native language is probably Chinese. Tis is consistent with the host installed sofware features.

Account Number. According to the system automatically identifed account number, it found two QQ accounts and two e-mail accounts. Te e-mail accounts for the Chinese Academy of Sciences business accounts can confrm the above assumption that the attackers from China, and it is likely to work in the Chinese Academy of Sciences. Trough Figure 7: Te registration infographic of QQ account 372∗∗∗82. the registration information of the QQ account (as shown in Figure 7), we can fnd the following information: (1) the attacker could be a male, aged 32; (2) Internet-related work, security researchers as the main research direction. “∗∗e” likely with 360 Security;(3)locatedinHaidian District, ∗∗ Beijing, China.Inthe“353∗∗∗208” account registration and “ t” are acronyms for departments under the Chinese Academy of Sciences, while “A∗∗∗team”isasecurity information, mail, work, and other information are empty, ∗∗ nicknamed a special English string: “Wh∗∗∗ter”itcanbe research team in the “ e” department. “Visumantrag”is presumed that the account is likely to be a private use. ofen used when processing visas from various countries. Te “Wh∗∗∗ter” matches the nickname of the QQ number ∗∗∗ Traceable Strings. Te automatic analysis system sifs strings “353 208” account and is likely to be its regular nickname. from the monitor that may be traceable. “Freebuf ”isa security information exchange website commonly used by 6.3. Ransomware Maker Automatic Analysis. In order to Chinese security personnel. “Bootnet”isasecondaryattack verify that the analysis system is available in tracing back the method that is ofen used by attackers, and is ofen used by ransomware maker, we obtained more than 8000 ransomware Wireless Communications and Mobile Computing 11

Table 2: Same identifer for diferent samples [13].

Sample Hash Same Identity Strings 68dd613973f8a∗∗∗7befe0f4b461d258ce569b9020079∗∗∗9ae0f090266a60cc fa0c084aef41969∗∗∗1f427a1b65e1b1464fa82d297e712∗∗∗e1fdf312c04481 jqxxppho/aonk/je7z2 4a36da0286d42cd∗∗∗306fa99239c8238821beaf40744be∗∗∗ba862dbc6d90f 70cbc839bcb88∗∗∗7422bbd2fa812dd3de02391b6a9f1∗∗∗0762a6f49cf32501

0.46% 1.14% 3.53% 0.34%

number 7197 878 17.20% 0.57% 2.05% 3.99% 0.34% 0% 100%90%80%70%60%50%40%30%20%10% 6.61% 58.09% Ransomware without a PDB path 0.11% Ransomware with PDB path 0.57%

Figure 8: Of the 8,075 real ransomware samples, 11% were able to 0.46% extract the Program Data Base (PDB) path, while the remaining 89% of authors may have chosen to hide the PDB path during 4.56% compilation. zh hu en mt nl pl nn f sl et samples from VirusTotal [39] and extracted the Program Data nb da Base (PDB) path from the samples (as shown in Figure 8). de es It is shows that, despite the large number of attackers' it deliberate erasure of the compiled information, legacy PDB Figure 9: Of the 878 ransomware sample PDB paths, English path information can still be found from a large number of accounted for 510, and the remaining monitored languages ransomware samples. were Chinese: 31(3.53%), Dutch: 40(4.56%), Norwegian Nynorsk: What is more, the analysis system is used to automatically 4(0.46%),Slovenian:5(0.57%),NorwegianBokm˚al: 1(0.11%), Ger- analyze more than 800 diferent ransomware samples’ PDB man: 58(6.61%), Italian: 3(0.34%) Hungarian: 35(3.99%), Maltese: path. We found multiple identity strings in the analysis 18(2.05%), Polish: 5(0.57%), Finnish: 151(17.20%) Luxembourgish: results of diferent samples. One of the typical results is 3(0.34%), Danish: 10(1.14%), and Spanish: 4(0.46%). shown in Table 2 [13]. Table 2 is reproduced from Z. H. Wang et al. (2018), [under the Creative Commons Attribution License/public domain]. We used VirusTotal to validate ∗∗ the samples and found that they were all Cobra family translated into English) with an MD5 value of “60c6a92af ∗ ransomware. As a result, it could be assumed that these 6d0c16f7”. \\ \\ \\ \\ samples are made by the same ransomware maker. When “E: LIAO STUDIO UAC simulated an improved ver- one sample is traced back, other samples can also arrive at sion of the 360 anti-virus program 1117 Bale ∗ ∗∗ ∗∗ ∗∗ ∗∗ \\ \\ the conclusion of the ransomware maker. When diferent 1 4.1 .2 . 0( 1) 9818 program WriteSystem32 \\ traceable information is analyzed from diferent samples, Release VirtualDesktop.pdb” the identity information of the ransomware maker can be We can infer from the automatic analysis result that the described by integrating the information. ransomware maker is likely from China. We can prove this point by the following: Language Identifcation. When we carry on the language (1) Language Identifcation: Tis is done by identifying recognition to the PDB path of 878 samples, the result is as the path language, which can be used to speculate that the Figure 9. Te system automatically identifes the path which attackermayhavecomefromChina.Byunderstandingthe mainly contains 11 languages, of which English accounts for Chinese strings in the path, we fnd that the meaning of this the largest proportion. text is to improve to bypass the detection of 360, security sofware that has a large market in China. Actual Case Analysis. Te automatic identifcation system (2) IP address: Te system identifed an IP address in the helped us identify the traceable strings in the PDB path. Take PDB path. With the help of threat intelligence database, it is the example of a sample PDB path (the Chinese has been found that the IP came from Xinjiang, China (Figure 10). 12 Wireless Communications and Mobile Computing

Acknowledgments Te authors would like to thank VirusTotal for providing ransomware samples and the volunteers from the University of Chinese Academy of Sciences. Tis work is supported in part by the National Key research and Development Plan (Grant no. 2018YFB0803504) and the National Natural Science Foundation of China (61871140, 61572153, U1636215, 61572492, and 61672020). Figure 10: Partial infographic of IP: 1∗4.1∗∗.2∗∗.∗∗0inthethreat intelligence library. References

[1] A. Kharraz, “UNVEIL: A Large-Scale, Automated Approach (3) Traceable Strings: Te extracted identifer string to Detecting Ransomware,” in Proceedings of the 25th USENIX “LIAO” resembles a Chinese pinyin and is most likely a Chi- Security Symposium (USENIX Security 16),pp.757–772, nese surname. To sum up, we speculate that the ransomware USENIX Association, 2016. maker is most likely from China. [2] K. Savage, P. Coogan, and H. Lau, “Te evolution of ran- somware,” Tech. Rep., 2015. 7. Conclusion [3]A.Kharraz,W.Robertson,D.Balzarotti,L.Bilge,andE. In this paper, we focus on tracing back RDP-based ran- Kirda, “Cutting the Gordian Knot: A Look Under the Hood of somware. Recently, more and more ransomware attackers are Ransomware Attacks,” in Detection of Intrusions and Malware, and Vulnerability Assessment,vol.9148ofLecture Notes in using RDP attacks to spread ransomware with impunity due Computer Science, pp. 3–24, Springer International Publishing, to the use of strong cryptography. We propose tracing back Cham, 2015. the attack sources to deter RDP-based ransomware with the [4] S. Mohurle and M. Patil, “A brief study of Wannacry Treat: proposed deception environment. It collects traceable clues Ransomware Attack 2017,” International Journal of Advanced and performs automatic analysis by using natural language Research in Computer Science,vol.8,no.5,2017. processing and machine learning techniques. Te evaluation [5]X.Yu,Z.Tian,J.Qiu,andF.Jiang,“ADataLeakagePrevention shows that it is able to trap attackers and collect traceable Method Based on the Reduction of Confdential and Context clues lef by attackers in a deception environment. With Terms for Smart Mobile Devices,” Wireless Communications and automatic clue identifcation, it can converge the amount of Mobile Computing,vol.2018,ArticleID5823439,11pages,2018. traceable clues to about 2%. We use this method to analyze [6] J. Yaneza, “Brute Force RDP Attacks Plant CRYSIS Ransom- two practical cases and show its efectiveness. By tracing back ware,” https://blog.trendmicro.com/trendlabs-security-intelli- to ransomware attackers, we can provide a strong deterrent to gence/brute-force-rdp-attacks-plant-crysis-ransomware/. stife the development of ransomware. [7] “10% of Ransomware Attacks on SMBs Targeted IoT Devices,” https://www.darkreading.com/application-security/10–of-ran- Data Availability somware-attacks-on-smbs-targeted-iot-devices-/d/d-id/1329817. Te data used to support the fndings of this study are [8]Q.Tan,Y.Gao,J.Shi,X.Wang,B.Fang,andZ.H.Tian, available from the corresponding author upon request. “Towards a Comprehensive Insight into the Eclipse Attacks of Hidden Services,” IEEE Internet of Tings Journal,2018. [9] “Kaspersky Security Bulletin: STORY OF THE YEAR 2017,” Additional Points 2017, https://media.kasperskycontenthub.com/wp-content/up- Tis manuscript provides additional deception environment loads/sites/43/2018/03/07164824/KSB Story of the Year Ran- monitors and analytical methods by using more volunteers somware FINAL eng.pdf. and samples. It proves the validity of the methods through [10]Z.Tian,Y.Cui,L.Anetal.,“AReal-TimeCorrelationofHost- specifc traceback cases (i.e., traceback the ransomware Level Events in Cyber Range Service for Smart Campus,” IEEE attacker and traceback the ransomware maker). Access,vol.6,pp.35355–35364,2018. [11] R. Ross et al., Managing Information Security Risk: Organi- Disclosure sation, Mission, and Information System View,NationalInsti- tute of Standards and Technology, 2011, http://csrc.nist.gov/ An earlier version of this paper was presented at the Interna- publications/nistpubs/800-39/SP800-39-fnal.pdf. tional Conference: IEEE Tird International Conference on [12]Y.Wang,Z.Tian,H.Zhang,S.Su,andW.Shi,“APrivacy Data Science in Cyberspace, Guangzhou, China, 18-21 June Preserving Scheme for Nearest Neighbor Query,” Sensors,vol. 2018. Te authors’ initial conference paper focused mainly on 18,no.8,p.2440,2018. the efectiveness of deception environment monitors and the [13] Z. H. Wang, X. Wu, C. G. Liu, Q. X. Liu, and J. L. Zhang, screening rate of the analysis system. “RansomTracer: Exploiting Cyber Deception for Ransomware Tracing,” in Proceedings of the IEEE Tird International Confer- Conflicts of Interest ence on Data Science in Cyberspace, pp. 227–234, 2018. [14] B. A. S. Al-rimy, M. A. Maarof, and S. Z. M. Shaid, “Ransomware Te authors declare that they have no conficts of interest. threat success factors, taxonomy, and countermeasures: A Wireless Communications and Mobile Computing 13

survey and research directions,” Computers & Security,vol.74, [28] J. Chen, Z. Tian, X. Cui, L. Yin, and X. Wang, “Trust Architec- pp. 144–166, 2018. ture and Reputation Evaluation for Internet of Tings,” Journal [15] F. Jiang, Y. Fu, B. B. Gupta et al., “Deep Learning based Multi- of Ambient Intelligence & Humanized Computing,vol.2,pp.1–9, channel intelligent attack detection for Data Security,” IEEE 2018. Transactions on Sustainable Computing,2018. [29] C. Le Guernic and A. Legay, “Ransomware and the Legacy [16] N. Andronio, S. Zanero, and F. Maggi, “HelDroid: dissecting Crypto API. Paper presented at the Risks and,” in Risks and and detecting mobile ransomware,” in Research in Attacks, Security of Internet and Systems: 11th International Conference, Intrusions, and Defenses,vol.9404ofLecture Notes in Computer CRiSIS 2016, Roscof, France, 2017. Science,pp.382–404,Springer,2015. [30] L. Pingree, Emerging Technology Analysis: Deception Techniques [17]F.Mercaldo,V.Nardone,A.Santone,andC.A.Visaggio, and Technologies Create Security Technology Business Opportu- “Ransomware Steals Your Phone. Formal Methods Rescue It,” nities,Gartner,2015. in Formal Techniques For Distributed Objects, Components, And [31] “Ransomware and Businesses,” https://www.symantec.com/ Systems: 36th IFIPWG 6.1 International Conference, FORTE content/dam/symantec/docs/security-center/white-papers/ran- 2016, held as part of the 11th International Federated Conference somware-and-businesses-16-en.pdf, 2016. On Distributed Computing Techniques, DisCoTec 2016,E.Albert [32] A. Kharraz and E. Kirda, “Redemption: Real-Time Protection and I. Lanese, Eds., pp. 212–221, Springer International Publish- Against Ransomware at End-Hosts,” in Proceedings of the ing, 2016. International Symposium on Research in Attacks, Intrusions, and [18]D.Sgandurra,L.Mu±oz-Gonzßlez, R. Mohsen, and E. C. Lupu, Defenses,pp.98–119,2017. “Automated Dynamic Analysis of Ransomware: Benefts, Limi- [33] F. Bergstrand, J. Bergstrand, and H. Gunnarsson, Localization tations and use for Detection,”https://arxiv.org/abs/1609.03020. of Spyware in Windows Environments,. [19]S.Song,B.Kim,andS.Lee,“TeEfectiveRansomware [34] “PEView,”https://www.aldeid.com/wiki/PEView. Prevention Technique Using Process Monitoring on Android [35] “python-pefle,”https://pypi.python.org/pypi/pefle. Platform,” Mobile Information Systems,vol.2016,ArticleID [36] M. Lui and T. Baldwin, “langid., py, An of-the-shelf language 2946735, 9 pages, 2016. identifcation tool,” in Proceedings of the ACL 2012 system [20]T.Yang,Y.Yang,K.Qian,D.C.-T.Lo,Y.Qian,andL.Tao, demonstrations. Association for Computational Linguistics,pp. “Automated detection and analysis for android ransomware,” 25–30, 2012. in Proceedings of the 17th IEEE International Conference on [37]S.BirdandE.Loper,“NLTK:thenaturallanguagetoolkit,” High Performance Computing and Communications, IEEE 7th in Proceedings of the ACL 2004 on Interactive poster and International Symposium on Cyberspace Safety and Security and demonstration sessions. Association for Computational Linguis- IEEE 12th International Conference on Embedded Sofware and tics, Barcelona, Spain, July 2004. Systems, HPCC-ICESS-CSS 2015, pp. 1338–1343, USA, August [38] “Rrenaud. Gibberish-Detector,” https://github.com/rrenaud/ 2015. Gibberish-Detector/blob/master/README.rst. [21] N. Scaife, H. Carter, P.Traynor, and K. R. B. Butler, “CryptoLock [39] “VirusTotal,” http://virustotal.com. (and Drop It): Stopping Ransomware Attacks on User Data,” in Proceedings of the 36th IEEE International Conference on Dis- tributed Computing Systems, ICDCS 2016,pp.303–312,Japan, June 2016. [22]M.M.AhmadianandH.R.Shahriari,“2entFOX:Aframework for high survivable ransomwares detection,” in Proceedings of the 13th International ISC Conference on Information Security and Cryptology, ISCISC 2016, pp. 79–84, Iran, September 2016. [23]M.M.Ahmadian,H.R.Shahriari,andS.M.Ghafarian, “Connection-monitor & connection-breaker: A novel approach for prevention and detection of high survivable ransomwares,” in Proceedings of the 12th International ISC Conference on Information Security and Cryptology, ISCISC 2015,pp.79–84, Iran, September 2015. [24] D. Kim, W. Soh, and S. Kim, “Design of Quantifcation Model for Prevent of Cryptolocker,” Indian Journal of Science and Technology,vol.8,no.19,2015. [25] C. Moore, “Detecting Ransomware with Honeypot Techniques,” in Proceedings of the 2016 Cybersecurity and Cyberforensics Conference (CCC),pp.77–81,Amman,Jordan,August2016. [26]F.Mbol,J.Robert,andA.Sadighian,“AnEfcientApproach to Detect TorrentLocker Ransomware in Computer Systems,” in Cryptology and Network Security,S.ForestiandG.Persiano, Eds., vol. 10052 of Lecture Notes in Computer Science,pp.532– 541, Springer International Publishing, Cham, 2016. [27] K. Cabaj, P. Gawkowski, K. Grochowski, and D. Osojca, “Net- work activity analysis of CryptoWall ransomware,” Przegląd Elektrotechniczny,vol.91,no.11,pp.201–204,2015. Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 3473910, 13 pages https://doi.org/10.1155/2018/3473910

Research Article Enjoy the Benefit of Network Coding: Combat Pollution Attacks in 5G Multihop Networks

Jian Li ,1 Tongtong Li,2 Jian Ren,2 and Han-Chieh Chao3,4

1 School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China 2Department of ECE, Michigan State University, East Lansing, MI 48824-1226, USA 3Department of Electrical Engineering, National Dong Hwa University, Taiwan 4Institute of Computer Science & Information Engineering and Department of Electronic Engineering, National Ilan University, Taiwan

Correspondence should be addressed to Jian Li; [email protected]

Received 27 September 2018; Accepted 15 November 2018; Published 2 December 2018

Guest Editor: Karl Andersson

Copyright © 2018 Jian Li et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the upcoming 5G era, many new types of networks will greatly expand the connectivity of the world such as vehicular ad hoc networks (VANETs), Internet of Tings (IoT), and device-to-device communications (D2D). Network coding is a promising technology that can signifcantly improve the throughput and robustness of these emerging 5G multihop networks. However, network coding is generally very fragile to malicious attacks such as message content corruption and node compromise attacks. To take advantage of network coding in performance gain while refraining malicious network attacks is an interesting and challenging research issue. In this paper, we propose a new error-detection and error-correction (EDEC) scheme that can jointly detect and remove the malicious attacks based on the underlying error-control scheme for general multihop networks that can model the 5G multihop networks. Te proposed scheme can increase the throughput for network with pollution attacks compared to existing error-detection based schemes. Ten we propose a low-density parity check (LDPC) decoding based EDEC (LEDEC) scheme. Our theoretical analysis demonstrates that the LEDEC scheme can further increase the throughput for heavily polluted network environments. We also provide extensive performance evaluation and simulation results to validate the proposed schemes. Tis research ensures the expected performance gain for the application of network coding in the 5G network under malicious pollution attacks.

1. Introduction message. Network coding provides a trade-of between max- imum multicast fow rate and computational complexity. Most of the newly emerging networks in the 5G communi- Network coding was frst introduced in the seminal paper cation network can be modeled using multihop networks, by Ahlswede et al.[5].Liet al. [6] formulated the multicast including vehicular ad hoc networks (VANETs), Internet of problem in network coding as the max-fow from the source Tings (IoT), and device-to-device communications (D2D). to each receiving node. Tey proved that linear coding is New computing frameworks in the 5G network such as sufcient to achieve the optimum. Tis work made network fog computing [1] can also be categorized into multihop coding simpler and more practical. Koetter and Medard [7] networks. Te network throughput and robustness could be have shown that linear codes are sufcient to achieve the improved when network coding is utilized. Many researchers multicast capacity by coding on a large enough feld. Ho et al. have proposed to adopt network coding in the upcoming [8] have shown that using of random linear network coding 5G network, such as in [2–4]. Te core notation of network is a more practical way to design linear codes. Gkantsidis coding is that it allows the participating nodes to encode and Rodriguez [9] have applied the principles of random incoming packets at intermediate network nodes in a way that network coding to the context of peer-to-peer (P2P) content when a sink receives the packets, it can recover the original distribution and have shown that fle downloading times 2 Wireless Communications and Mobile Computing can be reduced. Esmaeilzadeh et al. [10] proposed to use is also presented in Section 3. Te EDEC scheme and feedback-free random linear network coding in broadcasting performance analysis are described in Section 4. Section 5 layered video streams over heterogeneous single-hop wireless presents the LDPC decoding and analysis of the LEDEC networks. Wu et al. [2] designed an efcient data dissem- scheme.WeconcludeinSection6. ination protocol for VANETs using network coding. Lei et al. [3] applied network coding in Named Data Networking 2. Related Work (NDN)inthe5GIoTnetworkandimprovedthecontent delivery efciency. Chi et al. [4] proposed to jointly utilize Existing work on pollution elimination can largely be divided D2D communication and network coding to achieve high into error-detection based schemes and error-correction communication reliability in the ultra-dense 5G cell deploy- based schemes. For error-detection based schemes, the errors ment. are normally detected at the intermediate forward nodes. However, in the context of network coding, all participat- For error-correction based schemes, the errors are generally ing nodes must encode the incoming packets according to a corrected at the sink node. fxed coding algorithm. If a packet from an intermediate relay While the error-correction based schemes seem to be node is corrupted or being tampered, the entire communica- more appealing, the computational complexity for encoding tion may be disrupted. and decoding is relatively high [11]. Tese schemes are gener- Te main purpose of this paper is to develop schemes that ally designed based on knowledge of the network topology, can combat network pollution and malicious attacks from the which makes these schemes less fexible to the current network nodes based on error-control coding in multihop networks. Jaggi et al. [12] introduced the frst polynomial- networks. We propose a new scheme that combines error- time rate-optimal network codes to correct the errors brought detection and error-correcting (EDEC) to combat network by malicious nodes. Tey also gave the theoretical network pollution attacks. In our scheme, the original message symbol capacity with the existence of malicious nodes. However, is frst encoded using an error-control code before encoded the communication overhead of the scheme is signifcant. In by network coding and transmitted. Te application of error- [13], the authors proposed a network error-correcting code control code gives intermediate nodes the capability to detect that combines the nonlinear coding at the source node and possible errors or pollution of the message. Unlike the linear coding at the intermediate nodes. Te code can achieve existing schemes, when an intermediate node detects an higher transmission rate than linear network error-correcting error, the packet will continue to be forwarded. As long as codes with exponential encoding and decoding complexity. the errors are correctable, the sink nodes will be able to Te limitations of error-correction based schemes make recover the corrupted packets and decode the original packet error-detection attractive in some network scenarios. Krohn symbols. Ten we further extend the EDEC scheme and et al. [14] proposed to use homomorphic hash functions to propose LDPC based EDEC (LEDEC) scheme. In the LEDEC guarantee correctness of the network fow. Te main idea is scheme we treat the packets as LDPC codewords at the sink that each intermediate node will check the correctness of the nodes and use the belief propagation algorithm (BPA) to packets through homomorphic hash functions. If a packet decode the LDPC code. It can guarantee a certain network fails the correctness check at an intermediate node, it will throughput even for a heavily polluted network environment, be discarded. Tis approach can reduce the communication while the throughput becomes very low for error-detection overhead and can be used in random network coding. based schemes. Moreover, we mainly focus on the throughput However, the computational complexity is still very high. impact brought by diferent strategies (discard versus keep) When the network scale is large, computing too many hash towards corrupted packets. values will also create high delay. Boneh et al.[15]provided Te major contributions of this paper are the following: cryptographic protection against pollution attacks by authen- ticating linear subspaces in network coding, which incurs (i) We propose an EDEC scheme by combining a mod- less computation delay than [14]. Shang et al.[16]proposed ifed error-control code and network coding. Te new homomorphic signature schemes for generation-based proposed EDEC scheme can increase the throughput network coding. for network environment with pollution attacks com- In [17–20], homomorphic message authentication codes pared to existing error-detection based schemes. (MAC) were designed to detect polluted packets. Yu et al. (ii) We propose the LEDEC scheme by augmenting the [17] applied multiple MACs to each packet in secure XOR EDEC scheme with the LDPC decoding. Te pro- network coding to flter out polluted packets. Agrawal et posed LEDEC can further improve the throughput al. [18] also designed a homomorphic MAC to check the even for network environment with heavy pollution. integrity of network coded data with less computational (iii) We conduct extensive simulations to evaluate the per- overhead. Li et al. [19] proposed to use a symmetric key formance of the proposed schemes and demonstrate based homomorphic MAC algorithms to detect corrupted the advantage of the proposed schemes. packets. In [21], the authors analyzed and improved two homomorphic authentication schemes homomorphic sub- Te rest of this paper is organized as follows: Section 2 space signature (HSS) [22] and key predistribution-based tag presents the related work. Preliminaries for network cod- encoding (KEPTE) [23] for network coding. ing, error-control coding and the proposed modifed error- To further reduce computational and communication control code are discussed in Section 3. Adversary model overhead, Kehdi et al. [24] developed a simple error-detection Wireless Communications and Mobile Computing 3 based Null Key scheme. Teir idea is to partition the �- � 1 R1 dimensional linear space over ��(� ) into two orthogonal subspaces of dimension � (symbol subspace) and �−�(null R1 R1 key space). Comparing to the homomorphic hash function, R1+R2 the null key scheme is more efcient and has virtually no R1+R2 message delay. Newell et al.[25]improvedtheworkof[24] 34 by splitting the null key into one large constant portion and another small periodically updated portion, making it more R R R2 1+ 2 suitable for network coding in wireless environments. R2 For all these error-detection based schemes, corrupted R packets will be discarded. In packetized networks, a large 2 2 message is divided into small packets. If a malicious node is able to continuously corrupt even a small fragment (packet) ofthemessage,thisfragmentwillbediscarded.Asaresult,the Source node whole message will be corrupted and the efective throughput Sink node will become lower and even close to zero in extreme situa- Relay node tions. In our previous work [26], we proposed to keep the corrupted packets instead of dropping them; thus we could Figure 1: A simple example of network coding. decode the corrupted packets in the intermediate forwarding nodes based on the subspace properties. In this paper we x further generalize the idea and combine both error-detection 1 [ ] and error-correction to guarantee a high throughput even for � [x ] [ 2] heavily polluted network environment. y =∑� � ⋅ y � = ∑� x = � [ ] , � � ,� � �,� � � [ . ] (1) � � �=1 . It is worth noting here that the authors in [27] proposed � :ℎ���(� )=�1 [ . ] using a cooperative nonparametric statistical framework [x ] (COPS) for misbehavior mitigation in network coding. Teir � framework does not need the knowledge of the data packet where ���,� ∈��(2)is the local network encoding coefcient, contents and provides another possible way to combat the ��,� ∈��(2)is the global network encoding coefcient network pollution attacks. for symbol y�,and�� =[��,1,��,2,...,��,�] is the network encoding vector. For a sink node V, there is a set of incoming � � 3. Preliminaries symbols y1,...,y� from � where ����(� )=V to be decoded. For network coding to achieve the expected benefts, In this section, we will frst introduce basic concepts and all the participating nodes in the network should be free notations of network coding. Ten we will present the of network pollution and malicious attacks. Suppose, under preliminary for error-control code and our proposed mod- the linear network coding, the sink node receives � packet ifed error-control code, which is the base of our proposed symbols y1,...,y�. It decodes the original message symbols schemes. At last we will present our adversarial model x1,...,x� by solving a set of linear equations: adopted in this paper. �1 x1 �11 ... �1� x1 y1 [ ] [ ] [ ] [ ] [ ] 3.1. Network Coding. Te main idea of network coding can be [� ] [x ] [� ... � ] [x ] [y ] [ 2 ] [ 2] [ 21 2� ] [ 2] [ 2 ] illustrated through the butterfy graph [6] in Figure 1. Assume [ ] [ ] = [ ] [ ] = [ ] . [ . ] [ . ] [ . . ] [ . ] [ . ] (2) the capacity of all the edges is �; the capacity of this network [ . ] [ . ] [ . d . ] [ . ] [ . ] is 2� according to the max-fow min-cut theorem. Only by [��] [x� ] [��1 ... ���] [x� ] [y�] encoding the incoming bits �1, �2 at node 3, this network can achieve the maximum capacity. For clarity purpose, we number the network coding vectors In this paper, we adopt the notations of [7]. A network �� received in the sink node from 1 to �. If all the relay nodes is equivalent to a directed graph �=(�,�),where� encode correctly and �≥�, the sink node can decode all the represents the set of vertices corresponding to the network message symbols successfully. nodes and � represents all the directed edges between vertices V corresponding to the communication link. Te start vertex 3.2. Error-Control Coding of an edge � is called the tail of � and written as V =����(�), while the end vertex � of an edge � is called the head of of � 3.2.1. Error-Detection. Suppose the original message symbols � and written as �=ℎ���(�). are in the �-dimensional linear space over ��(2 ).Aferwe For a source node �, there is a set of symbols X(�) = encode the symbols using the �×�generating matrix G of an � (x1,...,x�) to be sent, where x� ∈��(2), 1 ≤ � ≤ �.Fora (�, �) block code, the encoded codewords will form a linear � link � between relay nodes �1 and �2,writtenas�=(�1,�2), subspace over ��(2 ) of dimension �. So there will be another the symbol y� transmitted on it is the function of all the y�� �−�dimensional subspace over the � dimensional space, � such that ℎ���(� )=�1.And�� can be written as which is orthogonal to the codewords subspace. Denote a 4 Wireless Communications and Mobile Computing validcodewordbyc and the bases for the �−�dimensional Received subspace by h1,...,h�−�;wehave< c, h� >= 0, 1 ≤ � ≤ � − �, <⋅,⋅> Codeword where represents the inner product. Distance (� − �) × � H =[h�,...,h� ]� H Decode Let matrix 1 �−� ; forms a < (d-1)/2 right parity check matrix of the codewords and satisfes Wrong Right � Codeword Distance = d Codeword c ⋅ H = 0. (3) Decode wrong Distance > (d-1)/2 Suppose r = c+e is a received codeword, where e is an �-tuple Received error generated by a malicious node. For the received word r, Codeword we can get the syndrome of error pattern e as follows:

� � � � � s = r ⋅ H = (c + e) ⋅ H = c ⋅ H + e ⋅ H = e ⋅ H . (4) Figure 2: Limitations of error-control codes.

For a received codeword r,therearetwocases:(i)e equals to a legitimate codeword generated by G.Inthiscase, intact and corrupted packets will be gathered by the sink although r contains error, the error is undetectable using nodes. Te sink nodes can correct the corrupted packets and conventional error-control coding techniques; (ii) e contains have a higher communication throughput than the error- a nonzero projection to the orthogonal parity check subspace, detection based schemes. � which satisfes r ⋅ H =0̸ . In this case, we can detect that the received word contains error. 3.3. Modifed Error-Control Code. Te error-detection based In network coding, suppose c�’s are a set of valid schemes mainly focus on detecting the corrupt packets. As codewords, c =∑� ��c� is a linear combination of these mentioned in Section 2, when a corrupt packet is detected, codewords, where �� is the network encoding coefcient. It it will be discarded. So if an adversary continues to corrupt can be easily verifed that for 1≤�≤�−�, certain packets, these packets will be continuously dropped and the communication may never succeed. In this paper we c ⋅ h� =∑� c ⋅ h� =0. try to utilize the corrupted packets to improve the throughput � � � � (5) � in these situations. To achieve this goal, we need to propose a modifed error-control code. Equation (5) ensures that we can still check the correctness of H packet symbols using row vectors of afer they are encoded 3.3.1. Limitations of Conventional Error-Control Code. Alin- by network coding. Similar to [24], we call the row vectors � h , 1≤�≤�−� ear error-correcting code encodes the original bits message � null keys. symbol m to an � bits codeword c using a �×�generating matrix G. Suppose the minimum distance is �, according to 3.2.2. Error-Correction. From (4), it is clear that r is a the results in Section 3.2.2, the maximum number of errors codeword if and only if s =0. Te task of maximum we can correct is ⌊(� − 1)/2⌋. If the number of errors is more likelihood decoding is to fnd the minimum weight error than this amount, we may correct the corrupted codeword � � pattern e such that r ⋅ H = e ⋅ H . In this case, the received r into a false one, as illustrated in Figure 2. is corrected to r + e = c. In linear network coding, although the packet symbols 3.3.2. Proposed Error-Control Code Tat Can Detect Erroneous arenottheoriginalonessentfromsourcenodes,wecan Decodings. Te conventional error-control code may have r ,...,r still perform error-correction using (4). Suppose 1 � are undetected decoding errors. Tis is an inherent nature. No � e received codewords from incoming edges; is the error matter how low we set the code rate, these undetected errors r =∑� r vector added to the network encoding � � � and may still exist. Te decoding errors can only be detected using � r = r + e. If the error is within the correction capability of mechanisms other than a stand-alone error-correcting code. � � the error-control code, the syndrome will still be r ⋅ H = Terefore, we propose to apply modifed error-control � � � � e⋅ H +∑� ��r� ⋅ H = e⋅ H + 0 = e⋅ H .Tuswecancorrect code to both message symbols and network coding coef- the error using syndrome decoding. cients in (1). In this section, we will use the message symbol as m Te error-detection and error-correction capabilities are an example. Te original message symbol is frst mapped � h h determined by the (�, �) code structure. Based on the pol- to a bit value using homomorphic hashes. Ten will m �+� lution levels of the network, we can select the error-control be appended to to form a new bits message symbol. codes accordingly base on the following proposition. By encoding this new message symbol we can get the fnal codeword. So the code becomes an (�, � + �) code. By adding Proposition 1 (see [28]). For an (�, �) block code with the the extra bits, we can mitigate limitations of the conventional minimum distance �, it can detect all the �−1or less errors, error-control code. Figure 3 illustrates the modifed encoding or it can correct all the ⌊(� − 1)/2⌋ or less errors. scheme. At the sink nodes, the decoded symbol is frst split � � In our proposed schemes, the corrupted packets detected into two parts m and h afer the decoding as shown in �� � �� at the intermediate nodes will not be discarded. Both the Figure 4. Ten we calculate the mapping h from m .Ifh Wireless Communications and Mobile Computing 5

Original message successful probability of forging a valid symbol symbol, k bits Encode for n=3232, k=3072, b=3, s=40 New message to n bits 1 map symbol, k+t bits (n,k+t) code Send t bits 0.8 appendix

0.6 X: 6 Figure 3: Te encoding process of modifed error-control code in Y: 0.5008 EDEC scheme. 0.4

Decode Received to k+t bits k+t bits Split t bits 0.2

codeword symbol appendix the check passing of probability Split Compare ? 0 k bits message map t bits 0 10 20 30 40 50 60 symbol appendix no. of errors (1's) in a bogus symbol

Comparing Results: Equal: Right decoding Figure 5: Probability for a forged symbol to pass the check. Unequal: Wrong decoding

Figure 4: Te decoding process of modifed error-control code in Whenthesizeofthegeneratingmatrixislarge,wewilluse EDEC scheme. sparse matrices G and H to reduce the encoding and decoding complexity. Suppose there are �1’s in each row and in each h� G and are diferent, we can detect a decoding error. Our column on average in matrix F =[H ] for �≪�,wehavethe � modifcation is equivalent to choose 2 message symbols following theorem. �+� �+� from 2 symbols. Other message symbols in the 2 symbol Teorem 3. � space are considered to be illegal. Because the decoding Te probability � that a malicious node success- algorithm only guarantees that the decoded codeword is fully forges a valid symbol is �+� in the dimensional subspace, the decoded codeword � �+� � ( ) belonging to the 2 −2 symbol space implies that the � = �� , � ( �−� ) (7) decoding contains error. �� ( � ) � � Teorem 2. Suppose a decoding error occurs, the wrong where � is the number of -combinations from a set of �+� �≪� 1 codewordwillbeanycodewordinthe2 symbol space. So elements and is the number of ’s in a bogus symbol. the probability of detecting an erroneous decoding is Proof (sketch). When the number of errors, which are 1’s in 2�+� −2� 2� −1 1 the bogus symbol, is small compared to �, for each error bit �= = =1− . (6) 2�+� 2� 2� with index � there will be � distinct vectors with 1’s at the index � in F. Te bogus symbol will be linear dependent with these �, � Trough properly choosing the parameters ,wecan � vectors with high probability. Tus a bogus symbol with � detect erroneous decodings with little additional overhead. errors (�≪�) will be linear dependent with at most �� vectors, which may include rows from both matrices G and 3.4. Adversary Model. In network coding, a small error H. injected into an intermediate relay node may difuse to many When a malicious node forges a symbol which is only packets at the sink node, which will incur errors in the linear dependent with the row vectors in G, it will successfully message symbols afer the solving of (2). Tis can cause a forgeavalidsymbol.Sincethemaliciousnodedoesnotknow signifcant waste of network resources and sometimes can the row vectors of the matrices G and H except for the null even ruin the whole network communication. Tis kind of key vectors it has, these row vectors can be viewed as random attack is called pollution attack. to the malicious node. Te successfully probability is the In this paper, we assume that the malicious node can number of ways of choosing �� vectors in all the row vectors inject bogus packets into the network. If the succeeding relay of G divided by that of choosing �� vectors in all the row nodes do not detect the bogus packets and produce network vectors of F excluding the � null key vectors the malicious coded packets using the bogus packets, we call these packets node has already known. corrupted packets. Te malicious nodes try to forge legitimate packets that can pass the check in (5). In error-detection For large �, �, the malicious node has to make � small to based schemes that use null keys to check the validity of achieve a high probability ��. In a typical parameter setting packetsymbols,thesourcenodewillrandomlydistribute with � = 3232, � = 3072, �=3, �=40,eachintermediate � null keys in each intermediate relay node for packet node will receive 40 null keys randomly selected from 160 checking. Te malicious node can increase its successful null keys. Te successful probability of forging a valid symbol attack probability by producing bogus symbols orthogonal to is shown in Figure 5. From the fgure we can see that to the space spanned by the � null keys it receives. achieve successful probability �� =0.5,themaliciousnode 6 Wireless Communications and Mobile Computing

the sink nodes during the initialization phase through the secure transmission protocol. Tis will prevent the malicious Original message symbols k nodes from corrupting the encoding vector, which is essential Dimension = Relay network Source for the data decoding. Moreover, the source node will also applying network G Node Coded message symbols coding send the encoding matrix � for network encoding vectors Encode with Dimension = n and G for packet symbols to all the sink nodes. Once the (n, k+t) modifed error control code initialization phase is done, the source nodes can multicast any number of packets to sink nodes. Te overhead of the initialization phase is negligible. Figure 6: Apply error-control codes in linear network coding. 4.1.2. Transmission Phase. In the transmission phase, the source node, relay nodes, and sink nodes will perform the can only add 6 errors into a forged symbol, which are easily proposed EDEC scheme according to Algorithms 1, 2, and 3. correctable by the (3232, 3072) code with minimum distance In Algorithm 1, the source node will encode the network � = 161 . Tus for the malicious node, in order to produce encoding vector �� using the modifed error-control code bogus symbols that can pass the check, it will forge symbols with withamuchlongerappendixandamuchlowercoderate, a small number of errors (1’s) and orthogonal to the null key compared to the encoding of packet symbols. Tis can vectors it has. improve the error resistance and detection probability for For the error-control code used to detect the bogus erroneous decodings to guarantee the correctness of the symbols, the errors within the bogus symbols are correctable. network encoding vectors used for data decoding. Since there However, in the error-detection based schemes that do not is only one network encoding vector in each packet, the perform error-correcting, the bogus symbols that pass the overhead brought by this higher security level is negligible. intermediate check will result in a failure of decoding the Algorithm 2 presents the EDEC algorithm for relay nodes. original message symbols at sink nodes. Sincethenullkeyvectorsarealreadydistributedinthe initialization phase, the relay nodes can check whether a 4. Proposed EDEC Scheme and Evaluation packet is intact. Algorithm 3 presents the EDEC algorithm for sink nodes. Similar to the null key scheme, our approach also utilizes Since a sink node has already received the encoding matrix the orthogonal space properties of the error-control code, G� and G in Algorithm 1 in the initialization phase, it can but we use both the error-detection and error-correction perform the error-control code decoding and detect the techniques. Te source node will encode the original message erroneous decodings. When collecting enough intact packets, symbols using the modifed error-control code prior to it can derive the original data symbols through the decoding message transmission as shown in Figure 6. Te properties of network code. of the error-control code keep unchanged during the linear network coding. When a corrupted packet is detected, instead 4.2. Simulation Results of EDEC Scheme. In this section, we of dropping it, we forward it to the sink nodes so that the frst present the simulation platform for EDEC scheme. Ten packet can be used along with other packets to recover the we will compare the EDEC scheme and the error-detection original messages. However, the corrupted packet will not based schemes, which are represented by the null key scheme participate in network coding in the subsequent relay nodes in the simulation. once it is detected to be corrupted. Te sink node can make use of the corrupted packets according to the decoding results 4.2.1. Simulation Platform. We simulate the EDEC scheme of the modifed error-control code in Section 3.3. using ns-2. Te scenario is set as a grid network with one source node, a number of relay nodes, and sink nodes. All 4.1. Te Proposed EDEC Scheme. Te proposed EDEC the nodes are set as wireless nodes using wireless physical scheme is divided into two phases: initialization phase and layer and 802.11 MAC protocol. Te wireless channel is set to transmission phase. Te initialization phase is for null key TwoRayGround. Once a node receives a packet, it will start and security parameter distribution while packet symbols the corresponding operations depending on its type (source, are transmitted through network coding in the transmission relay, malicious, and sink) and the packet content. phase. Figure 7 shows the topology of the simulated network. Te source node is located at the lower lef corner and 13 4.1.1. Initialization Phase. In the initialization phase, the sink nodes lie at the upper right. Te rest of the nodes source node will frst distribute � row vectors of the parity are all intermediate nodes that can relay packets. In the check matrix orthogonal to G in Algorithm 1 (null keys) simulation, we randomly choose a number of intermediate to each of the relay nodes through a secret transmission nodesasmaliciousnodestoperformpollutionattacks.Tese protocol. Unlike normal linear network coding in which the nodes try to send out bogus packets that can pass the network encoding vectors are attached to the start or the end intermediate check as described in Section 3.4 to pollute of the packets, we propose to distribute the encoded encoding the network. We can change the number of malicious nodes vectors to predetermined secret locations in the packets. to evaluate performance of the algorithms under diferent Te source node will send the location information to all network conditions. Te percentage of malicious nodes is Wireless Communications and Mobile Computing 7

(1) for packet i do (2) //Encode network encoding vector �� in equation (1) using the modifed error-control code (Figure 3) (3) h� ←� map(��) (4) u� ←� ( �� | h�) (5) Encoded network encoding vector ←� u� ⋅ G� (6) Distribute the encoded network encoding vector into predetermined locations of the packet (7) for every symbol m of the packet do (8) //Encode m using the modifed error-control code (Figure 3) (9) h ←� map(m) (10) u ←� ( m | h) (11) Encoded symbol ←� u ⋅ G (12) end for (13) Send out the encoded symbols (14) end for

Algorithm 1: EDEC algorithm for source nodes.

(1) if every symbol in the received packet is intact then (2) if � (a predetermined number) packets received then (3) Generate � randomly, linearly combined packets using the received packets (network coding) (4) Send out the network encoded packets (5) end if (6) else (7) Mark the packet as corrupted and send it out (8) end if

Algorithm 2: EDEC algorithm for relay nodes.

(1) A packet is received (2)Decodeeverysymbolinthepacketusingthedecodingalgorithmforthemodifederror-controlcode(Figure4) (3) Reassemble the encoded network encoding vector from the predetermined secret locations and decode the network encoding vector (4) if the network encoding vector and all symbols are decoded correctly then (5) if the packet is independent then (6) Save the packet (7) if � (in equation (2)) independent packets are saved then (8) Solve the network coding equations (9) end if (10) end if (11) end if

Algorithm 3: EDEC algorithm for sink nodes.

about 20% in Figure 7. Te rest of the intermediate nodes act still have packets collisions that will eventually infuence the as relay nodes. Afer receiving a packet, they will frst conduct simulation results. In this paper, we only focus on evaluating pollution detection. In the error-detection based schemes, if the proposed EDEC and LEDEC schemes. According to the packet is corrupted, it will be dropped. In our proposed the transmission range of a single node, adjacent nodes are EDEC scheme, we will continue to forward the corrupted assigned diferent time slots (see Figure 8) to avoid packets packet to the sink nodes. However, the corrupted packet will collisions. Tis can guarantee that the transmission range not participate in the succeeding network coding. Te nodes belonging to the same time slot would not overlap. We divide behaviors will be detailed in the next section. the entire time duration into 9 time slots and the duration of Because the packets are transmitted through broadcast- each time slot is 100 ms. Te nodes are allowed to send packets ing, although the MAC protocol is IEEE 802.11, we will only if they are in their own slots. If not, they will have to wait 8 Wireless Communications and Mobile Computing

according to Section 4.1.1, the source node will encode Network the network encoding vector �� for each packet using the modifed error-control code presented in Figure 3 with �= 32. According to Teorem 2, the probability of detecting an erroneous decoding for the network encoding vector is −32 1−2 , which means once the decoded network encoding vector passes the verifcation in Figure 4, we can view the network encoding vector as intact. Ten the source node will distribute the encoded network encoding vector into predetermined locations in each packet. We fragment each packet embedding the encoded network encoding vector into symbols with the size 3056 bits and encode each packet symbol using the modifed error-control code with �=16. At last, the encoded packets are sent out into the network.

(b) Relay Nodes. Relay nodes will perform EDEC scheme according to Algorithm 2. Because the network is collision Source node free and all the transmitted packets can be received, each Sink node packet only needs to be transmitted once. To utilize network Relay node Malicious node coding efciently while minimizing the transmission delay, each relay node will perform network encoding for every Figure 7: Simulation scenario. �=4valid packets it receiveds.

(c) Malicious Nodes. We assume that the malicious nodes will not perform network encoding. Tey only send out forged packets and conduct pollution attacks as described in Section 3.4. 4 0140 (d) Sink Nodes. Sink nodes will decode both the modifed error-control code and network code according to Algo- 8 557 8 rithm 3.

623 62 4.2.3. Simulation Results. We conduct simulations under diferent percentages of malicious relay nodes. Te through- put comparison between the EDEC scheme and the error- 40140 detectionbasedschemesisshowninFigure9.Fromthe fgure we can see that the EDEC scheme outperforms the 88557 error-detection based schemes in throughput: (i) When the percentage of malicious nodes is less than 10%, the two schemes have almost the same performance. (ii) With the increasing of malicious nodes, the performance of error- detection based schemes degrades signifcantly, while the throughput of the EDEC scheme remains almost unchanged. Broadcast range of nodes in time slot 2 (iii) When the percentage of malicious nodes is larger than 40%, the throughput of the EDEC scheme begins to decrease Figure 8: Illustration of the 9 time slots to avoid packets collisions. because the cumulative number of errors in packet symbols becomes too large to correct by the modifed error-control code. Te uncorrectable packet symbols have the same until their next slots. In Figure 8 is an example in which the degrading efect on the throughput as the packets discarded nodes belonging to time slot 2 can simultaneously transmit in the error-detection based schemes. without packet collision. 5. LDPC Decoding and LEDEC Scheme 4.2.2. Nodes Design. Four types of nodes are designed according to the algorithms described in Section 4.1 on the In the EDEC scheme, only linearly independent packets simulation platform. participate in the network decoding at the sink nodes. Corrupted or linearly dependent packets will not be used, (a) Source Node. In the simulation, the source node has which is a waste of the resource. In this section, through �=32data packets to send. Afer initializing the network utilizing these packets to recover more message symbols Wireless Communications and Mobile Computing 9

(1) while Tere are check nodes connected to only one unknown symbol node do (2) for Each of these check nodes do (3) Te unknown symbol node ←� xor (All of the other symbol nodes connected to the check node) (4) end for (5) end while (6) if All the unknown symbol nodes are recovered then (7) Decode successfully (8) end if

Algorithm 4: BPA decoding algorithm for BEC.

1 01x 1 d1 d2 d3 d4 Symbol nodes 0.8 1110 0 xor 1 = 1 H = 00 11 0 010 1 1 0.6 h1 h2 h3 Check nodes

0.4 Figure 10: An illustrative example of parity check matrix and Tanner Troughput graph. 0.2

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 check nodes (corresponding to the rows of the parity check Percentage of malicious nodes in intermediate nodes matrix). An illustrative example of the parity check matrix EDEC anditsTannergraphisshowninFigure10.Intheparity Error-detection Based Schemes check matrix, every row represents a parity check equation. Te symbol nodes, which correspond to the bits equal to 1’s in Figure 9: Troughput comparison between EDEC scheme and the error-detection schemes. a row of the parity check matrix, are connected to the check node which corresponds to the same row. Tese nodes and edges in the Tanner graph express the parity check equation of that row. In Figure 10, node ℎ1 represents the frst row of using low-density parity check (LDPC) decoding, we propose the parity check matrix. Te frst, second, and third elements an LDPC decoding based EDEC (LEDEC) scheme. of the frst row in parity check matrix are 1’s, so symbol nodes �1, �2,and�3 are connected to ℎ1 in the Tanner graph. 5.1. LDPC Code. LDPC linear block code was frst introduced Te decoding algorithm can be described through by Gallager in 1962 [29]. One of the important characteristics Algorithm 4 . of LDPC code is its sparse parity check matrix. By using iterative decoding, LDPC code can achieve error-correction 5.3. Relationship between Linear Network Code and LDPC performance close to Shannon bounds [30]. LDPC codes can Code. In linear network coding, packets are linearly com- be categorized as the regular LDPC code, of which the parity bined at the intermediate nodes. Te packets received at check matrix H has a fxed number of 1’s per column and per the sink nodes satisfy (2). In the network code decoding row, and the irregular LDPC code, of which the parity check part of the EDEC algorithm, only independent valid packets matrix may have diferent numbers of 1’s in each column areused.However,thereisalsohelpfulinformationinthe and each row. In this section, we will formulate the network linearly dependent packets or corrupted packets. If we can coding to the irregular LDPC code. exploit and use these packets, the system performance can be further improved. Denote the received encoding vector as 5.2. Decoding of LDPC Code. Te iterative decoding algo- �� =(��,1,...,��,�),where1≤�≤�and � is the number rithm, known as belief propagation algorithm (BPA), is of received packets. Ten the generation matrix of the block generally used to decode the LDPC code. Among all the network code G� can be defned as channel models the BPA algorithm for the binary erasure � � � � � channel (BEC) is the simplest, where only three numbers G� =[�1 ⋅⋅⋅�� , ��+1 ⋅⋅⋅��] . (8) need to be considered: 0, 1,and� (erasure). Te BPA can � be described over the Tanner graph [31], which is a bipartite As an example, suppose there is only one bit � in each graph. In a Tanner graph, there are two types of nodes: the original packet of the source node for 1≤�≤�. Defne � symbol nodes (corresponding to the received bits) and the x =(�1,...,��) . In this case, each received packet in the sink 10 Wireless Communications and Mobile Computing

nodes also contains only one bit �� for 1≤�≤�.Denoteall � the � received packets as a vector y =(�1,...,��) .Wehave Codeword 1 Modifed Bit 1 Bit 2 ... Bit k in packet 1 decoder the following encoding equation:

Codeword 1 Modifed y = G�x. (9) Bit 1 Bit 2 ... Bit k in packet 2 decoder Teorem 4. Te linear network code can be viewed as a rate- ...... Codeword 1 less LDPC code and can be decoded using the BPA algorithm. Modifed Bit 1 Bit 2 ... Bit k in packet m decoder Proof (sketch). Since we can have an uncertain number of �>�received encoded bits in the encoding equation (9), we LDPC LDPC ... LDPC Decoder Decoder Decoder can view the network code as a rateless code. Te generating matrix G� can be rewritten as Figure 11: Main idea of the LEDEC scheme.

� �� G =[ 1]=[ ]�, � −1 1 (10) �2 �2�1 For the LEDEC scheme, let �� denote the probability where matrix �1 canbemadeasan�×�full rank matrix that an edge from a check node is connected to a symbol � � through row exchange afer � independent packets are node of degree ,and � denote the probability that an edge received, �2 is an (�− �) × � matrix, and �� is an �×�identity from a symbol node is connected to a check node of degree � matrix. Te corresponding parity check matrix H� can be in the Tanner graph of the corresponding LDPC code. written as Te generating functions for an LDPC code is defned as: �(�) = ∑ � ��−1 �(�) = ∑ � ��−1 � � � , � � . According to [32], (� �−1) the maximal fraction of erasures that a random LDPC code H =[ 2 1 ], � (11) with given generating functions can correct is bounded by ��−� �max = min{�/�(1 − �(1 − �))} (0 < � < 1) with probability −3/4 at least 1−O(� ),where� is the length of the code. For the where ��−� is an (�−�)×(�−�) identity matrix. We can verify throughputoftheLEDECscheme,wehaveTeorem5. the correctness of H� by verifying the follow equation: � Teorem 5. Te throughput of the LEDEC scheme is −1 � � � (�2�1 ) � H�G� =[ ] [ ]�1 = 0. (12) ⌊�⋅� ⌋ � �−1 max � ��−� 2 1 � �−� �= ∑ ( )�� (1 − ��) , (13) �=0 � Afer deriving the corresponding parity check matrix H�,we can decode the linear network code using the BPA algorithm. where �max = min{�/�(1 − �(1 − �))} (0 < � < 1), �� is Te linear network code can be viewed as a rateless LDPC the erasure probability, � is the number of packets a sink node code and has the property of error-control codes. received, and ⌊⋅⌋ is the foor function.

Although linear network codes can be viewed as rateless Proof. Suppose a sink node receives � packets and the LDPC codes, the BPA algorithm cannot be used to decode a erasures in the packets are independent; the distribution of network code if the network code is derived afer the decod- the number of erasures � in � received packet symbols is a ing of a conventional error-control code, because we cannot � � �−� binomial distribution with Pr(�) = ( � )��(1−��) ,0≤�≤ detect the incorrect decodings which should be viewed as � as the probability mass function (PMF). erasures. For the modifed error-control codes in the EDEC Te proposed scheme can combat all erasures up to �⋅ scheme, we can determine the erroneous decodings and mark −3/4 �max with probability at least 1−O(� ), which is close to the corresponding bits as erasures. Terefore, we can decode ⌊�⋅� ⌋ �=∑ max (�) the linear network code using the BPA algorithm. Figure 11 1. Tus the throughput can be written as �=0 Pr . illustratesthismainideaoftheLEDECscheme.

5.4. Teoretical Analysis. Trough the information in linearly 5.5. Performance Analysis and Simulation. In this section, dependent packets, the LEDEC scheme can get additional we provide simulation results of the LEDEC scheme on the benefts from BPA decoding of the LDPC code. Consider simulation platform presented in Section 4.2. All the settings thecaseinwhichthepercentageofmaliciousnodesishigh, and parameters are the same as in Section 4.2. most of the packets are corrupted by the malicious nodes. Te error-detection based schemes cannot work because all of the 5.5.1. Nodes Design. For the LEDEC scheme, the source node, packets are discarded. Te EDEC scheme does not work well relay nodes, and malicious nodes are the same as those either because with high erasure probability �� there will not in Section 4.2. Te decoding process in the sink nodes is be enough correctable packets to solve the network coding diferent, in which all packets received will be used. Te equations (2). BPA decoding will not start until the sink nodes collect all Wireless Communications and Mobile Computing 11

1 55 50 0.8 45 40 0.6 35

30 0.4 Troughput 25 20 0.2 15 10 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 5 Percentage of malicious nodes in intermediate nodes 10 20 30 40 50 60 70 80 90 LEDEC Figure 12: An example of the parity check matrix (transposed) in EDEC network coding. Error-detection Based Schemes Figure 13: Performance evaluation of the LEDEC scheme.

�=32 �=32 the independent packets. Afer receiving model most of the newly emerging networks in the 5G independent packets, we can use the BPA algorithm to decode network. Our purpose is to maintain the network throughput whenever a new packet arrives. However, there is a trade- even when the percentage of malicious nodes is large (≥30%). of in determining when to start the BPA algorithm. When We frst introduce an error-detection and error-correction the algorithm is performed too frequently, the computational (EDEC) scheme. By utilizing the information available in the overhead will be high. On the other side, if we do not start corrupted packets, the network throughput can be increased the BPA decoding until we have collected a large number of with only a slight increase of the computational overhead packets, the communication delay will be high. To get a trade- compared to the error-detection based schemes. To further of, the sink nodes will trigger the BPA decoding upon the increase the throughput for the network environment with receiving of every 10 new packets. Tis process will continue heavy pollution, we introduce LEDEC scheme that enables until all the message symbols have been successfully decoded. channel information be exploited and belief propagation algorithm (BPA) be used for the packet symbol recovery. 5.5.2. Simulation Results. Same as in Section 4.2.3, the Tis scheme can guarantee the throughput under the heavy simulations in this section are carried out under diferent pollution (percentage of malicious nodes is larger than 40%). percentage of malicious relay nodes. One example of the We formulate the throughput of the LEDEC scheme through parity check matrix (transposed) generated in the linear theoretical analysis and conduct comprehensive simulations network coding is shown in Figure 12. In this example, the to evaluate the performance. Our extensive simulation results sink node receives 90 packets and decodes the linear network show that the LEDEC scheme achieves better throughput code using the BPA algorithm. In the fgure, white squares than the EDEC scheme. represent 0 and black squares represent 1. Te performance of the LEDEC scheme is shown in Figure 13. Data Availability Remark 6. When the percentage of the malicious nodes is less Te data used to support the fndings of this study are than 40%, the performance of the LEDEC scheme is slightly available from the corresponding author upon request. better than the EDEC scheme. Tis is because the sink nodes can successfully correct most of the errors in the corrupted Disclosure packets and decode the original symbols using error-free packet symbols. Jian Li’s frst part of the work was done when he was with the Department of ECE, Michigan State University, East Lansing, Remark 7. Because the sink nodes can recover extra infor- MI 48824-1226. mation from the corrupted packets, when the percentage of malicious nodes is between 40%and60%, the LEDEC Conflicts of Interest scheme outperforms the EDEC and the error-detection based schemes. Te authors declare that they have no conficts of interest.

6. Conclusion Acknowledgments In the paper, we focus on combating pollution attacks in Tis work was supported in part by the Fundamental multihop networks that utilize network coding, which can Research Funds for the Central Universities KWJB17052536 12 Wireless Communications and Mobile Computing

and in part by the National Natural Science Foundation of Cryptography-PKC 2009,vol.5443ofLecture Notes in Comput. China under Grant 61701019. Tis work is based on the work Sci., pp. 68–87, Springer, Berlin, Germany, 2009. of the PhD dissertation of Jian Li [33]. [16] T. Shang, T. Peng, Q. Lei, and J. Liu, “Homomorphic Signature for Generation-based Network Coding,” in Proceedings of the References 2016 IEEE International Conference on Smart Cloud, Smart- Cloud 2016, pp. 269–273, USA, November 2016. [1]W.Zhang,Z.Zhang,andH.Chao,“Cooperativefogcomputing [17]Z.Yu,Y.Wei,B.Ramkumar,andY.Guan,“Anefcient for dealing with big data in the internet of vehicles: architecture scheme for securing XOR network coding against pollution and hierarchical resource management,” IEEE Communications attacks,” in Proceedings of the 28th Conference on Computer Magazine,vol.55,no.12,pp.60–67,2017. Communications (INFOCOM ’09), pp. 406–414, April 2009. [2]C.Wu,X.Chen,Y.Ji,S.Ohzahata,andT.Kato,“Efcient [18] S. Agrawal and D. Boneh, “Homomorphic MACs: MAC-based broadcasting in VANETs using dynamic backbone and network integrity for network coding,”in Applied Cryptography and Net- coding,” IEEE Transactions on Wireless Communications,vol.14, work Security: 7th International Conference, ACNS 2009, Paris- no. 11, pp. 6057–6071, 2015. Rocquencourt, France, June 2–5, 2009. Proceedings,vol.5536 [3] K. Lei, S. Zhong, F. Zhu, K. Xu, and H. Zhang, “An ndn of Lecture Notes in Computer Science, pp. 292–305, Springer, iot content distribution model with network coding enhanced Berlin, Germany, 2009. forwarding strategy for 5g,” IEEE Transactions on Industrial [19] Y. Li, H. Yao, M. Chen, S. Jaggi, and A. Rosen, “RIPPLE Informatics,vol.14,pp.2725–2735,2018. authentication for network coding,” in Proceedings of the IEEE [4]K.Chi,L.Huang,Y.Li,Y.-H.Zhu,X.-Z.Tian,andM. INFOCOM, pp. 1–9, March 2010. Xia, “Efcient and reliable multicast using device-to-device [20] A. Le and A. Markopoulou, “Cooperative defense against communication and network coding for a 5G network,” IEEE pollution attacks in network coding using SpaceMac,” IEEE Network,vol.31,no.4,pp.78–84,2017. Journal on Selected Areas in Communications,vol.30,no.2,pp. [5]R.Ahlswede,N.Cai,S.Y.R.Li,andR.W.Yeung,“Network 442–449, 2012. information fow,” IEEE Transactions on Information Teory, [21] C. Cheng, J. Lee, T. Jiang, and T. Takagi, “Security analysis and vol.46,no.4,pp.1204–1216,2000. improvements on two homomorphic authentication schemes [6] S. R. Li, R. W.Yeung, and N. Cai, “Linear network coding,” IEEE for network coding,” IEEE Transactions on Information Foren- Transactions on Information Teory,vol.49,no.2,pp.371–381, sics and Security, vol. 11, no. 5, pp. 993–1002, 2016. 2003. [22] P. Zhang, Y. Jiang, C. Lin, H. Yao, A. Wasef, and X. Shenz, [7] R. Koetter and M. M´edard, “An algebraic approach to network “Padding for orthogonality: Efcient subspace authentication coding,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, for network coding,” in Proceedings of the IEEE INFOCOM 2011, pp. 782–795, 2003. pp. 1026–1034, China, April 2011. [8]T.Ho,B.Leong,R.Koetter,M.Medard,M.Efros,andD. [23] W.Xiaohu, X. Yinlong, Y. Chau, and X. Liping, “Atag encoding Karger, “Byzantine modifcation detection in multicast net- scheme against pollution attack to linear network coding,” IEEE works using randomized network coding,” in Proceedings of the Transactions on Parallel and Distributed Systems,vol.25,no.1, International Symposium on Information Teory ISIT’ 04,p.44, pp. 33–42, 2014. July 2004. [24] E. Kehdi and B. Li, “Null keys: limiting malicious attacks via null [9] C. Gkantsidis and P. Rodriguez, “Cooperative security for space properties of network coding,” in Proceedings of the 28th network coding fle distribution,” in Proceedings of the IEEE Conference on Computer Communications, IEEE INFOCOM INFOCOM ’06, pp. 1–13, April 2006. ’09, pp. 1224–1232, April 2009. [10] M. Esmaeilzadeh, P. Sadeghi, and N. Aboutorab, “Random lin- [25] A. Newell and C. Nita-Rotaru, “Split Null Keys: A null space ear network coding for wireless layered video broadcast: general based defense for pollution attacks in wireless network coding,” design methods for adaptive feedback-free transmission,” IEEE in Proceedings of the 2012 9th Annual IEEE Communications Transactions on Communications,vol.65,no.2,pp.790–805, SocietyConferenceonSensor,MeshandAdHocCommunications 2017. and Networks, SECON 2012,pp.479–487,RepublicofKorea, [11] N. Cai and R. W.Yeung, “Network coding and error correction,” June 2012. in Proceedings of the IEEE Information Teory Workshop (ITW [26] W.Qiao, J. Li, and J. Ren, “Anefcient error-detection and error- ’02), pp. 119–122, 2002. correction (edec) scheme for network coding,”in Proceedings of [12] S. Jaggi, M. Langberg, S. Katti, T. Ho, D. Katabi, and M. the IEEE Globecom ’11, pp. 1–5, December 2011. M´edard, “Resilient network coding in the presence of Byzantine [27] A. Antonopoulos and C. Verikoukis, “COPS: cooperative statis- adversaries,” in Proceedings of the IEEE INFOCOM 2007: 26th tical misbehavior mitigation in network-coding-aided wireless IEEE International Conference on Computer Communications, networks,” IEEE Transactions on Industrial Informatics,vol.14, pp.616–624,USA,May2007. no.4,pp.1436–1446,2017. [13] W. Guo, D. He, and N. Cai, “On capacity of network error [28] S. Lin and D. J. Costello, Error Control Coding, Prentice Hall, correction coding with random errors,” IEEE Communications 2nd edition, June 2004. Letters, vol. 22, no. 4, pp. 696–699, 2018. [29] R. G. Gallager, “Low-density parity-check codes,” IEEE Trans- [14] M. N. Krohn, M. J. Freedman, and D. Mazi`eres, “On-the- actions on Information Teory,vol.8,pp.21–28,1962. fy verifcation of rateless erasure codes for efcient content [30] C. E. Shannon, “Amathematical theory of communication,” Bell distribution,” in Proceedings of the 2004 IEEE Symposium on Labs Technical Journal,vol.27,pp.379–423,1948. Security and Privacy,pp.226–239,May2004. [31] R. M. Tanner, “A recursive approach to low complexity codes,” [15] D. Boneh, D. Freeman, J. Katz, and B. Waters, “Signing a linear IEEE Transactions on Information Teory,vol.27,no.5,pp.533– subspace: signature schemes for network coding,” in Public Key 547, 1981. Wireless Communications and Mobile Computing 13

[32] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, “Efcient erasure correcting codes,” IEEE Transac- tions on Information Teory,vol.47,no.2,pp.569–584,2001. [33] J. Li, Capacity assurance in hostile networks [Ph.D. dissertation], Michigan State University, 2015. Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 6809796, 8 pages https://doi.org/10.1155/2018/6809796

Research Article Anonymous Communication via Anonymous Identity-Based Encryption and Its Application in IoT

Liaoliang Jiang,1 Tong Li ,1 Xuan Li,2 Mohammed Atiquzzaman,3 Haseeb Ahmad,4 and Xianmin Wang 1

1 School of Computer Science, Guangzhou University, Guangzhou 510000, China 2College of Mathematics and Informatics, Fujian Normal University, Fujian 350000, China 3School of Computer Science, Te University of Oklahoma, USA 4Department of Computer Science, National Textile University, Faisalabad, Pakistan

Correspondence should be addressed to Tong Li; [email protected]

Received 30 June 2018; Accepted 24 September 2018; Published 21 November 2018

Guest Editor: Karl Andersson

Copyright © 2018 Liaoliang Jiang et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Under the environment of the big data, the correlation between the data makes people have a greater demand for privacy. Moreover, the world has become more diversifed and democratic than ever before. Freedom of speech is considered to be very important; thus, anonymity is also a very important security demand. Te research of our paper proposes a scheme which can ensure both the privacy and the anonymity of a communication system, that is, the protection of message privacy while ensuring the users’ anonymity. It is based on anonymous identity-based encryption (IBE), by which the users’ �������� areprotected.Weimplement our scheme in JAVA with Java pairing-based cryptography library (JPBC); the experiment shows that our scheme has signifcant advantage in efciency compared with other anonymous communication system. Internet-of-Tings (IoT) involves many devices, and privacy of devices is very signifcant. Anonymous communication system provides a secure environment without leaking metadata, which has many application scenarios in IoT.

1. Introduction the sending and receiving entities and whether a message is valuable. Overall, such pieces of information are called In the era of big data, data privacy has become signifcant ��������.Te�������� involves crucial private data in the as more personal and organizational information is involved. anonymous communication system and it is also a critical Moreover, the world has become more diversifed and demo- parameter of the monitoring for the government and other cratic than ever before. Following this trend, researchers stakeholders [3]. Anonymous communication system is very have unveiled various ways through which adversaries can practical, and it can be applied to many realistic scenarios. access private or otherwise sensitive information by network Cloud computing has received more and more attention breach such as a called telephone number [1] or the IP in recent years, and privacy-preserving and security under address [2]. Terefore, the ultimate objective of privacy is cloud environment have become critical issues [4–7], so high- to protect not only the contents of the messages but also level anonymity is required. In the neural network [8, 9] and the identities of communication parties, the actual time of secure deduplication [10], the anonymity is also critical. a communication, and specifc user participation during Some private messaging systems are built on previous communication. For the above reasons, the research on work [11–14]. However, such schemes do not provide efcient privacy protection protocols such as anonymous commu- protocols without leaking ��������, which breaches the nication system is imperative. In an anonymous commu- purpose of anonymous communication systems. Tere exist nication system, the adversary must not know participant’s two categories of anonymous communication protocols. Te identity at any time. Further, the adversary should not know frst one is called Tor [15], which achieves the anonymity 2 Wireless Communications and Mobile Computing by encrypting the messages in layers under the public key of the message is based on the decryption by the recipient cryptography. Several servers are employed in Tor, and the rather than the cooperations of all the users, so the system message is encrypted into a data packet by all servers’ public doesnotneedtheuserstobeon-lineallthetimeduringthe key in turns (the construction of an onion). Te encrypted communication, which is a remarkable advantage compared data package is transmitted by the �ℎ��� comprised of all with the verdict and dissent. Since every user has to perform servers. As a server receives a data package, it decrypts the same operation at the same time, the adversary cannot the data package through its private key and sends the analyzewhoisthesenderandwhoistherecipientinagiven decrypted data package to the next server; each server carries round. Tis characteristic ensures strong anonymity of the out the decryption operation until the data package is sent users. to the recipient. Tor is secure as each server knows only Te primary contributions of the paper are listed as its predecessor and successor rather than other servers’ follows: position in the chain; therefore, the servers do not know the (1) Te user can send and receive the ciphertext during specifc path of a message’s transmission, which guarantees the same round, which means a user can be both a sender and the anonymity of both the sender and recipient. Tat is how a recipient at the same time. Tis can improve the efciency it ofers a practical scheme of identity protection. However, of anonymous communication system. some researchers showed that this scheme cannot defend (2) Our scheme allows the users to send or receive more against the trafc analysis attacks [16–18]. Tor’s hidden service than one valid “ciphertext” (because the message is extracted requires that every server must be honest; for instance, the afer decrypting the ciphertext) in each round. Tere is no security will be breached if the ”frst-server” and the ”last- limit to the number of communication ciphertexts. It is a huge server” make a collusion. Te second category is based on advantage compared with other anonymous communication theDC-net[19,20],forinstance,verdict[21]anddissent systems which are based on DC-net; this characteristic greatly [22]. Tose schemes work on an N-member group, who can improves the efciency of anonymous communication sys- communicate anonymously, and only one group member is tem and reduces the cost of communication. allowed to send a valid message in a given round. Each mem- (3) In the proposed scheme, some dishonest users are ber shares a value secretly with the other N-1 members, which tolerable in the communication, because a single user can means that each member has N-1 secret values that are shared successfully recover the message by himself instead of with the other N-1 members. Ten, each member performs through the cooperation of all the other users. XORs operations with the N-1 shared secret values, but the Section 3 introduces the basic concepts of bilinear pair- legitimate sender performs the XORs with the N-1 shared ing, IBE and anonymous IBE. Section 4 outlines the system secret values along with an additional message to generate architecture and our security goals, and Section 5 describes ciphertexts that are later broadcast to the other members. the specifc scheme and its two protocols. Section 6 presents Subsequently, each member receives all the ciphertexts and a security analysis of the proposed system. performs XORs operation on the N ciphertexts together to reveal the message. Because all the shared secret values had been XORed twice, they are cancelled out; thus, the message 2. Related Work is extracted but without leaking the identity of the sender. Anonymous communication system has a high-level demand In this way, the DC-net can prevent the trafc attack, as the for security; we may consider a mechanism for entering DC-net is built on �������� model that remains secure until into the anonymous communication system before starting at least one participating server remains honest. However, communication. Users can do anonymous authentication the recovery of messages must be computed by all the users, before joining the system [23]; only a legal user can be a which is unrealistic scenario as none of the users could be of- member of the system. Anonymous authentication can apply line during underlying process. Hence, the existing methods to many areas [24–27]. Besides, we can make an improvement are vulnerable to the internal dishonest member attacks of the storage pattern for users’ messages. We may combine which can easily break the security. Although, the systems oblivious RAM [28, 29] with our anonymous communication may trace the dishonest member, they cannot exclude the system; oblivious RAM hides the access patterns of data, infuence of dishonest member during the communication. and it can prevent adversary from speculating sensitive In summary, these two schemes above are of low-efciency information through users’ access patterns. In the end, we and the cost of communication is high. may use other cryptographical techniques [30–33] to enhance In this paper, we propose an efcient anonymous com- the security of our anonymous communication system. munication scheme that ofers higher security while dis- posing the aforementioned attack. Te proposed scheme incorporates anonymous identity-based encryption (IBE) to 3. Preliminaries achieve anonymous communication. In our scenario, more than one message can be sent simultaneously in each round. Tis section outlines the anonymous IBE scheme [34], Meanwhile, each user uses its ID as the public key; we the diference between ordinary IBE and anonymous IBE. assume that the ID of each user is unique and it must be Although the identity-based encryption was frstly proposed well known to other users. We set up a bulletin board from by Shamir [35], a practical IBE scheme was constructed where every user could upload or download the ciphertexts by Boneh and Franklin in 2001 [36], and afer that, many directly at specifc time during each round. Te recovery IBE schemes were proposed [37–40]. Te conception of Wireless Communications and Mobile Computing 3 anonymous identity-based encryption was frstly proposed anonymous communication system is no longer secure. An in [41]. Tis section also introduces the basic knowledge of anonymous IBE scheme must obey two properties: bilinear pairings, which are used to construct the scheme in (1) Te adversary cannot get any information about the the following sections. communication parties. (2) Te user’s identity cannot be unveiled by the cipher- � texts. 3.1. Bilinear Map. Let 1 be the additive cycle group gener- In this paper, we use the anonymous IBE scheme which ated by � and �2 be multiplicative cyclic group. Prime � is is based on bilinear map [34]. Let �1 and �2 be the group of the order of �1 and �2.Temap�:�1 ×�1 �→ � 2 is called order �;themap�:�1 ×�1 �→ � 2 is the bilinear map, and � bilinear map, if it satisfes the following properties: � �∈� ∗ � ∈� (1) �, � ∈ � ,�,�∈ � is the generator of group 1. � , 2 1 are randomly Bilinearity: For any 1 , the following � =�� �(��,��)=�(�,�)�� selected, and let 1 .Teschemeincludesfourfunctions formula holds: . as follows: (2) Nondegeneracy: Te map � does not map all the pairs �, � ,� � ×� � � Initialization: choose the public parameters 1 2 and in the 1 1 to the generator in 2.If is the generator of the master key of PKG denoted as �. �1,then�(�, �) is the generator of �2. �∈� (3) � � Private key generation: the PKG randomly selects �∗ Computability: For any and ,thereisanalgorithm and computes the private key for the corresponding recipient; that can compute �(�, �) efciently. here �� is the recipient’s ID and �� ∈ ��∗ .

�=(�,� ,� )=(�����⋅� ,��,���⋅� ) 3.2. IBE. In the IBE scheme, the participating parties include 1 2 3 2 1 (1) the users and the private key generator (PKG). Te identity �∈� of the user is considered as the public key that makes IBE Encryption: suppose a message 2 needs to be encrypted. �, � ∈ � diferent from the traditional public key cryptography. Te Te sender randomly selects �∗ and computes the PKG, which is a trusted third party, generates the private key ciphertext. based on its master key and the user’s identity. Subsequently, �=(�,� ,� ,� )=(�(�,� )� ⋅�,���(�+�) ,�� ,��) the private key is distributed to the corresponding user by the 1 2 3 4 1 2 1 1 (2) PKG. IBE is advantageous and is widely used for information Decryption: the recipient uses his own private key �= security protection. Firstly, the key management is easy and (�1,�2,�3) to decrypt the ciphertext to obtain the plaintext efcient, because it does not require distributing public key as follows. or revoking the key. Secondly, IBE removes the certifcate requirement for the public key of user who participates in �(� ,� ) �=� ⋅ 2 2 the communication. For instance, when Alice wants to send 1 (3) �(�1,�4)�(�3,�3) a message to Bob, she uses Bob’s public key ����� that is known to each user. Alice encrypts the message with Bob’s Suppose that an adversary wants to extract the recipient’s identity ����� rather than querying Bob’s public key from the identity �� from the ciphertext, it is obvious from the PKI. Tis characteristic highlights the purpose of anonymous structure of the ciphertext that the ID can only be extracted ��(�+�) communication. Suppose that Alice queries Bob’s public key through �2 =�1 by using �=(�1,�2,�3,�4).Although�1 from the PKI; the adversary can easily get information about is a public parameter, but the �, � are randomly selected from sender/recipient through the operation of querying, so the the ��∗ and it is impossible for the adversary to obtain these security goals are breached. Afer Bob gets the encrypted parameters. Meanwhile, ��(� + �) is the exponent of �2;the message, he decrypts the message with his private key that computation is complicated. he gets from the PKG. IBE consists of the following four functions [42]: Setup: inputting security parameter �,returningthe 4. Architecture of the System ������ ��� public parameter and the master key of the We describe the details of system entities and system archi- system. Te limited plaintext space �, the limited ciphertext � ������ ��� tecture in this section; furthermore, we present the security space ,andthe are public, and the master key goals of anonymous communication. is secretly kept by the PKG. Extract: inputting the master key ��� and a user’s identity 4.1. Entities. (i) Te users. Te users are an imperative part ��, generating the corresponding private key ���� for ��. Encryption: inputting the message �∈�and ��, of the system whose privacy must be assured, while the users returning the ciphertext �∈�. can communicate with any user in the system. Te system Decryption: inputting the ciphertext � and the corre- can satisfy two kinds of users. Te frst kind is those who just want to disclose some message anonymously; however, the sponding private key ���� for ��, returning the plaintext �. users of this kind do not want to disclose the identity of the sender even to the recipient, for instance, a journalist wants to 3.3. Anonymous IBE. In traditional IBE, since the recipient’s disclose a scandal about a politician who has participated in identity is used as the public key, this property may leak the presidential campaign. Te second kind of users want to the recipient’s identity through the adversary’s analysis of the communicate with someone secretly; the users of this kind ciphertexts. In other words, if the user’s identity is leaked, the do not want anyone to know who is communicating with 4 Wireless Communications and Mobile Computing them, or whether he is involved in this communication. For t1 t2 t3 t4 t5 ··· ··· t1 t2 t3 t4 t5 instance, two executives from a tendering company want to negotiate about the fnal bidding price though geographical Round 1 Round n diferences. Time (ii) Bulletin board. Te bulletin is provided to the users for uploading and downloading ciphertexts. More precisely, Figure 1: Scheduling of n rounds of communication. the sender uploads the encrypted message to the bulletin board, and the recipient downloads the message from the bulletin board. But as we mentioned above, the bulletin board is an intermediate source for communication, and there is one user can communicate with one or more users at a given no need for an interaction between users. Because there is round. no interaction between users, the adversary cannot know the identities of the communicating parties. 4.3. Security Goal. Te system ensures anonymous commu- (iii) Private key generator (PKG). In the system, the role nication in three aspects. of PKG is to generate private keys of users against their IDs (1) Te message’s security: Te contents of users’ message and we assume that PKG is honest. need to be protected, it is the basic requirement of a secure system. (2) 4.2. Architecture. Tis section describes the architecture of Te sender’s anonymity: Unauthorized users could the system. Each user has a unique identity. Te system not determine the identity of sender, so the adversary cannot consist of two components. judge which user sent a message in a given round. In other words, the identity of sender should not be leaked. (1) Te Private Key Protocol. Tis is a protocol for distributing (3) Te recipient’s anonymity: Te recipient’s anonymity the private keys among the users according to their identity; is to ensure that others cannot judge whether the message is this part works between the users and PKG. In order to received by a defnite recipient. Furthermore, the system is carry out the decryption operation later, the user should required to guarantee that the adversary cannot extract the obtain its private key before initiating a communication. Te recipient’s ID which is used during the process of encryption. PKG generates the private key through users’ identity and Similar to sender’s anonymity, the identity of recipient should its master key for the corresponding users. For instance, not be leaked. Alice’s identity is ������� and Bob’s identity is �����;thePKG Te sender’s anonymity and the recipient’s anonymity are generates private key ������� for Alice and ����� for Bob. Only the key security goals which are diferent from the traditional by obtaining the private key, the later protocol can proceed. communication system, as the system should not reveal any �������� All the users should get their private key in specifc time about users. �1. 5. Anonymous Communication via (2) Communicating Protocol. It is the most important part of the system and it contains four operations. Te user Anonymous IBE uses the recipient’s identity as the public key to encrypt the Te scheme consists of two major components: private key message. For instance, Alice wants to send a message to Bob; generation and anonymous communication. In the presented �� Alice uses Bob’s identity ��� to encrypt the message. Te solution, we construct a general scheme based on anonymous encryption operation should be done in specifc time �2.For IBE which is provided in Section 3.3. �1 and �2 are the groups consideration of senders’ anonymity, every user is required of order �,and� is the generator of group �1.Temap� is to send at least one message whether the user wants to have bilinear map which satisfes �1 ×�1 �→ � 2. �∈��∗ is the a communication or not. Afer the encryption operation, all master key of PKG, �2 ∈�1 are randomly selected, and let the ciphertexts should be uploaded to the bulletin board. Te � �1 =� . Tis section describes the construction of these two � uploading operation should be done in specifc time 3.In protocols. specifc time �4, each user downloads all the ciphertexts on the bulletin board. Since the users did not interact before and also the users do not know whether there is any message belonging 5.1. Private Key Generation Protocol. Te private keys are gen- to them in this round, all the messages in the bulletin board erated through the users’ ID by PKG, then PKG distributes must be downloaded by each user in order not to miss the those private keys to corresponding users. message. Te details of private key generation protocol are described in Algorithm 1, where PKG randomly chooses an All the users decrypt all the ciphertexts through its ∗ integer �∈Z� and uses � to compute the private key �.Afer private key in specifc time �5.Afertime�5,oneroundof communication during all the users is fnished, and then the computation, the PKG distributes the private key to the the next round of communication can start. Te subsequent corresponding users. communication rounds are scheduled as shown in Figure 1. In this paper, we have distinct advantage over the previous 5.2. Anonymous Communication Protocol. In our scheme, works. Te advantage is that there is no limit to the number the anonymous communication protocol contains four steps of messages sent by the user in each round, which means that as presented in Figure 2. Te steps include encryption, Wireless Communications and Mobile Computing 5

Upload Download

(m ) Eenid1 1 User 1 c1,...,cn User1...Usern · Een (m ) · · idi i Bulletin board · · ...... · Een (m ) User n idn n c ,...,c 1 n Dec (c ) Dec (c ) Dec (c ) sk1 1 ski 1 skn 1 · · · · Dec (c ) · ski i · · · Dec (c ) skn n Decsk (cn) Dec (c ) i skn n

Figure 2: Te four steps of anonymous communication protocol: A Encryption, B Upload, C Download, D Decryption.

Input: Auser’sidentity��, the public parameters �, �1,�2, and the master key �. Output: Te user’s private key �=(�1,�2,�3). Te PKG performs the followings: ∗ randomly chooses �∈Z�; � ��⋅� � ��⋅� computes the private key �=(�1,�2,�3)=(�2 �1 ,� ,� ); sends the private key to the corresponding users at specifc time �1;

Algorithm 1: Private key protocol.

(1) Encryption. If the sender just wants to disclose the message anonymously rather than anyone knows where is message � ��(�+�) � � from even the recipient, he computes the ciphertext �=(�1,�2,�3,�4) = (�(�1,�2) ⋅�,�1 ,�1,�); If the sender wants the ���� � recipient to know about its identity, the signature �������� should be attached with the message , compute the ciphertext �=(�,� ,� ,� ) = (�(� ,� )� ⋅ (�||���� ), ���(�+�),�� ,��) � 1 2 3 4 1 2 �������� 1 1 .TeEncryption operation should done in specifc time 2. (2) Upload. All the users uploads their ciphertexts � to the bulletin board at the specifc time �3.Eachuser’sciphertexts� are stored in the bulletin board afer this operation. (3) Download. Each involved users are required to download all the ciphertexts which stored in the bulletin board, each user does the Download operation at the specifc time �4. (4) Decryption. Te user decrypts the ciphertexts � which obtained from the bulletin board in Download operation, one by one through the private key � in specifc time �5. Te plaintext is computed as �=�1 ⋅�(�2,�2)/�(�1,�4)�(�3,�3).Ifthereisa ���� � �||���� signature �������� attached with the message , then the user can get ��������.

Algorithm 2: Anonymous communication protocol.

uploading, downloading, and decryption. Te protocol is user, it is permitted to upload more than one ciphertext at the performed collaboratively by the users and the bulletin board. same time. When a user wants to get a message, he needs to In the anonymous communication protocol, each user download all the ciphertexts from the bulletin board without � needs to encrypt his message to generate the ciphertext any interaction with the uploaders. Ten, the user just needs �=(�1,�2,�3,�4). He frst randomly selects �, � ∈ ��∗ . Here, to use his private key to decrypt the ciphertexts one by one. If � is the message which needs to be encrypted, �� is the the decryption is successfully performed, it means that the ���� recipient’s ID, �������� is the signature of sender’s identity. message belongs to the user. It should be noted that each As we mentioned in Section 4.1, our system can satisfy two user is required to download all the messages on the bulletin kinds of users. If the sender wants the recipient to know board. Since there are no interactions between the users and where the message is from, the signature can be attached the bulletin board before the communication, each user has together with the message �.Otherwise,thesignatureis no idea whether the message belongs to him in the process of not necessary to be transmitted. Subsequently, the encrypted a communication. In this way, all the users have to participate message � can be uploaded to the bulletin board. At this in the uploading and downloading operations, while the step, each user is required to upload at least one encrypted adversary does not know which parties are really involved message whether he wants to initiate a communication or not. in a given round. Te details of anonymous communication However, if a user wants to communicate with more than one protocol are shown in Algorithm 2. 6 Wireless Communications and Mobile Computing

6. Security Analysis 105 As we mentioned in Section 4.3, there are three aspects of security goals needed to be achieved: the message’s security, the sender’s anonymity, and the recipient’s anonymity, and we 4 will analyze the system security as follows. 10 Every message is encrypted before uploading and the encryption scheme we used can ensure the message’s security.

Te security of encryption scheme we used in our construc- (ms) Time tion had been proved in [34], this encryption scheme can 103 defend against an arbitrary CPA adversary while maintaining anonymity. Private Key Generation In traditional public key cryptography, there is usually Encryption Decryption a public key infrastructure (PKI), and the sender needs to 102 query the recipient’s public key before initiating a commu- 0 200 400 600 800 1000 nication. In this process, the user who does the operation of querying is likely to be the sender who wants to initiate Te total number of messages a communication, and the public key to be queried likely Figure 3: Computational consumption. belongs to the recipient. In our scheme, the sender no more needs to query the recipient’s public key (because the public key is recipient’s identity which is known to each user). On 105 the other hand, although the recipient’s identity is used as the Only send one message at a round public key, the anonymous IBE ensures that the adversary Our scheme: No limit for the number of messages cannot extract the recipient’s identity from the ciphertext. Since all the users perform the operation of uploading at time �3 and download the same amount of ciphertexts at time 104 �4, the adversary cannot know which user has the intention to participate in a communication through the operations of uploading and downloading. Obviously, our scheme can Time (ms) Time guarantee the anonymity of both the sender and recipient.

3 7. Evaluation 10 In this section, we evaluate the performance of our scheme, which has been implemented in JAVA with Java pairing- 0 100 200 300 400 500 600 700 800 900 1000 based cryptography library (JPBC). All experiments were Te total number of messages conducted on a PC with a CPU 2.13 GHz, 6 GB of RAM. In our implementation, a message’s length was set as 128 bytes, Figure 4: Communication consumption. and the time consumption of uploading and downloading was ignored.

7.1. Computational Consumption. We implemented our wants to send message. In the anonymous communication schemeinonemessageandexecuted1000rounds.Te system which limits the number of messages, users have computational consumption includes three operations: to wait for several rounds. But, in our scheme, all users private key generation, encryption, and decryption. Under cansendanarbitrarynumberofmessagesinaround. the fber confguration of 100 Mpbs, the cost of uploading and Tis property enhances the efciency of communication downloading is negligible. We calculate the time cost as in and reduces the cost of communication. Figure 4 shows 4 Figure 3. It takes approximately 3.6 × 10 ms to perform the the communication consumption of our scheme and the 5 private key generation operation for 1000 rounds, 1.12 × 10 anonymous communication system which limits the number ms to perform the encryption operation for 1000 rounds, of messages. 2 and 1.62 × 10 ms to perform the decryption operation for 1000 rounds. 8. Conclusions 7.2. Communication Consumption. Our scheme has no limit In this paper, we address a communication system which for the number of messages in a round, it is a signifcant aims to protect the users’ ��������.Tosolvethisproblem, advantage compared with other anonymous communication we propose an anonymous communication system based on systems which can send only one message in a round. Tere anonymous IBE. Our scheme has signifcant advantage in are common scenarios; for example, a user wants to commu- efciency compared with previous work and can also ofer nicate with more than one person, or more than one user strong anonymity. In the future, we will consider the user Wireless Communications and Mobile Computing 7 authentication and the application scenario in the smart convolutional neural network and principal component analy- environment. sis,” Computers, Materials & Continua,vol.53,no.3,pp.357–371, 2017. [10]J.Li,X.Chen,M.Li,J.Li,P.P.C.Lee,andW.Lou, Data Availability “Secure deduplication with efcient and reliable convergent key management,” IEEE Transactions on Parallel and Distributed Te library used to support the fndings of this study are Systems,vol.25,no.6,pp.1615–1625,2014. included within the article. [11] N. Borisov, G. Danezis, and I. Goldberg, “DP5: A Private Pres- ence Service,” Proceedings on Privacy Enhancing Technologies, Conflicts of Interest vol. 2015, no. 2, pp. 4–24, 2015. [12] D. Chaum, D. Das, F. Javani et al., “cMix: Mixing with Minimal Te authors declare that they have no conficts interests. Real-Time Asymmetric Cryptographic Operations,” in Applied Cryptography and Network Security, vol. 10355 of Lecture Notes in Computer Science, pp. 557–578, Springer International Pub- Authors’ Contributions lishing, Cham, 2017. Te frst author conducted the experiments and wrote the frst [13]A.Kwon,D.Lazar,S.Devadas,andB.Ford,“Rife:Anefcient draf of the paper. Te other coauthors helped in revising the communication system with strong anonymity,” Proceedings on paper and polished it. All authors read and approved the fnal Privacy Enhancing Technologies,vol.2016,no.2,pp.115–134, manuscript. 2016. [14]J.VanDenHoof,D.Lazar,M.Zaharia,andN.Zeldovich, “Vuvuzela: Scalable private messaging resistant to trafc anal- Acknowledgments ysis,” in Proceedings of the 25th ACM Symposium on Operating Systems Principles, SOSP 2015, pp. 137–152, USA, October 2015. Tis work was supported by National Natural Science Foun- [15]R.Dingledine,N.Mathewson,andP.Syverson,“Tor:Te dation of China (No. 61472091), Natural Science Foundation Second-Generation Onion Router,” Defense Technical Infor- of Guangdong Province for Distinguished Young Scholars mation Center, 2004. (2014A030306020), Guangzhou Scholars Project for Uni- [16] K. Bauer, D. McCoy, D. Grunwald, T. Kohno, and D. Sicker, versities of Guangzhou (No. 1201561613), and Science and “Low-resource routing attacks against Tor,” in Proceedings of Technology Planning Project of Guangdong Province, China the 6th ACM Workshop on Privacy in the Electronic Society, (2015B010129015). WPES’07, Held in Association with the 14th ACM Computer and Communications Security Conference,pp.11–20,USA,October 2007. References [17] S. Chakravarty, A. Stavrou, and A. D. Keromytis, “Identifying [1] J. Mayer, P.Mutchler, and J. C. Mitchell, “Evaluating the privacy proxy nodes in a tor anonymization circuit,” in Proceedings of properties of telephone metadata,” Proceedings of the National the 4th International Conference on Signal Image Technology Acadamy of Sciences of the United States of America, vol. 113, no. and Internet Based Systems, SITIS 2008, pp. 633–639, Indonesia, 20, pp. 5536–5541, 2016. December 2008. [2] Y.-A. de Montjoye, Computational PRIvacy: Towards PRIvacy- [18]N.Hopper,E.Y.Vasserman,andE.Chan-Tin,“Howmuch Conscientious Uses of Metadata,ProQuestLLC,AnnArbor,MI, anonymity does network latency leak?” ACM Transactions on 2015. Information and System Security,vol.13,no.2,2010. [19] D. Chaum, “Te dining cryptographers problem: unconditional [3] A. Rusbridger, “Te snowden leaks and the public,” New York sender and recipient untraceability,” Journal of Cryptology. Te Review of Books,vol.60,no.18,2013. Journal of the International Association for Cryptologic Research, [4] P. Li, J. Li, Z. Huang et al., “Multi-key privacy-preserving deep vol. 1, no. 1, pp. 65–75, 1988. learning in cloud computing,” Future Generation Computer [20] P. Golle and A. Juels, “Dining cryptographers revisited,” in Systems,vol.74,pp.76–85,2017. Advances in cryptology—EUROCRYPT 2004,vol.3027ofLec- [5]C.Gao,Q.Cheng,X.Li,andS.Xia,“Cloud-assistedprivacy- ture Notes in Comput. Sci., pp. 456–473, Springer, Berlin, 2004. preserving profle-matching scheme under multiple keys in [21]H.Corrigan-Gibbs,D.I.Wolinsky,andB.Ford,“Proactively mobile social network,” Cluster Computing,pp.1–9,2018. accountable anonymous messaging in verdict,” in Proceedings [6]P.Li,J.Li,Z.Huang,C.-Z.Gao,W.-B.Chen,andK.Chen, of the USENIX Security Symposium, pp. 147–162, 2013. “Privacy-preserving outsourced classifcation in cloud comput- [22]E.Syta,A.Johnson,H.Corrigan-Gibbs,S.Weng,D.Wolinsky, ing,” Cluster Computing,pp.1–10,2017. and B. Ford, “Security Analysis of Accountable Anonymous [7] J. Li, Y. Zhang, X. Chen, and Y. Xiang, “Secure attribute-based Group Communication in Dissent,” Defense Technical Infor- data sharing for resource-limited users in cloud computing,” mation Center, 2013. Computers & Security, vol. 72, pp. 1–12, 2018. [23]J.Shen,T.Zhou,X.Chen,J.Li,andW.Susilo,“Anonymous [8] Y. Li, G. Wang, L. Nie, Q. Wang, and W. Tan, “Distance and Traceable Group Data Sharing in Cloud Computing,” IEEE metric optimization driven convolutional neural network for Transactions on Information Forensics and Security,vol.13,no. age invariant face recognition,” Pattern Recognition,vol.75,pp. 4,pp.912–925,2018. 51–62, 2018. [24] J. Shen, C. Wang, T. Li, X. Chen, X. Huang, and Z.-H. Zhan, [9]C.Yuan,LiXinting,J.Q.M.Wu,j.Li,andX.Sun,“Fingerprint “Secure data uploading scheme for a smart home system,” liveness detection from diferent fngerprint materials using Information Sciences,vol.453,pp.186–197,2018. 8 Wireless Communications and Mobile Computing

[25]J.Xu,L.Wei,Y.Zhang,A.Wang,F.Zhou,andC.Gao, [42] D. Boneh and X. Boyen, “Secure identity based encryption “Dynamic Fully Homomorphic encryption-based Merkle Tree without random oracles,” Lecture Notes in Computer Science for lightweight streaming authenticated data structures,” Jour- (including subseries Lecture Notes in Artifcial Intelligence and nal of Network and Computer Applications,vol.107,pp.113–124, Lecture Notes in Bioinformatics): Preface,vol.3152,pp.443–459, 2018. 2004. [26]J.Shen,Z.Gui,S.Ji,J.Shen,H.Tan,andY.Tang,“Cloud- aided lightweight certifcateless authentication protocol with anonymity for wireless body area networks,” Journal of Network and Computer Applications,vol.106,pp.117–123,2018. [27]Y.Zhang,J.Li,D.Zheng,P.Li,andY.Tian,“Privacy-preserving communication and power injection over vehicle networks and 5 g smart grid slice,” Journal of Network and Computer Applications,vol.122,pp.50–60,2018. [28]B.Li,Y.Huang,Z.Liu,J.Li,Z.Tian,andS.-M.Yiu, “HybridORAM: Practical oblivious cloud storage with constant bandwidth,” Information Sciences,2018. [29]Z.Liu,Y.Huang,J.Li,X.Cheng,andC.Shen,“DivORAM: Towards a practical oblivious RAM with variable block size,” Information Sciences, vol. 447, pp. 1–11, 2018. [30] X. Zhang, Y. Tan, C. Liang, Y. Li, and J. Li, “A Covert Channel Over VoLTE via Adjusting Silence Periods,” IEEE Access,vol.6, pp. 9292–9302, 2018. [31] Q. Lin, H. Yan, Z. Huang, W. Chen, J. Shen, and Y. Tang, “An ID-based linearly homomorphic signature scheme and its application in blockchain,” IEEE Access,2018. [32] C.Gao,Q.Cheng,P.He,W.Susilo,andJ.Li,“Privacy-preserving Naive Bayes classifers secure against the substitution-then- comparison attack,” Information Sciences,vol.444,pp.72–88, 2018. [33] Y. Zhang, D. Zheng, and R. H. Deng, “Security and Privacy in Smart Health: Efcient Policy-Hiding Attribute-Based Access Control,” IEEE Internet of Tings Journal,vol.5,no.3,pp.2130– 2145, 2018. [34] B. Wang and X. Hong, “An anonymous signature scheme in the standard model,” Journal of Information Science and Engineering,vol.30,no.6,pp.2003–2017,2014. [35] A. Shamir, “Identity-based cryptosystems and signature schemes,” in Advances in Cryptology: Proceedings of (CRYPTO ’84), vol. 196 of Lecture Notes in Computer Science,pp.47–53, Springer, Berlin, Germany, 1985. [36] D. Boneh and M. Franklin, “Identity-based encryption from the Weil pairing,” SIAM Journal on Computing,vol.32,no.3,pp. 586–615, 2003. [37] B. Waters, “Efcient identity-based encryption without random oracles,” Advances in Cryptology – EUROCRYPT 2005,vol.3494, pp.114–127,2005. [38] D. Boneh and X. Boyen, “Secure identity based encryption without random oracles,” in Advances in cryptology—CRYPTO 2004,vol.3152ofLecture Notes in Comput. Sci.,pp.443–459, Springer, Berlin, 2004. [39] D. Boneh and X. Boyen, “Efcient selective identity-based encryption without random oracles,” Journal of Cryptology. Te Journal of the International Association for Cryptologic Research, vol.24,no.4,pp.659–693,2011. [40]J.Li,X.Chen,C.Jia,andW.Lou,“Identity-basedencryption with outsourced revocation in cloud computing,” IEEE Trans- actions on Computers,2013. [41] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Advances in Cryptology—EUROCRYPT 2004,vol.3027ofLecture Notes in Computer Science, pp. 506–522, Springer, Berlin, Germany, 2004. Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 5452463, 8 pages https://doi.org/10.1155/2018/5452463

Research Article Secure Storage and Retrieval of IoT Data Based on Private Information Retrieval

Khaled Riad 1,2 and Lishan Ke 3

1 School of Computer Science, Guangzhou University, Guangzhou 510006, China 2Mathematics Department, Faculty of Science, Zagazig University, Zagazig 44519, Egypt 3College of Mathematics and Information Science, Guangzhou University, Guangzhou 510006, China

Correspondence should be addressed to Khaled Riad; [email protected] and Lishan Ke; [email protected]

Received 25 May 2018; Accepted 29 August 2018; Published 18 November 2018

Guest Editor: Karl Andersson

Copyright © 2018 Khaled Riad and Lishan Ke. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Te fast growth of Internet-of-Tings (IoT) strategies has actually presented the generation of huge quantities of information. Tere should exist a method to conveniently gather, save, refne, and also provide such information. On the other hand, IoT data is sensitive and private information; it must not be available to potential attackers. We propose a robust scheme to guarantee both secure IoT data storage and retrieval from the untrusted cloud servers. Te proposed scheme is based on Private Information Retrieval (PIR). It stores the data onto diferent servers and retrieves the requested data slice without disclosing its identity. In our scheme, the information is encrypted before sending to the cloud servers. It is also divided into slices of a specifc size class. Te experimental analysis on many diferent confgurations supported efciency and the efcacy of the proposed scheme, which demonstrated compatibility and exceptional performance.

1. Introduction gains no information about �. A strategy with these properties is called a Private Information Retrieval (PIR) [1, 2]. With the huge revolution of Internet-of-Tings (IoT) and Within this paper we introduce encrypted PIR that ofers cloud computing as its storage environment, the user requests a great privacy. Tat is, unbounded servers should not obtain a query to a part of information and should receive that part any information about the requested piece of information. without disclosing its identity. A great number of researches One does require at least two servers to achieve privacy. Tese have been dedicated to defend the database from curious servers do not need to store the whole database; they could users. Tere are approaches that enable questions to be asked store portions of it. We show that when those pieces are by an individual into a database by reconstructing the worth encrypted, instead of duplicated, the memory overhead can of entities in a manner that prevents him. If the user would be decreased. like to maintain his privacy (in the information-theoretic sense), then he could request a copy of the entire database. 1.1. Motivation. In addition to reducing the storage overhead Tis can cause a huge communication overhead, making it caused by replicating the data to reduce the communication unacceptable. cost in the traditional PIR protocols [2] and achieving the Before going further let us make the problem more information-theoretic privacy, in big data, the user who tangible. Let a binary string �=�1,...,�� of length �. �≥2 reconstructs data is distinct from the user who distributes server stores copies of this binary string. Te user has some them. Also, the user who distributes data should encrypt it indicator �≤�and he is interested in getting this little ��. using diferent keys and distribute the ciphertext. To attain this aim, the little �� could be calculated, the user Moreover, querying information over big data where no queries each of the servers and receives responses. Te query one can get the identity of the parts you are querying or the to each server is distributed separately of � and each server responsesobtainedisabigchallengeandsoundslikescience 2 Wireless Communications and Mobile Computing fction. But it is actually the PIR science. In modern data Reconstructing a part of discerning information from an storage systems, data are usually stored at multiple storage encrypted source is determined by the availability of the nodes in the cloud. Te privacy of information retrieval whole source, which is introduced by Rivest [3], by proposing has to be shielded. One naive method to attain PIR is by all-or-nothing transform (AONT). In this scheme any modif- simply downloading each document in the system regardless cation on the encrypted message limits the ability to decrypt of the user requirements. Te drawback of that strategy is the resource. Tus, the AONT scheme works well for the the very large restoration cost, which further increases with scenarios where the user who wants to decrypt the resource � (therangeofsaveddocuments).Tus,thereisanurgent has never accessed the key before. Tis is not realistic, because requirement to present a strong PIR strategy for storage and the cloud users frequently access the resources and request to recovery. decrypt their ciphertext. In our scheme, the user can decrypt the resource as long as he has the appropriate key and the 1.2. Contributions. One of the main contributions in this sufcient data slices that can generate the requested resource. paper is integrating PIR [1] with cloud computing, to Moreover, the requesting user can only decrypt the cipher- guarantee the efciency and security of encrypted storage text if he has successfully generated the decryption token. and retrieval of information for big data queries from the Another direction for securely uploading the data to cloud is untrusted cloud servers. Our scheme frst divides the data introduced in [4], which ensures that the cloud validates the into slices, then encrypts each of them, and stores those data integrity while avoiding malicious home gateways that encrypted slices using the Swif service on the untrusted monitor and modify the data. In our scheme the data is not cloud servers. In the reconstruction stage, the requested data stored as one block, but it is sliced into several slices based slice is reconstructed without disclosing any information on its size. Ten, it is frst hidden using a permutation hash about that requested slice, with retrieval cost independent of function. Afer that, those data slices are encrypted using and the number of stored slices. Te main contributions can be encryption algorithm and an access structure that is implicitly summarized as follows: included in the ciphertext. In this manner we are reducing the storage time overhead. Te authors in [5] formalize the (i) We have integrated the PIR scheme with cloud com- notion of verifable database with incremental updates (Inc- puting, to secuerly store the personal information as VDB). As well as in [6] pointing out to Catalano-Fiore’s VDB well as efently and smothly reterive it. framework from vector commitment is vulnerable to the so- (ii) We have implemented our scheme on the top of our called forward automatic update (FAU) attack. private cloud environment, by initializing a set of For information hiding and securely extracting that infor- virtual instances that are divided into three categories mation, there are multiple contributions based on diferent based on their confguration and their role in the schemes, such as [7]. Te authors introduced a T-private PIR proposed scheme. which is a generalization of PIR to include the requirement (iii) We have tested the proposed scheme under two difer- that even if any T of the N databases colludes, the identity of ent scenarios, client situation and overlay situation. the retrieved message remains completely unknown to them. But in our scheme whatever was the number of colluding (iv) Te experimental results have suggested that the databases, the user will only receive the encrypted data slices decryption and retrieval cost are separate from the that are only sufcient for extracting the required information amount of saved slices and harmonious with all rea- not more. Also, the authors in [8] utilized a new crypto- sonable scenarios for applying our proposed scheme. graphic primitive, called conditional disclosure of secrets, Ultimately, the throughput for a workload caused by which we believe that it may be a useful building block for the implementation of mixing get and also put fragment the design of other cryptographic protocols. In our scheme, requests on Swif is sensible and accepted by the operator and we have considered slicing the data into multiple slices before user. encrypting and storing it on the cloud storage servers. Te authors in [9] proposed a new symmetric encryption for 1.3. Organization. Te rest of this paper is organized as mobile devices. Te authors in [10] have introduced a new follows: Section 2 presents the related work and smooth operative scheme called RoughDorid for detecting malware comparison between the currently proposed information applications directly on the smartphone. A Dynamic Fully hiding and extraction approaches while not disclosing the Homomorphic Encryption-based Merkle Tree is proposed in queried information. Section 3 presents our proposed �- [11]. An outsourced revocation has been introduced in [12] private and �-out-of-� PIR scheme and its three stages. Te based on identity-based encryption in cloud computing. A proposed scheme implementation is introduced in Section 4. new privacy preserving response scheme has been proposed Te comprehensive performance analysis is presented in in [13] based on adaptive key evolution in smart grid. A novel Section 5. Tis is followed by the conclusion in Section 6. dynamic structure for cloud data has been proposed in [14] for efcient public auditing. 2. Related Work and Discussion Other strategies for applying access control in the cloud via encryption are developed together two research lines: Te notion of earning the extraction of this data content attribute-based encryption (ABE) and discerning encryption of a huge data source should consider quite rough security procedures. ABE approaches (e.g., [15–18]) supply access requirements to be able to do not disclose the queried data. control authorities by making sure that the key used to protect Wireless Communications and Mobile Computing 3

Storage Access Manager Manager

k-answers +DTK Encrypted k-queries Data Slices

Answers+DTK

Key Key Query

Data PIR Owner Protocol Figure 1: Integrated PIR with cloud computing system model. a source could be derived exclusively by the consumers that 3. PIR with Cloud Computing meet a specifed condition in their characteristics (e.g., era, function). An efcient and secure data outsourcing with Te challenge yet is the way to look for an efcient and pro- check-ability has been proposed in [19] for cloud computing tected PIR scheme (regarding costs for information storage based on ABE. Also, a novel lightweight encryption mech- and recovery). Our proposed scheme is �-private and �-out- anism for database is introduced in [20]. Te authors in of-� PIR, which means that if only � of � servers is required [21] introduced a new identity-based signcryption on lattice to respond and even though � servers collude together, without trapdoor. For the multiple sources, the authors in the queried information will not be revealed. Our scheme [22] proposed a new homomorphic signature scheme based consists of three basic stages: Data Storage and Encryption, on network coding and applied it to IoT. Te privacy is User Authorization and Query,andData Decryption and preserved in IoT using centralized duplicate removal video Reconstruction. We have employed the basic PIR scheme storage system [23]. Also, a new identity-based antiquantum defnition [36]. blind authentication for privacy preserving in wireless sensor networks has been proposed in [24]. For the blind storage, 3.1. Data Storage and Encryption. Te data D will be the authors in [25] have introduced efcient multikeyword stored on � cloud server (SRV1, SRV2,...,SRV�, ranked search for mobile cloud data. In [26] the authors ...,SRV�,) afer encrypting it with our encryption algo- proposed the personalized search in mobile clouds over rithm. We consider three types of servers in the data storage encrypted data with efcient updates. Reference [27] pro- side, as shown in our system model Figure 1. Te Storage Man- posed a new block design-based key agreement for data sharing in cloud computing. ager Sever receives the encrypted data slices and distributes Procedures based on discerning encryption (e.g., [28, them on the Data Storage Servers under its own control, and 29]) suppose to encrypt every single source using a secret the Key-Server is responsible for the key management that is that only licensed users understand or may derive. Within distributed using the PIR Protocol. this situation, the information owner either manages policy Te owner encrypts all the information bits using a upgrades, with overhead, or is assigned to the server. Tough content key using symmetric encryption procedures. Te overencryption is shown to provide functionality that is content essential is then encrypted by the owner. It requires as decent and promises a prompt authorities of policy upgrades, inputs the public key handled by the PIR Protocol,thecontent it needs trust premises that are more powerful on the key ��, and an access structure A =(�,�).Let� be a �×� machine, which has to ofer support. Also, another set of matrix, where � denotes the total number of all the attributes. contributions [30, 31] have implemented access control for Te function � associates rows of � to attributes. a private cloud computing evironment, by proposing the confdence notation for denying or granting access. In the 1 1 1 1 �1 �2 ⋅⋅⋅ �� ⋅⋅⋅ �� �1 [ ] event the host is oblivious of its adoption our scheme may [ ] be used. Te authors in [32] proposed a fexible EHR sharing [ ] [�2 �2 ⋅⋅⋅ �2 ⋅⋅⋅ �2] [�2] [ 1 2 � �] scheme supporting ofine encryption of EHR and outsourced [ ] [ ] [ . ] [ . . . . ] decryption of EHR ciphertexts in mobile cloud computing. [ . ] [ . . . . ] [ . ] [ . .⋅⋅⋅. ⋅⋅⋅ . ] Reference [33] proposed the frst lattice-based linearly homo- D = [ ] = [ ] [ ] [ � � � � ] morphic signature in the standard model, which settles this [��] [� � ⋅⋅⋅ � ⋅⋅⋅ � ] [ ] [ 1 2 � �] open problem. Te authors in [34] proposed a dependable [ . ] [ ] [ . ] [ . . . . ] distributed WSN framework for SHM. Ten the authors in [ . ] [ . .⋅⋅⋅. . ⋅⋅⋅ . ] [35] proposed a new biometrics-based authentication scheme [ ] [��] � � � � for the multiserver environment. [�1 �2 ⋅⋅⋅ �� ⋅⋅⋅ ��] 4 Wireless Communications and Mobile Computing

�ℎ(1) �ℎ(1) ⋅⋅⋅ �ℎ(1) ⋅⋅⋅ �ℎ(1) [ 1 2 � � ] user has to issue a query to the PIR protocol including [ ℎ(2) ℎ(2) ℎ(2) ℎ(2)] the user’s secret key, attributes, and certifcates. Te PIR [� � ⋅⋅⋅ � ⋅⋅⋅ � ] [ 1 2 � � ] protocol will issue k-queries (one for each server) Q = [ ] [ . . . . ] {�1,�2,...,��,...,��} for the access manager server, where [ . . ⋅⋅⋅ . ⋅⋅⋅ . ] [ ] �� = Q�(�, �, �) and � is a randomly chosen by fipping = [ ] [�ℎ(�) �ℎ(�) ⋅⋅⋅ �ℎ(�) ⋅⋅⋅ �ℎ(�)] coins. Each server will respond with a single encrypted [ 1 2 � � ] [ ] answer; the user will have an encrypted answer set �� = [ . . . . ] [ . . ⋅⋅⋅ . ⋅⋅⋅ . ] {��1,��2,...,���,...,���},where��� =���(�,�,�,��).Te [ . ] access manager server will reply with the servers’ answers and �ℎ(�) �ℎ(�) ⋅⋅⋅ �ℎ(�) ⋅⋅⋅ �ℎ(�) [ 1 2 � � ] the decryption token that is sent to the user. Ten, it will have the ability to decrypt and rebuild the requested information. =[�1 �2 ⋅⋅⋅ �� ⋅⋅⋅ ��] (1) 3.2.1. Decryption Token Generation. Te decryption token generation algorithm (Algorithm 2) is run by the access where each element of the vector [�1,�2,...,��,...,��] manager server. It requires as inputs the ciphertext �� represents the data to be stored on each server that implicitly includes an access structure A, user’s public (SRV1, SRV2,...,SRV�,...,SRV�,) respectively. key ��� generated by the PIR protocol, user’s secret key Each �� is the corresponding column afer applying ℎ(�) on ���,anduser’ssetofattributes���.When��� suits the each piece of data. ℎ(�) is our hash permutation function accessibility construction A, the algorithm could successfully that is used to hide the index of each piece of the stored data. calculate the right decryption token ��� of this ciphertext. More precisely, a hash function ℎ maps bit strings of arbitrary Tus, the PIR protocol can transform the user’s query to a set fnite length to strings of fxed length, say � bits. For a domain of k-queries for the Access Manager. � and range � with ℎ:��→�and |�| > |�|,thefunctionis many-to-one, implying that the existence of collisions (pairs 3.3. Data Decryption and Reconstruction. Once the user has of inputs with identical output) is unavoidable. received the encrypted answers set �� and its decryption token ���, he can decrypt the answers with the help of its Defnition 1 (Hash Permutation Function ℎ). A hash function own secret key ��� and get the answer set � that will be used is a function ℎ:��→�and |�| > |�| maps bit strings of for reconstruction. Since the permutation function can make arbitrary fnite length to strings of fxed length, which has the some collusions (��) probability �, the success probability is � following properties: 1−(1−�) , and the reconstruction threshold �=�−��. Based on the value of � theuserwillbeabletoreconstruct (1) Compression: ℎ maps an input �� of arbitrary fnite bit the requested slice. Once the data has been reconstructed, the length, to an output ℎ(��) of fxed bit length �. user can use its given decryption token to decrypt the data. (2) Ease of computation: given ℎ and an input ��, ℎ(��) is easy to compute. 4. Implementation (3) Preimage resistance: for essentially all prespecifed outputs, it is computationally infeasible to fnd any Our model is implemented on the top of our private cloud environment, which is based OpenStack [37]. We have input which hashes to that output, i.e., to fnd any built our proposed scheme by initializing a set of virtual preimage �� such that ℎ(��)=�when given any � for which a corresponding input is not known. instances that are divided into three categories based on their confguration: (4) 2nd-preimage resistance: it is computationally infea- sibletofndanysecondinputwhichhasthesame (i) Category 1: � virtual instances working as storage � output as any specifed input, i.e., given ��, to fnd a servers. Te confguration of those servers is 2nd-preimage �� =�̸ � such that ℎ(��)=ℎ(��). 4VCPUS,4GBRAM,and80GBdisk.Wehave considered �=50; thus, the IP addresses for (5) Collision resistance: it is computationally infeasible to 10.0.10.101 10.0.10.150 � ,� those servers are : with sub- fnd any two distinct inputs � � whichhashtothe 255.255.255.0 ℎ(� )=ℎ(�) net mask .Tosestorageservershave same output, i.e., such that � � . two basic tasks, storing the encrypted data slices We can fnd that the collision resistance implies 2nd- submitted to them through the Storage Manager. preimage resistance of hash functions, but collision resistance Te second task is sending the encrypted data slices does not guarantee preimage resistance. back as answers to the Access Manager server to be delivered to the requesting user. 3.2. User Authorization and Query. Te user can issue a query (ii) Category 2: two virtual instances. Te confguration Q to receive a fle � which is partitioned and stored on � for each of them is 4 VCPUS, 8 GB RAM, and 20 GB diferent server (SRV1, SRV2,...,SRV�,...,SRV�,). disk. Tey are working as PIR Protocol server with Considering that the information is hosted on both cloud IP address 10.0.10.151 and Key Manger server with servers. Ten there has to be a decryption token for every IP address 10.0.10.152.Aferthe�−�������are user to have the ability to synchronize the information. Te received by the PIR Protocol from the Access Manager Wireless Communications and Mobile Computing 5

Input: (i) ��� ⊳ Te content key for each �� (ii) �� ⊳ Te data D public key managed by Shamir secret sharing (iii) A =(�,�) ⊳Te data D access structure, which will be implicitly included in the ciphertext ∗ (1) Choose ��,��,�� ∈ Z� ⊳ Te random secrets of each data slice �� ∗ (2) Choose �∈Z� ⊳ Arandomencryptionexponent �→ � ∗� (3) Choose V =[�,�2,...,��] ∈ Z� ⊳ A random vector, where [�2,...,��] are used to share the random encryption exponent � (4) for �=1to � do �→ (5) Compute �� = V .�� ⊳�� is the vector corresponding to the j-th row of the matrix � (6) end for ∗ (7) Randomly choose �1,...,�� ∈ Z� � � �� �/� (8) Compute: � =� and � =� � (9) for �=1to � do �→ −�� ��� V (�)�(�) �� (10) Compute �� =� .((� �(�(�)) ) ��/�� −(��/��)�� (11) Compute �1,� =� and �2,� =� (12) end for (13) Compute the ciphertext: � � [ �� � �� ] ��� = ���.(∏�̂ (�, �) ) ,� ,� ,��,�1,�,�2,�,�(�) [ � ] Output: Te ciphertext ��

Algorithm 1: Encryption.

server, the user must not be given so much data slices 5.1. Client Situation. Our scheme requires the user to per- to guarantee the data secrecy. Also, the user should form a more intricate decryption compared to using Attribute not be given less data slices than the required slices Encryption Scheme (AES) using a conventional encryption that can generate the requested data. Tus, the PIR manner, by introducing the Token Generation Algorithm 2. Protocol server runs the PIR protocol to ensure hiding In our scheme the decryption is parallelized on several core the requested data slices identity from the cloud VCPUs, which makes the user processing much more efec- storage servers and also ensuring granting the user tive. In our experiments, we have considered fve diferent theappropriatedataslicestobeabletorecoverthe size categories: 32:127 bits; 128:511 bits; 512:2047 bits; 2048:8191 requested data. Te Key Manager server is responsible bits; and 8192:32768 bits. for assigning the keys for both the data owners and Figure 2 shows that the decryption cost can be used with requesting users. all acceptable scenarios for the use of our scheme. Specifcally, (iii) Category 3: two virtual instances. Te confguration the fgure illustrates the throughput obtained by changing for each of them is 8 VCPUS, 16 GB RAM, and 40 GB the number of slices, through implementing our scheme in disk. Tey are working as Storage Manager server with various confgurations defned by fve size groups (32:127 IP address 10.0.10.154 and Access Manager server bits, 128:511 bits, 512:2047 bits, 2048:8191 bits, and 8192:32768 with IP address 10.0.10.155.TeStorage Manager bits). Based on the execution of our encryption protocol receives the encrypted data slices and distributes them (Algorithm 1), we notice that even the largest size category on the appropriate cloud storage instances. Te Access (8192:32768 bits) ofers a throughput that is approximately Manager receives the encrypted data slices from the 85 MB/s, while considering 16 slices of that size category. cloud storage instances and sends them to the PIR Te throughput for the smallest size category (32:127 bits) Protocol. is about 140 MB/s, while considering 16 slices of that size category. Te fgure also reveals that decreasing the amount It should be mentioned that all of those virtual instances are of slices, we achieve the performance level that is 1.5 times cooperating together to construct the proposed scheme. Each that obtained from the 8192:32768 bits size group and 2 times of them has its own task that can perfectly execute it. the one obtained by 32:127 bits size group. It should be noted that the experimental results for that part are the average of 5. Performance Analysis 25 trails. Te performance analysis of our proposed scheme is intro- 5.2. Overlay Situation. In our proposed scheme, the overlay duced based on two mandatory scenarios. situation is analyzed by using the Swif (it organizes objects 6 Wireless Communications and Mobile Computing

Input: (i) ��� ⊳ Te ciphertext related to the fle � (ii) ��� ⊳ Te user’s public key given by PIR protocol (iii) ��� ⊳ Te user’s secret key (iv) �� ⊳ Te user’s set of attributes (1) Let: ��� =|A|⊳Te set of attributes involved in ��� ∗ (2) Choose a set of constants �� ∈ Z� , ∀� ∈ ��� (3) for �=1to ��� do (4) if �� ∈�ℎ���(�)then ⊳�� are valid shares of the secret � according to the data D ��� (5) Reconstruct the encryption exponent: �=∑�=1 ���� (6) end if (7) end for ∗ (8) Let {��,� ,��,� ,��,� ∈ Z� } are random constants ∀�∈��� (9) if �� ⊨ A then ⊳ Te user’s attributes satisfes A −1 ��� �(�̂ �,� ).�(�̂ ��,� ) ��� = ∏ �,� �,� �� ����� �=1 � ∏�=1 [�(�̂ �,����).�(�̂ 1,�,��,�(�) ).�(�̂ 2,�,��,� )]

�̂ (�, �)�.�.�.��� .∏��� �̂ (�, �)(��/���)� = �=1 �.�.�.��� ��� �̂ (�, �) .∑�=1 ����

��� = ∏�̂ (�, �)(��/���)� �=1 (10) else ⊳���cannot successfully computed (11) ��� = rand(���) ⊳ ��� will get a random Hexadecimal value (12) end if Output: Te Decryption Token ���

Algorithm 2: Token generation.

350 With our experiments at the overlay scenario, we have assem- bled a Swif user program in Python that implements the get 300 and also put fragment techniques that describe our strategy. We have followed two strategies: one is by fragmenting 250 an object as atomic objects; the second is by using DLO introduced by Swif. Figure 3 contrasts the period required for the imple- 200 mentation of get requests based on diferent amounts of contemplated fragments (1,4,16,64,256,and1024)to a 150 specifc object. Te operation is droved dependent on the Troughput (MB/S) Troughput network bandwidth and also the overhead imposed by the 100 management of every get request. Specifcally for get requests, the overhead introduced 50 into handling one portion for every fragment predominates 2 4 6 8 10121416 in the event of little costs, whereas the growth in object’s Number of Slices size exhausts the system’s bandwidth that causes a great bottleneck. By contemplating an item of size 1 GB, the time 32:127 512:2047 8192:32768 required for implementing the get requests is roughly 92 128:511 2048:8191 seconds for 1 considered fragment, and the time is 1000 Figure 2: Troughput varying the number of considered slices for seconds for 1024 considered fragments for the same object four diferent size categories. size. In case of considering an object of size 64 KB, the time required for executing the get requests is about 0.08 seconds for 1 considered fragment, and the time is 50 seconds for 1024 considered fragments for the same object size. within containers) service as a reference. We have adopted Figure 4 reports the throughput for a workload caused by the DLO support ofered by Swif to implement the get the implementation of blending get and also put fragment and put fragment methods that characterize our scheme. requests on Swif by changing the object size and the amount Wireless Communications and Mobile Computing 7

1000 6. Conclusion We presented a robust PIR scheme for efciently and securely 800 storing and retrieving private information from untrusted cloud servers. Our scheme lets the data owners to efectively divide and encrypt their own data into small slices of fve 600 diferent size categories. Our implementation and exper- imental analysis confrm the efcacy and efcacy of our

Time (S) Time 400 proposed scheme, which appreciates orders of magnitude of improvement in throughput concerning source protection and decryption. For an object of size 1 GB, the time required 200 for implementing the get requests is roughly 92 seconds for 1 considered fragment; the time is 1000 seconds for 1024 con- 0 sidered fragments for the same object size. Considering an get 64KB 256KB 1MB 4MB 16MB 64MB 256MB 1GB object of size 64 KB, the time required for executing the requests is about 0.08 seconds for 1 considered fragment, Object Size (KB/MB/GB) and the time is 50 seconds for 1024 considered fragments 1 64 1024 for the same object size. Te proposed scheme also supports 4 256 2048 its compatibility with all present cloud storage environments, 16 which makes it also relevant to a lot of application domains. Figure 3: Te execution time for the get requests on Swif. Data Availability

1.2 Te data used to support the fndings of this study are available from the authors upon request. 1.0 Conflicts of Interest 0.8 Te authors declare that there are no conficts of interest regarding the publication of this paper. 0.6 References 0.4 Troughput (MB/S) Troughput [1] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private 0.2 information retrieval,” in Proceedings of the 36th Annual Foun- dations of Computer Science, pp. 41–50, Los Alamitos, Calif, USA, October 1995. 0.0 [2] B. Chor and N. Gilboa, “Computationally private information 64KB 256KB 1MB 4MB 16MB 64MB 256MB 1GB retrieval (extended abstract),”in Proceedings of the 29th Annual Object Size (KB/MB/GB) ACM Symposium on Teory of Computing (STOC ’97),pp.304– 1 64 1024 313, New York, NY, USA, May 1997. 4 256 2048 [3] R. L. Rivest, “All-or-nothing encryption and the package trans- 16 form,” in Fast Sofware Encryption,vol.1267ofLecture Notes Figure 4: Te throughput for a workload introduced by combining in Computer Science, pp. 210–218, Springer Berlin Heidelberg, the get and put fragment requests on Swif. 1997. [4]J.Shen,C.Wang,T.Li,X.Chen,X.Huang,andZ.-H.Zhan, “Secure data uploading scheme for a smart home system,” Information Sciences,vol.453,pp.186–197,2018. of contemplated fragments for the exact same object. We have [5]X.Chen,J.Li,J.Weng,J.Ma,andW.Lou,“Verifablecomputa- tion over large database with incremental updates,” Institute of assessed the behavior of our scheme based on a selection of Electrical and Electronics Engineers. Transactions on Computers, 2048 objects, where following every put fragment request, a vol.65,no.10,pp.3184–3195,2016. get succession of 100 requests were implemented on objects [6]X.Chen,J.Li,X.Huang,J.Ma,andW.Lou,“NewPublicly in precisely the exact same group and are all of the exact Verifable Databases with Efcient Updates,” IEEE Transactions same size. Tese confgurations using fragments’ throughput on Dependable and Secure Computing,vol.12,no.5,pp.546– will be orders of magnitude greater already. Te fgure also 556, 2015. demonstrates that the very best number of fragments is [7] H. Sun and S. A. Jafar, “Te capacity of robust private infor- dependent upon the resource size. Te identifcation of this mation retrieval with colluding databases,” Institute of Electrical value needs to think about the setup of the workload and this and Electronics Engineers Transactions on Information Teory, machine. vol.64,no.4,part1,pp.2361–2370,2018. 8 Wireless Communications and Mobile Computing

[8] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, “Protecting [24] H. Zhu, Y. Tan, L. Zhu, X. Wang, Q. Zhang, and Y. Li, “An data privacy in private information retrieval schemes,” Journal identity-based anti-quantum privacy-preserving blind authen- of Computer and System Sciences,vol.60,no.3,pp.592–629, tication in wireless sensor networks,” Sensors,vol.18,no.5,2018. 2000. [25]H.Li,D.Liu,Y.,T.H.Luan,andX.S.Shen,“Enabling [9]C.Gao,S.Lv,Y.Wei,Z.Wang,Z.Liu,andX.Cheng,“M-SSE: efcient multi-keyword ranked search over encrypted mobile an efective searchable symmetric encryption with enhanced cloud data through blind storage,” IEEE Transactions on Emerg- security for mobile devices,” IEEE Access,vol.6,pp.38860– ing Topics in Computing,vol.3,no.1,pp.127–139,2015. 38869, 2018. [26]H.Li,D.Liu,Y.Dai,T.H.Luan,andS.Yu,“Personalizedsearch [10] K. Riad and L. Ke, “Operative scheme for functional android over encrypted data with efcient and secure updates in mobile malware detection,” Security and Communication Networks. clouds,” IEEE Transactions on Emerging Topics in Computing, [11]J.Xu,L.Wei,Y.Zhang,A.Wang,F.Zhou,andC.Gao, vol.6,no.1,pp.97–109,2018. “Dynamic Fully Homomorphic encryption-based Merkle Tree [27]J.Shen,T.Zhou,D.He,Y.Zhang,X.Sun,andY.Xiang, for lightweight streaming authenticated data structures,” Jour- “Block design-based key agreement for group data sharing in nal of Network and Computer Applications,vol.107,pp.113–124, cloud computing,” IEEE Transactions on Dependable and Secure 2018. Computing,2018. [12]J.Li,X.Chen,C.Jia,andW.Lou,“Identity-basedencryption [28] S. D. C. Di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, with outsourced revocation in cloud computing,” IEEE Trans- and P. Samarati, “Encryption policies for regulating access actions on Computers,vol.64,no.2,pp.425–437,2015. to outsourced data,” ACM Transactions on Database Systems [13]H.Li,X.Lin,H.Yang,X.Liang,R.Lu,andX.Shen,“EPPDR: (TODS),vol.35,no.2,2010. an efcient privacy-preserving demand response scheme with [29] I. Hang, F. Kerschbaum, and E. Damiani, “ENKI: Access control adaptive key evolution in smart grid,” IEEE Transactions on for encrypted query processing,” in Proceedings of the ACM Parallel and Distributed Systems,vol.25,no.8,pp.2053–2064, SIGMOD International Conference on Management of Data 2014. (SIGMOD ’15),pp.183–196,NewYork,NY,USA,June2015. [14]J.Shen,J.Shen,X.Chen,X.Huang,andW.Susilo,“Anefcient [30] K. Riad, “Blacklisting and forgiving coarse-grained access public auditing protocol with novel dynamic structure for cloud control for cloud computing,” International Journal of Security data,” IEEE Transactions on Information Forensics and Security, and Its Applications,vol.10,no.11,pp.187–200,2016. vol. 12, no. 10, pp. 2402–2415, 2017. [31] K. Riad and Z. Yan, “Multi-factor synthesis decision-making [15]V.Goyal,O.Pandey,A.Sahai,andB.Waters,“Attribute- for trust-based access control on cloud,” International Journal of based encryption for fne-grained access control of encrypted Cooperative Information Systems,vol.26,no.04,pp.1–33,2017. data,” in Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS ’06),pp.89–98,November [32]Z.Cai,H.Yan,P.Li,Z.-A.Huang,andC.Gao,“Towardssecure 2006. and fexible EHR sharing in mobile health cloud under static assumptions,” Cluster Computing,vol.20,no.3,pp.2415–2422, [16] K. Riad, “Multi-authority trust access control for cloud storage,” 2017. in Proceedings of the 4th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS ’16),pp.429–433, [33] W. Chen, H. Lei, and K. Qi, “Lattice-based linearly homomor- August 2016. phic signatures in the standard model,” Teoretical Computer Science,vol.634,pp.47–54,2016. [17] J. Hur and D. K. Noh, “Attribute-based access control with efcient revocation in data outsourcing systems,” IEEE Trans- [34] M. Z. A. Bhuiyan, G. Wang, J. Wu, J. Cao, X. Liu, and T. actions on Parallel and Distributed Systems,vol.22,no.7,pp. Wang, “Dependable structural health monitoring using wireless 1214–1221, 2011. sensor networks,” IEEE Transactions on Dependable and Secure Computing,vol.14,no.4,pp.363–376,2017. [18] K. Riad, “Revocation basis and proofs access control for cloud storage multi-authority systems,” in Proceedings of the 3rd [35] H. Shen, C. Gao, D. He, and L. Wu, “New biometrics-based International Conference on Artifcial Intelligence and Pattern authentication scheme for multi-server environment in critical Recognition (AIPR ’16), pp. 118–127, September 2016. systems,” Journal of Ambient Intelligence and Humanized Com- [19]J.Li,X.Y.Huang,J.W.Li,X.F.Chen,andY.Xiang,“Securely puting,vol.6,no.6,pp.825–834,2015. outsourcing attribute-based encryption with checkability,” IEEE [36] A. Fazeli, A. Vardy, and E. Yaakobi, “PIR with Low Storage Over- Transactions on Parallel and Distributed Systems,vol.25,no.8, head: Coding Instead of Replication,”2015, https://arxiv.org/abs/ pp. 2201–2210, 2014. 1505.06241. [20]J.Li,Z.Liu,X.Chen,F.Xhafa,X.Tan,andD.S.Wong,“L- [37] OpenStack, http://www.openstack.org/. EncDB: A lightweight framework for privacy-preserving data queries in cloud computing,” Knowledge-Based Systems,vol.79, pp. 18–26, 2015. [21] X. W. Y. Zhang, H. Zhu, and L. Jiang, “An identity-based signcryption on lattice without trapdoor,” Journal of Universal Computer Science,2018. [22] T. Li, W. Chen, Y. Tang, and H. Yan, “A homomorphic network coding signature scheme for multiple sources and its application in IoT,” Security and Communication Networks,vol.2018,Article ID 9641273, 6 pages, 2018. [23]H.Yan,X.Li,Y.Wang,andC.Jia,“Centralizedduplicate removal video storage system with privacy preservation in IoT,” Sensors,vol.18,no.6,2018. Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 7584942, 8 pages https://doi.org/10.1155/2018/7584942

Research Article Multiresolution Face Recognition through Virtual Faces Generation Using a Single Image for One Person

Hae-Min Moon,1 Min-Gu Kim,2 Ju-Hyun Shin ,3 and Sung Bum Pan 4

 Management & Planning Division, Korea Invention Promotion Association, -, Docheun-dong, Gwangsan-gu, Gwangju, Republic of Korea Department of Control and Instrumentation Engineering, Chosun University,  Seosuk-dong, Dong-gu, Gwangju , Republic of Korea Department of ICT Convergence, Chosun University,  Seosuk-dong, Dong-gu, Gwangju , Republic of Korea Department of Electronics Engineering, Chosun University,  Seosuk-dong, Dong-gu, Gwangju , Republic of Korea

Correspondence should be addressed to Sung Bum Pan; [email protected]

Received 2 July 2018; Revised 27 September 2018; Accepted 23 October 2018; Published 11 November 2018

Academic Editor: Ilsun You

Copyright © 2018 Hae-Min Moon et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In recent years, various studies have been conducted to provide a real-time service based on face recognition in Internet of things environments such as in a smart home environment. In particular, face recognition in a network-based surveillance camera environment can signifcantly change the performance or utilization of face recognition technology because the size of image information to be transmitted varies depending on the communication capabilities. In this paper, we propose a multiresolution face recognition method that uses virtual facial images by distance as learning to solve the problem of low recognition rate caused by communication, camera, and distance change. Face images for each virtual distance are generated through clarity and image degradation for each resolution, using a single high-resolution face image. Te proposed method achieved a performance that was 5.9% more accurate than methods using MPCA and SVM, when LDA and the Euclidean distance were employed for a DB that was confgured using faces that were acquired from the real environments of fve diferent streets.

1. Introduction an increase in the use of image monitoring systems based on such cameras to improve face recognition performance. IoT is an intelligent system that helps communicate with In face recognition systems, faces are recognized using high- people and things or between things and things using resolution face images that are acquired at close range, and Internet networks. With the recent advancement in various these serve as an important factor that can guarantee an efec- technologies such as high-speed networks, large-capacity tive recognition performance. However, aside from access data transmission, wired-wireless sensor networks, among control systems, it is difcult for face recognition systems to others, not only IoT but also related application technologies acquire high-resolution face images at close range because have been actively developed. With the recent developments the distance between the camera and the person varies. in hardware technology, a video surveillance system using Images that are captured through the camera inevitably sufer a high-resolution camera is commonly used. However, in in terms of the face resolution if the distance between the the case of an intelligent surveillance system, there is a camera and the person becomes too wide, regardless of the limit to data transmission due to high computational real- cameras performance. In other words, even if high-resolution time analysis. Terefore, thanks to the spread of high-speed cameras are employed in face recognition systems, the face communication technology, such as 5G, the high-resolution- recognition performance cannot be guaranteed for long range based surveillance camera has been applied to remote face low-resolution images [1]. recognition technology. Along with the recent increase in the Many studies exist regarding various face recognition availability of high-resolution cameras, there has also been methods, but issues still remain, such as the recovery 2 Wireless Communications and Mobile Computing

Surveillance Environment Networks Infrastructure Client Interface PoE Switch Cloud IP Camera

System Server Router Figure 1: Structure of network-based surveillance system combined with IoT technology. processing speed for low-resolution face images, reliance on technology applicable to various resolutions. Figure 1 shows learning data, and reduced recognition rates according to the structure of a network-based surveillance system using changes in distance that occur during actual face recognition intelligent surveillance system technology combined with IoT situations. Although some studies are limited to a single in the network environment. First, the images obtained from environment, such as high or low-resolution face environ- the camera are transmitted to the network-based surveillance ments, no existing face recognition studies are based on system for event analysis. Te surveillance system decides if environments that include changes in resolution. In other an event occurs by comparing the image received from the words, there is a need for face recognition technology based camera with the image stored in the server. Finally, when an on multiresolution that is robust to changes in resolution event occurs, an alarm is sent to the user, which enables them resulting from camera performance or changes in distance to recognize the situation, and the event image is stored on [2]. Tis paper will propose a face recognition system that the network server. employs face recognition learning according to virtual resolu- tions in order to resolve the issue of reduced recognition rates .. Video Surveillance System. Te video surveillance system resulting from changes in face image resolution that occur refers to a method of transmitting the image information through changes in distance. Face images for each virtual captured by the camera to a receiver and observing a specifc distance are generated through clarity and image degradation area that the user wants to be based on the transmitted image for each resolution, using a single high-resolution face image. information. Te video surveillance system is widely used Clarity is measured using blur metrics, and Gaussian blurring for various applications in the felds of security, industry, is applied for image degradation. Te performance of the vehicle surveillance, and trafc management. Recently, the proposed method is analyzed through a face DB that was availability of the system has been growing sharply due to confgured using fve diferent real resolutions. According to the development of communication technologies such as IoT the test results, the proposed method achieved a performance technology, 4G, and 5G. In addition, the advancement of IT that was 5.9% better than methods using MPCA and SVM, technology has transformed an analog environment based on when LDA and the Euclidean distance were employed. When VCR storage devices into a digital age that uses DVR-based the standard image size was set to the average size of all image compression and digital transmission technology. It face images at 30×30, under the conditions using LDA has evolved into an intelligent video security technology that and the Euclidean distance, the average performance was combines IP-based networks using broadband networks and improved by 40.7% for 16×16, 12×12, and 10×10, which are open protocols with automatic image analysis and recogni- low-resolutions. tion technology. CCTV cameras can be divided into dome, Tis paper is organized as follows. Section 2 explains box, IR, PTZ, and IP cameras depending on their use and related works with structure of network-based surveillance shape. Considering camera performance is as important as system. Section 3 explains face recognition using the pro- selecting the right camera in a video surveillance system, posed virtual multiresolution face images, Section 4 analyzes that is, the use of the surveillance system is determined the experimental results, and Section 5 presents conclu- depending on the quality of the image obtained through sions. the camera, high-resolution and high-quality images contain considerable information, which is a very important factor in 2. Related Work intelligent surveillance systems based on image processing. Although an image is captured from the same location, In an IoT environment, the network camera-based facial in the case of the HD-class image, the information of the recognition depends heavily on the camera because the object can be intuitively confrmed without additional image recognition performance varies depending on the obtained processing, yet in the case of normal image quality, it is ofen image quality. Recently, UHD-class high-resolution cameras difcult for us to directly understand the information of the have become commonplace thanks to the rapid advance- object. Tat is, the higher the image resolution is, the clearer ments in camera performance. However, in the case of and more detailed image we can see. However, the price is network camera-based surveillance systems, there is a limit more expensive depending on the camera performance, and to its practical use due to the limit of transmission speed. the size of the image data becomes larger, resulting in more Terefore, in this paper, we propose a facial recognition processing time. Wireless Communications and Mobile Computing 3

Training Testing

input face image input face images (high resolution face) (real multi-resolution faces)

generating training faces scale to reference face size

illumination normalization image degradation (histogram equalization)

feature extraction illumination normalization (LDA) (histogram equalization)

feature extraction DB (LDA)

similarity measurement (Euclidean distance)

classifcation

Figure 2: Flowchart of the proposed model for multiresolution face recognition.

.. Low-Resolution Based on Face Recognition Technology. confguring data for testing and identifying the most similar Recently, studies have been conducted regarding face recog- faces through training data and increasing the amount of nition technology using low-resolution images [3]. Tis face training data. Te method of confguring testing data and recognition method recognizes faces by using super reso- identifying the most similar faces through learning uses lution recovery, which converts low-resolution images into face images that are extracted from various real resolutions. high-resolution images [4–6]. Te super resolution recovery However, because this method acquires face images when the method achieves a superior high-resolution recovery per- user is moving, it requires serious cooperation from users. formance from the visual end, but it is unable to achieve a Tis paper proposes a method that generates face images at satisfactory performance when it comes to face recognition. various resolutions without requiring such cooperation from Because this method relies on learning data, it has the users. disadvantage of requiring high quantities of learning data, Figure 2 illustrates the proposed face recognition method, and processing times increase as the amount of data increases. where faces are recognized in multiresolution environments Tere also exist studies on a method that exploits structural that were acquired by generating multiple low-resolution face characteristics through the mapping and learning of a pair images using a single high-resolution face image. Virtual of low- and high-resolution images generated from the same low-resolution face images are scaled to a set reference face [3, 7, 8]. Tis method using structural characteristics is face size. Bilinear interpolation is used for scaling, and advantageous to the super resolution recovery method, owing histogram smoothing is applied to adjust the lightning. Faces to its lower calculational complexity. However, the recogni- for testing were not generated randomly, but rather fve tion rate varies signifcantly according to undefned values real resolution images were employed. At this point, image such as changes in lighting, distances, face expressions, other degradation was performed on the testing data by using face external variables, and mapping related weighted values. Face size normalization and imposing an image blur value. Real recognition using a Pan-Tilt-Zoom(PTZ) camera boasts an multiresolution faces for testing were acquired while varying extremely high face recognition rate because it can solve the the distance between the camera and the person between 1 m fundamental issue of image deterioration from long range and 5 m. In order to recognize faces, LDA was used to extract imaging [9, 10]. However, because the camera used in this features from the training and testing data, and the Euclidean method is extremely expensive, it cannot be employed for distance was employed to measure the similarity [11, 12]. general purposes. .. Generating Training Faces Using a Single Image per Person. 3. Proposed Algorithms for In order to minimize the required calculations and use the Multiresolution Face Recognition least amount of samples, this paper proposes a method that employs a single high-resolution face image to automatically Among face recognition methods that depend on learn- generate multiple face images for learning. Figure 3 illustrates ing, methods of improving the recognition rate include the process of generating face images according to virtual 4 Wireless Communications and Mobile Computing

face image size at each real resolution. Assuming that there HR face image are fve face images for the training data for each candidate, input HR face image Γ� =[��,��1,��2,��3,��4] is the group that is confgured through the learning data for each candidate. Figure 4 presents the results of comparing face images downscale to multiple face sizes from each distance with virtual multiresolution face images. Figure 4(a) shows the high-resolution face image that was extracted from a distance of 1m, and Figure 4(b) shows the 50x50 25x25 16x16 12x12 10x10 virtual face image that was generated using the proposed method. Figure 4(d) shows the face image for each resolution that was extracted from a real distance, and Figure 4(c) shows the image from Figure 4(d) afer being normalized to ft the upscale to reference face size standard image size 30×30. Te test results indicate that the virtual face image that was automatically generated was more similar to the real face image than the image produced from  blur measurement > th asetdistanceof1m.

low pass flter .. Image Degradation and Reference Face Size. Existing low- resolution face recognition technology employs a method that recovers face images to high-resolution to improve the output face images face recognition rate. In this paper, the defnition of a high- resolution face image is reduced through low pass fltering, which is the opposite of the existing method. Te process of      1 2 3 4 increasing the similarity through lowering the defnition not Figure 3: Flowchart of the proposed model for generating training only requires less calculation than existing high-resolution faces using image degradation. recovery technology but also does not require a reference image. In order to minimize the diferences between the multiresolution training face image that is automatically resolutions, using one high-resolution face image. One high- generated and the real low-resolution resting face image, the resolution face image is input and is then reduced to ft the clarity of the two images was considered. standard size according to the resolution. Te size of the Te blur value of the image that is derived through the original image that is reduced for each resolution is based on input, such as in (2), is measured and compared to the a real face image, which was detected as the distance between standard blur value for each resolution. If the blur value of � the person and the camera varied in 1 m intervals. Te size of the current input image is greater than the standard value �ℎ, the original image in which faces can be detected is 320×240, then the defnition is reduced through low pass fltering. Tis and the distance between the person and the camera varies process is performed until the value becomes lower than the from 1 m to 5 m. Images that are reduced to the appropriate standard blur value. size according to the resolution are then expanded to be the {� ⊗��� (� )≥� same size as the reference image for face recognition. Te � �� ���� �� �ℎ ��� = { (2) � �� � (� )<� defnition of an expanded face image is reduced through low { �� ���� �� �ℎ pass fltering until it reaches the target clarity. Equation (1) below defnes the method for generating virtual face images where ����� represents the clarity measurement method, and using high-resolution face images. the blur metric method [13], which does not employ a reference image, is used. � represents the blur kernel. Te � =� (�� ,�� ), �� ��� � ���� �� weighted value is generated through (3) and used to reduce (1) � the defnition of the generated image. ��� =(���,��)⊗� 1 2 2 2 � �(�,�)= �−(� +� )/2� where �� isthefaceimageattheith virtual size that is used 2��2 (3) for learning, and ������ is the expansion ratio for scaling to the standard image size for acquiring features. Te complexity of the algorithm is generally important when Furthermore, ��� represents the high-resolution image considering the processing speed of an image processing that is reduced to ft the face size according to each resolution. technique, but this speed is also infuenced by the size of At this point, � indicates the blur kernel, ⊗ represents the the processing image resolution. Face recognition techniques convolution operation, and ���� � represents the scaling. Te can be performed smoothly at 400,000 pixels, but once an high-resolution �� is 50×50, while the low-resolution ��4 is image exceeds 1 million pixels real-time processing cannot be 10×10. Finally, ��� represents the scaling ratio for generating guaranteed. Te proposed multiresolution face recognition face images according to each resolution. Te reduction ratio method adopts the average sizes of the high-resolution and for face images at each resolution is acquired from the average low-resolution images as the standard for the images that Wireless Communications and Mobile Computing 5

(a)

(b)

(c)

(d)

HR(close distance) LR(far distance) Figure 4: Comparison of real multiresolution images with proposed virtual multiresolution images. areinputforfacerecognition,insteadofusingthesizeofa illumination value from 0 to 255 as the index and accumulates high-resolution face image as the standard, as is the case in the frequency in the corresponding index table according to existing techniques. In other words, the sizes of all face images the illumination value of each pixel that confgures the image. used for training and testing are scaled to the same standard Histogram interpolation is a method that alters an images size.Teaveragesizeofthefacesusedforfacerecognitionis histogram such that it appears evenly across the entire gray calculated through the following. scale area. Tis method does not merely expand the images histogram, but rather changes the histogram distribution �� ����ℎ��� +�� ����ℎ��� ������ = (4) based on the characteristics of the original images histogram. 2 Te function that changes the images pixel values is defned in (5), where r is the inputted gray scale and s is the gray where ������ is the standard size that all face images are scale value that is outputted through the conversion function normalized to, �� ����ℎ��� is the width of the highest face T. In general, the conversion function T is assumed to be a image resolution, and �� ����ℎ��� is the width of the lowest resolution. monotone increasing function, and histogram interpolation can also be described in the form of a monotone increasing function. .. Illumination Normalization. In this paper, histogram interpolation was used to normalize the brightness of faces. A �=�(�) (5) histogram constitutes a method of assessing the distribution of the illumination, which is an important element of data for Based on the above, the histograms of the inputted and an image. Histogram confguration is a method that sets the outputted images can be represented through ��(�) and ��(�), 6 Wireless Communications and Mobile Computing

�=(�) �=(�) respectively, as probability density functions [14, 15]. Te �� �×� and �� �×�, their similarity can be conversion function of histogram interpolation is defned in calculated using the following. a form that responds to the accumulated value of ��(�),and � � � is expressed through (6), where is a dummy variable for 2 the integration. Te function that is defned by integrating � (�, �) = ∑√∑ (��� −��� ) (10) over the probability density function is called the cumulative �=0 �=0 distribution function. Afer converting a random verifcation image �� into a � feature matrix ��, the similarity with the feature matrix of �=�(�) =∫ �� (�) �� (6) all the training data is measured. Te feature matrix with 0 the least similarity is determined as shown in (11), and the class inside this feature matrix �� is determined as the fnal .. Feature Extraction. PCA has several limitations that class of the verifcation image. Te fnal face recognition is are inherent from its nature. Among others, the biggest determined by the proportion of verifcation images �� that is that it is inefective at separating objects, although it is accurately include the class compared to the total number of good at summing up data. Discrimination between objects verifcation images T,asisshownin(12),whereN is the total is important, because face recognition aims to distinguish number of learning images. objects. Terefore, we need to know whether changes in a min (� (��,��)), �=1,2,3...,� (11) face image occur because the object changes or because of changes in the illumination or face expression. LDA separates � ����������� = � × 100 (12) diferent components into groups, and discriminates between � % changesinfacecomponentsandchangesresultingfromother factors [10]. LDA consists of a linear transformation, which 4. Experimental Results maximizes the ratio of the between-class scatter matrix to the within-class scatter matrix and reduces the dimension In traditional face recognition experiments, a face DB such of a specifc vector for the data. Te between-class scatter as the Yale DB [16], MIT Face DB [17], or FERET DB [18] matrix SB and within-class scatter matrix SW can be written is normally used. It is found that existing face DBs include as follows: factors such as lighting, face twisting, and changes in face expression as external changes, but they do not consider � � changes in face image resolution according to changes in �� = ∑ (�� −�)(�� −�) (7) distance. However, there is a diference between real low- �=1 resolution face images that are extracted from long distances

� �� and low-resolution face images that have been reduced to the � � � same size using a high-resolution image. In other words, the �� = ∑ ∑ (�� −��)(�� −��) (8) �=1 �=1 high-resolution face images from existing DBs are used afer being temporarily reduced, but such images are not suitable for analyzing the performance of face recognition from real � where C is the number of objects, � is the mean image for multiresolutions. Te ETRI face DB that was used in this � � each object, and is the mean image of all images. � is paper is composed of face images from 10 candidates from � thenumberoftheimageofobjectin sequence. Terefore, a u-Robot test bed environment [19]. Table 1 illustrates the � � � where � becomes the maximum and � becomes the composition of the ETRI face DB used in this experiment. minimum can be described as below. Images for each candidate include changes in lighting and If eigenvector and eigenvalue calculated from (7) and distance. (8) are applied to (9), optimal eigenvector can be calculated Changes in lighting are achieved by using indoor lighting, that shows the level of group’s discrimination. Tis process and the distance varies from 1 m to 5 m. Te size of the is repeated when a new image is entered to calculate weight original face image was confgured to 50×50, 25×25, 16×16, and face recognition is performed by comparing the weight 12×12, and 10×10 using a high-resolution image. In this paper, of images in the database and that of a new image. LDA face recognition was performed using a 1:N search method, is widely used for research related to face recognition as it instead of a 1:1 authentication method, and the face image clearly discriminates the features of each group. that was most similar out of all of those stored in the DB was categorized through the image verifcation results. � � � ��� � � In this experiment, the DB was confgured by directly �= � � � argmax � � � (9) extracting face areas, assuming that all faces would be � �� ���� detected from the input images regardless of the distance. If faces are manually extracted, then face areas can be extracted .. Similarity Measurement. In order to verify the recog- with higher precision than in an automatic face detection nition rates for long range face recognition systems, the method. In addition, the original image was employed as similarities between feature matrices were compared in order it was, without considering any twisting or turning in the to calculate the recognition rates. For two feature matrices extracted face images. Wireless Communications and Mobile Computing 7

100 90 80 70 60 50 40 30 20

FACE RECOGNITON RATE (%) RATE RECOGNITON FACE 10 50∗50 25∗25 16∗16 12∗12 10∗10 RESOLUTION (PIXEL) average CMs PCA+L2(50∗50) 38.2 80.1 PCA+SVM(30∗30) LDA+SVM(30∗30) 39.6 80.9 CLPMs LDA+L2(50∗50) 69.4 80.6 MPCA+SVM(30∗30) LDA+L2(30∗30) 49.7 86.8 Figure 5: Comparison of recognition accuracies for diferent single sample face recognition method in ETRI DB.

Table 1: ETRI face DB. average face recognition performance was at an excellent level of 40.7%, even at low-resolutions of 16×16, 12×12, and 10×10. Description Summary When the face recognition was performed using a single (i) 1m 5m distance change Environment of sample in multiresolutions with face images generated for (ii) face position change Acquisition each virtual resolution, LDA performed better than PCA. (iii) face expression change During the process of extracting features, using the average Acquisition Method (i) frame division through video size of all resolutions performed better for improving the Resolution (ii) face position change general recognition rate than using the face size based on ∼ Distance of Acquisition (i) 1m 5m(1m intervals) a high-resolution as the standard. Moreover, on account of Total Candidate Count (i) 10 people issues occurring when the number of pixels in an image is (i) 1m: 30 images(50×50) higher than the number of images, LDA achieved a better Number of Faces per (ii) 2m: 30 images(25×25) performance than existing techniques such as CLPMs and Person (iii) 3m: 30 images(16×16) MPCA. (Average Face Size) (iv) 4m: 30 images(12×12) (v) 5m: 30 images(10×10) 5. Conclusions In camera-based face recognition, various resolutions can Te performance of the proposed method was analyzed exist, in terms of face images that are acquired in conditions bycomparingitwithPCA,LDA,andMPCA[20],aswell from close range high-resolution images to long range low- as CMs [5] and CLPMs [6], which recognize low-resolution resolution images. If high-resolution images are converted faces by considering structural features. Te experiments into low-resolution images for face recognition, then the face were all conducted under the same conditions, and only recognition performance may sufer on account of a loss of one original face data image was used in training for each data and a diference between the resolution of the face image candidate. Tere were 30 verifcation images for each dis- that was used for training and that of the verifcation image. tance, giving a total of 150 images, and for all candidates Tispaperhasproposedamethodthatrecognizesfacesby there were a total of 1,500 verifcation images, where learning generating face images for each virtual resolution in order images were not included in all verifcation images. Figure 5 to resolve the issue of reduced recognition rates resulting presents the face recognition rate change when virtual face from changes in resolution caused by changes in the distance images for training. In the experiment results, if a single between the camera and the subject. Te proposed method sample was used then all high-resolution face recognition uses one high-resolution image to automatically generate methods achieved an excellent performance. However, as face images for each resolution to use for training. Te face the resolution decreased there was a signifcant diference image size is normalized to the average size of all faces and in the performances. In terms of the average recognition used for face recognition. Te proposed method achieved the rate for all resolutions, the proposed method, which used highest performance of the tested methods, with an average LDA, achieved the highest performance, at 86.8%. When resolution of 86.8%, when LDA and the Euclidean distance the size of the standard image was based on the average were used. Moreover, when the standard image size was set resolution (30×30) instead of high-resolution (50×50), the to 30×30, which is the average image size, the performance 8 Wireless Communications and Mobile Computing

improved by an average resolution of 37.1% compared with camera system,” IEEE Transactions on Information Forensics and when the standard image size was set to 50×50, under the Security, vol. 8, no. 10, pp. 1665–1677, 2013. same experimental conditions. [11] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigen- faces vs. fsherfaces: recognition using class specifc linear Data Availability projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.19,no.7,pp.711–720,1997. Te ETRI DB data used to support the fndings of this [12]R.C.GonzalezandR.E.Woods,Digital Image Processing, study have not been made available due to Electronics and Prentice Hall, 3rd edition, 2008. Telecommunications Research Institute under license and so [13] F.Crete, T. Dolmiere, P.Ladret, and M. Nicolas, “Te blur efect: cannot be open to the public. perception and estimation with a new no-reference perceptual blur metric,” in Proceedings of the Human Vision and Electronic Conflicts of Interest Imaging XII,vol.6492ofProceedings of SPIE,SanJose,Calif, USA, February 2007. Te authors declare that there are no conficts of interest [14] E. Parzen, “On estimation of a probability density function and regarding the publication of this paper. mode,” Annals of Mathematical Statistics,vol.33,pp.1065–1076, 1962. Acknowledgments [15] V.A. Epanechnikov, “Non-parametric estimation of a multivari- ate probability density,” eory of Probability & Its Applications, Tis work was supported by the National Research Founda- vol. 14, pp. 153–158, 1969. tion of Korea (NRF) grant funded by the Korea government [16] A. S. Georghiades, P.N. Belhumeur, and D. J. Kriegman, “From (no. 2018R1A2B6001984) and supported by Basic Science few to many: illumination cone models for face recognition Research Program through the National Research Founda- under variable lighting and pose,” IEEE Transactions on Pattern tion of Korea (NRF) funded by the Ministry of Education (no. Analysis and Machine Intelligence,vol.23,no.6,pp.643–660, 2017R1A6A1A03015496). 2001. [17]B.Weyrauch,B.Heisele,J.Huang,andV.Blanz,“Component- based face recognition with 3D morphable models,” in Pro- References ceedings of the   IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPRW [1] H.-M.Moon,C.H.Seo,andS.B.Pan,“Alow-costmediaquality  ,Washington,Wash,USA,July2004. enhancement resolution up-conversion for mobile cloud,” e Journal of Supercomputing, vol. 73, no. 7, pp. 3098–3111, 2017. [18] P. Jonathon Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “Te FERET evaluation methodology for face-recognition algo- [2] W.W.ZouandP.C.Yuen,“Verylowresolutionfacerecognition rithms,” IEEE Transactions on Pattern Analysis and Machine problem,” IEEE Transactions on Image Processing,vol.21,no.1, Intelligence,vol.22,no.10,pp.1090–1104,2000. pp. 327–340, 2012. [19] D. Kim, J. Lee, H.-S. Yoon, and E.-Y. Cha, “A non-cooperative [3]Z.Wang,Z.Miao,Q.M.JonathanWu,Y.Wan,andZ.Tang, user authentication system in robot environments,” IEEE Trans- “Low-resolution face recognition: a review,” e Visual Com- actions on Consumer Electronics,vol.53,no.2,pp.804–811,2007. puter,vol.30,no.4,pp.359–386,2014. [4]W.T.Freeman,T.R.Jones,andE.C.Pasztor,“Example-based [20]H.Lu,K.N.Plataniotis,andA.N.Venetsanopoulos,“MPCA: super-resolution,” IEEE Computer Graphics and Applications, multilinear principal component analysis of tensor objects,” vol. 22, no. 2, pp. 56–65, 2002. IEEE Transactions on Neural Networks and Learning Systems, vol.19,no.1,pp.18–39,2008. [5] S. Baker and T. Kanade, “Hallucinating faces,” in Proceedings of the th IEEE International Conference on Automatic Face and Gesture Recognition, FG  , pp. 83–88, Grenoble, France, March 2000. [6]J.Liu,J.Qiao,X.Wang,andY.Li,“Facehallucinationbased on independent component analysis,” in Proceedings of the  IEEE International Symposium on Circuits and Systems, ISCAS  , pp. 3242–3245, Seattle, Wash, USA, May 2008. [7] B. Li, H. Chang, S. Shan, and X. Chen, “Coupled metric learning for face recognition with degraded images,” in ACML  : Advances in Machine Learning,vol.5828ofLecture Notes in Computer Science, pp. 220–233, 2009. [8] B. Li, H. Chang, S. Shan, and X. Chen, “Low-Resolution Face Recognition via Coupled Locality Preserving Mappings,” IEEE Signal Processing Letters,vol.17,no.1,pp.20–23,2010. [9] P. Salvagnini, L. Bazzani, M. Cristani, and V. Murino, “Person re-identifcation with a PTZ camera: An introductory study,” in Proceedings of the    th IEEE International Conference on Image Processing, ICIP  ,pp.3552–3556,Melbourne, Australia, September 2013. [10]U.Park,H.-C.Choi,A.K.Jain,andS.-W.Lee,“Facetracking and recognition at a distance: a coaxial and concentric PTZ