Advances in Computer Science: an International Journal

Vol. 4, Issue 4, July 2015

© ACSIJ PUBLICATION www.ACSIJ.org

ISSN : 2322-5157

ACSIJ Reviewers Committee 2015

- Prof. José Santos Reyes, Faculty of Computer Science, University of A Coruña, Spain
- Dr. Dariusz Jacek Jakóbczak, Technical University of Koszalin, Poland
- Dr. Artis Mednis, Cyber-Physical Systems Laboratory, Institute of Electronics and Computer Science, Latvia
- Dr. Heinz Dobler, University of Applied Sciences Upper Austria, Austria
- Dr. Ahlem Nabli, Faculty of Sciences of Sfax, Tunisia
- Prof. Zhong Ji, School of Electronic Information Engineering, Tianjin University, Tianjin, China
- Prof. Noura Aknin, Abdelmalek Essaadi University, Morocco
- Dr. Qiang Zhu, Geosciences Dept., Stony Brook University, United States
- Dr. Urmila Shrawankar, G. H. Raisoni College of Engineering, Nagpur, India
- Dr. Uchechukwu Awada, Network and Cloud Laboratory, School of Computer Science and Technology, Dalian University of Technology, China
- Dr. Seyyed Hossein Erfani, Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
- Dr. Nazir Ahmad Suhail, School of Computer Science and Information Technology, Kampala University, Uganda
- Dr. Fateme Ghomanjani, Department of Mathematics, Ferdowsi University of Mashhad, Iran
- Dr. Islam Abdul-Azeem Fouad, Biomedical Technology Department, College of Applied Medical Sciences, Salman Bin Abdul-Aziz University, K.S.A.
- Dr. Zaki Brahmi, Department of Computer Science, University of Sousse, Tunisia
- Dr. Mohammad Abu Omar, Information Systems, Limkokwing University of Creative Technology, Malaysia
- Dr. Kishori Mohan Konwar, Department of Microbiology and Immunology, University of British Columbia, Canada
- Dr. S. Senthilkumar, School of Computing Science and Engineering, VIT University, India
- Dr. Elham Andaroodi, School of Architecture, University of Tehran, Iran
- Dr. Shervan Fekri Ershad, Artificial Intelligence, Amin University of Isfahan, Iran
- Dr. G. Umarani Srikanth, S.A. Engineering College, Anna University, Chennai, India
- Dr. Senlin Liang, Department of Computer Science, Stony Brook University, USA
- Dr. Ehsan Mohebi, Department of Science, Information Technology and Engineering, University of Ballarat, Australia
- Dr. Mehdi Bahrami, EECS Department, University of California, Merced, USA
- Dr. Sandeep Reddivari, Department of Computer Science and Engineering, Mississippi State University, USA
- Dr. Chaker Bechir Jebari, Computer Science and Information Technology, College of Science, University of Tunis, Tunisia
- Dr. Javed Anjum Sheikh, Assistant Professor and Associate Director, Faculty of Computing and IT, University of Gujrat, Pakistan
- Dr. H. Anandakumar, PSG College of Technology (Anna University of Technology), India
- Dr. Ajit Kumar Shrivastava, TRUBA Institute of Engg. & I.T., Bhopal, RGPV University, India

ACSIJ Published Papers are Indexed By:

- Google Scholar
- EZB, Electronic Journals Library (University Library of Regensburg, Germany)
- DOAJ, Directory of Open Access Journals
- Bielefeld University Library - BASE (Germany)
- Academia.edu (San Francisco, CA)
- Research Bible (Tokyo, Japan)
- Academic Journals Database
- Technical University of Applied Sciences (TH Wildau, Germany)
- AcademicKeys
- WorldCat (OCLC)
- TIB - German National Library of Science and Technology
- The University of Hong Kong Libraries
- Science Gate
- OAJI, Open Academic Journals Index (Russian Federation)
- Harvester Systems, University of Ruhuna
- J. Paul Leonard Library, San Francisco State University
- OALib, Open Access Library
- Université Joseph Fourier, France
- CIVILICA (Iran)
- CiteSeerX, Pennsylvania State University (United States)
- The Collection of Computer Science Bibliographies (Germany)
- Indiana University (Indiana, United States)
- Tsinghua University Library (Beijing, China)
- Cite Factor
- OAA, Open Access Articles (Singapore)
- Index Copernicus International (Poland)
- Scribd
- QOAM, Radboud University Nijmegen (Nijmegen, Netherlands)
- Bibliothekssystem Universität Hamburg
- The National Science Library, Chinese Academy of Sciences (NSLC)
- Universia Holding (Spain)
- Technical University of Denmark (Denmark)

TABLE OF CONTENTS

A Survey on the Privacy Preserving Algorithm and techniques of Association Rule Mining – (pg 1-6) Maryam Fouladfar, Mohammad Naderi Dehkordi

« ACASYA »: a knowledge-based system for aid in the storage, classification, assessment and generation of accident scenarios. Application to the safety of rail transport systems – (pg 7-13) Dr. Habib HADJ-MABROUK, Dr. Hinda MEJRI

Overview of routing algorithms in WBAN – (pg 14-20) Maryam Asgari, Mehdi Sayemir, Mohammad Shahverdy

An Efficient Blind Signature Scheme based on Error Correcting Codes – (pg 21-26) Junyao Ye, Fang Ren, Dong Zheng, Kefei Chen

Multi-lingual and -modal Applications in the Semantic Web: the example of Ambient Assisted Living – (pg 27-36) Dimitra Anastasiou

An Empirical Method to Derive Principles, Categories, and Evaluation Criteria of Differentiated Services in an Enterprise – (pg 37-45) Vikas S Shah

A comparative study and classification on web service security testing approaches – (pg 46-50) Azadeh Esfandyari

Collaboration between Service and R&D Organizations – Two Cases in Automation Industry – (pg 51-59) Jukka Kääriäinen, Susanna Teppola, Antti Välimäki

Load Balancing in Wireless Mesh Network: a Survey – (pg 60-64) Maryam Asgari, Mohammad Shahverdy, Mahmood Fathy, Zeinab Movahedi

Mobile Banking Supervising System - Issues, Challenges & Suggestions to improve Mobile Banking Services – (pg 65-67) Dr. K. Kavitha

A Survey on Security Issues in Big Data and NoSQL – (pg 68-72) Ebrahim Sahafizadeh, Mohammad Ali Nematbakhsh

Classifying Protein-Protein Interaction Type based on Association Pattern with Adjusted Support – (pg 73-79) Huang-Cheng Kuo, Ming-Yi Tai

Digitalization Boosting Novel Digital Services for Consumers – (pg 80-92) Kaisa Vehmas, Mari Ervasti, Maarit Tihinen, Aino Mensonen

GIS-based Optimal Route Selection for Oil and Gas Pipelines in Uganda – (pg 93-104) Dan Abudu, Meredith Williams

Hybrid Trust-Driven Recommendation System for E-commerce Networks – (pg 105-112) Pavan Kumar K. N, Samhita S Balekai, Sanjana P Suryavamshi, Sneha Sriram, R. Bhakthavathsalam

Correlated Appraisal of Big Data, Hadoop and MapReduce – (pg 113-118) Priyaneet Bhatia, Siddarth Gupta

Combination of PSO Algorithm and Naive Bayesian Classification for Parkinson Disease Diagnosis – (pg 119-125) Navid Khozein Ghanad, Saheb Ahmadi

Automatic Classification for Vietnamese News – (pg 126-132) Phan Thi Ha, Nguyen Quynh Chi

Practical applications of spiking neural network in and learning – (pg 133-137) Fariborz Khademian, Reza Khanbabaie

ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No. 16, July 2015, ISSN: 2322-5157, www.ACSIJ.org

A Survey on the Privacy Preserving Algorithm and techniques of Association Rule Mining

Maryam Fouladfar1, Mohammad Naderi Dehkordi2

Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Isfahan, Iran

[email protected], [email protected]

Abstract

In recent years, data mining has become a popular analysis tool to extract knowledge from large collections of data. One of the great challenges of data mining is finding hidden patterns without revealing sensitive information. Privacy-preserving data mining (PPDM) is the answer to such challenges. It is a major research area concerned with protecting sensitive data or knowledge while data mining techniques can still be applied efficiently. Association rule hiding is one of the techniques of PPDM, used to protect the association rules generated by association rule mining. In this paper, we provide a survey of association rule hiding methods for privacy preservation. Various algorithms have been designed for this purpose in recent years; we summarize them and survey the currently existing techniques for association rule hiding.

Keywords: Association Rule Hiding, Data Mining, Privacy Preserving Data Mining.

1. Motivation

Computers have promised us a fountain of wisdom but delivered a deluge of information. This huge amount of data makes it crucial to develop tools to discover what is called hidden knowledge; these tools are called data mining tools. Data mining thus promises to discover what is hidden, but what if that hidden knowledge is sensitive, and its owners would not be happy if it were exposed to the public or to adversaries? This problem motivated the writing of this paper.

2. Introduction

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information. A number of data mining techniques have been suggested in recent years to perform privacy-preserving data mining. Data mining techniques have been developed successfully to extract knowledge in order to support a variety of domains: marketing, weather forecasting, medical diagnosis, and national security. But it is still a challenge to mine certain kinds of data without violating the data owners' privacy. For example, how to mine patients' private data is an ongoing problem in health care applications. As data mining becomes more pervasive, privacy concerns are increasing, and commercial organizations are also concerned with the privacy issue. Most organizations collect information about individuals for their own specific needs. Very frequently, however, different units within an organization may find it necessary to share information. In such cases, each organization or unit must be sure that the privacy of the individual is not violated or that sensitive business information is not revealed. Consider, for example, a government, or more appropriately one of its security branches, interested in developing a system for determining, from passengers whose baggage has been checked, those who must be subjected to additional security measures. The data indicating the necessity for further examination derives from a wide variety of sources such as police records; airports; banks; general government statistics; and passenger information records that generally include personal information, demographic data, flight information, and expenditure data. In most countries this information is regarded as private, and to avoid intentionally or unintentionally exposing confidential information about an individual, it is against the law to make such information freely available. While various means of preserving individual information have been developed, there are ways of circumventing these methods. In our example, in order to preserve privacy, passenger information records can be de-identified before the records are shared with anyone who is not permitted to access the relevant data directly. This can be accomplished by deleting unique identity fields from the dataset.


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved.

However, even if this information is deleted, there are still other kinds of information, personal or behavioral, that, when linked with other available datasets, could potentially identify subjects. To avoid these types of violations, we need various privacy-preserving data mining algorithms. We review recent work on these topics. This paper first covers the data mining background, while its main part introduces the different approaches and algorithms of privacy-preserving data mining for sanitizing sensitive knowledge in the context of mining association rules or itemsets, with brief descriptions. It concentrates on the different classifications of privacy-preserving data mining approaches.

3. Privacy Preserving Data Mining Concepts

Today, as the usage of data mining technology increases, securing information against unauthorized disclosure is one of the most important issues in preserving the privacy of data mining [1]. Privacy is the state or condition of being isolated from the view or presence of others [2]; associated with data mining, it means that we are able to conceal sensitive information from revelation to the public [1]. Therefore, to protect sensitive rules from unauthorized publishing, privacy-preserving data mining (PPDM) has become a focus of the data mining and database security fields [3].

3.1 Association Rule Mining Strategy

Association rules are an important class of regularities within data which have been extensively studied by the data mining community. The problem of mining association rules can be stated as follows. Given a set of items I = {i1, i2, ..., im} and a set of transactions T = {t1, t2, ..., tn}, each transaction ti contains items of the itemset I, i.e., ti ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = Ø. X (or Y) is a set of items, called an itemset. In the rule X → Y, X is called the antecedent and Y the consequent; the value of the antecedent implies the value of the consequent. The antecedent, also called the "left hand side" of a rule, can consist either of a single item or of a whole set of items, and the same applies to the consequent, also called the "right hand side". Often a compromise has to be made between discovering all itemsets and time. Generally, only those itemsets that fulfill a certain support requirement are taken into consideration.

Support and confidence are the two most important quality measures for evaluating the interestingness of a rule. The support of the rule X → Y is the percentage of transactions in T that contain X ∪ Y, i.e., all the items of the rule; it determines how frequently the rule is applicable to the transaction set T. The support of a rule is given by formula (1):

supp(X → Y) = |X ∪ Y| / n    (1)

where |X ∪ Y| is the number of transactions that contain all the items of the rule and n is the total number of transactions. The confidence of a rule describes the percentage of transactions containing X which also contain Y. It is given by (2):

conf(X → Y) = |X ∪ Y| / |X|    (2)

where |X| is the number of transactions that contain X. Confidence is a very important measure to determine whether a rule is interesting or not. The process of mining association rules consists of two main steps. The first step identifies all the itemsets contained in the data that are adequate for mining association rules; these combinations have to show at least a certain frequency and are thus called frequent itemsets. The second step generates rules out of the discovered frequent itemsets; all rules whose confidence exceeds the minimum confidence are regarded as interesting.

3.2 Side Effects

As presented in Fig. 1, R denotes all association rules in the database D, SR the sensitive rules, ~SR the non-sensitive rules, and R' the rules discovered in the sanitized database D'. The circles numbered 1, 2 and 3 are possible problems that respectively represent the sensitive association rules that failed to be censored, the legitimate rules accidentally missed, and the artificial association rules created by the sanitization process.
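As a concrete illustration of the support and confidence measures of formulas (1) and (2), the following minimal Python sketch computes them over a toy transaction set. The item names and data are invented for illustration only; a transaction is modeled as a Python set, and `items <= t` tests whether a transaction contains all items of the rule.

```python
# Toy transaction database T; item names are illustrative only.
T = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

def support(X, Y, transactions):
    """supp(X -> Y), formula (1): fraction of transactions containing
    all items of the rule (X union Y)."""
    items = set(X) | set(Y)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y), formula (2): fraction of transactions containing X
    that also contain Y."""
    n_x = sum(set(X) <= t for t in transactions)
    n_xy = sum((set(X) | set(Y)) <= t for t in transactions)
    return n_xy / n_x if n_x else 0.0

print(support({"bread"}, {"milk"}, T))     # 0.6  (3 of the 5 transactions)
print(confidence({"bread"}, {"milk"}, T))  # 0.75 (3 of the 4 containing bread)
```

In the two-step mining process described above, the first step would keep only itemsets whose support clears a minimum threshold, and the second step would keep only rules whose confidence clears the minimum confidence.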



Fig. 1 Side Effects

The percentage of sensitive information that is still discovered after the data has been sanitized gives an estimate of the hiding failure parameter. Most of the developed privacy-preserving algorithms are designed with the goal of obtaining zero hiding failure; thus, they hide all the patterns considered sensitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Thus, some PPDM algorithms have recently been developed which allow one to choose the amount of sensitive data that should be hidden, in order to find a balance between privacy and knowledge discovery. For example, in [4], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are discovered from the sanitized database. It is measured as (3):

HF = #RP(D') / #RP(D)    (3)

where #RP(D) and #RP(D') denote the number of restrictive patterns discovered from the original database D and the sanitized database D' respectively. Ideally, HF should be 0. In their framework, they give a specification of a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which allows one to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern.

When quantifying information loss in the context of other data usages, it is useful to distinguish between lost information, representing the percentage of non-sensitive patterns (i.e., association or classification rules) which are hidden as a side effect of the hiding process, and artifactual information, representing the percentage of artifactual patterns created by the adopted privacy-preserving technique. For example, in [4], Oliveira and Zaiane define two metrics, misses cost and artifactual pattern, which correspond to lost information and artifactual information respectively. In particular, misses cost measures the percentage of non-restrictive patterns that are hidden after the sanitization process. This happens when some non-restrictive patterns lose support in the database due to the sanitization process. The misses cost (MC) is computed as (4):

MC = (#~RP(D) − #~RP(D')) / #~RP(D)    (4)

where #~RP(D) and #~RP(D') denote the number of non-restrictive patterns discovered from the original database D and the sanitized database D' respectively. In the best case, MC should be 0%. Notice that there is a compromise between the misses cost and the hiding failure in their approach: the more restrictive patterns they hide, the more legitimate patterns they miss. The other metric, artifactual pattern (AP), is measured in terms of the percentage of the discovered patterns that are artifacts. The formula is (5):

AP = (|P'| − |P ∩ P'|) / |P'|    (5)

where |X| denotes the cardinality of X, P is the set of patterns discovered from the original database and P' the set discovered from the sanitized one. According to their experiments, their approach does not have any artifactual patterns, i.e., AP is always 0. In the case of association rules, the lost information can be modeled as the set of non-sensitive rules that are accidentally hidden by the privacy preservation technique, referred to as lost rules; the artifactual information, instead, represents the set of new rules, also known as ghost rules, that can be extracted from the database after the application of a sanitization technique.

4. Different Approaches in PPDM

Many approaches have been proposed in PPDM in order to censor sensitive knowledge or sensitive association rules [5,6]. Two classifications of the existing sanitizing algorithms of PPDM are shown in Fig. 2.
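The three side-effect metrics of formulas (3)-(5) can be illustrated with a small Python sketch. The rule labels below are hypothetical, and, as a simplification, the sanitized-database counts #RP(D') and #~RP(D') are taken as intersections with the full set of patterns discovered after sanitization.

```python
def hiding_failure(sensitive, discovered_after):
    """HF, formula (3): fraction of restrictive (sensitive) patterns
    still discoverable after sanitization."""
    return len(sensitive & discovered_after) / len(sensitive)

def misses_cost(non_sensitive, discovered_after):
    """MC, formula (4): fraction of non-restrictive patterns lost as a
    side effect of sanitization."""
    return (len(non_sensitive) - len(non_sensitive & discovered_after)) / len(non_sensitive)

def artifactual_patterns(discovered_before, discovered_after):
    """AP, formula (5): fraction of patterns found after sanitization
    that are artifacts (ghost rules)."""
    return (len(discovered_after) - len(discovered_before & discovered_after)) / len(discovered_after)

# Hypothetical rule labels before (D) and after (D') sanitization.
sensitive     = {"A->B", "C->D"}
non_sensitive = {"E->F", "G->H", "I->J"}
found_after   = {"C->D", "E->F", "G->H", "K->L"}  # one leaked rule, one ghost rule

print(hiding_failure(sensitive, found_after))                        # 0.5
print(misses_cost(non_sensitive, found_after))                       # 0.333...
print(artifactual_patterns(sensitive | non_sensitive, found_after))  # 0.25
```

The example exhibits all three side effects of Fig. 1 at once: one sensitive rule leaked (HF > 0), one legitimate rule lost (MC > 0), and one ghost rule created (AP > 0).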



[Fig. 2 Classification of Approaches: sanitizing algorithms divide into data-sharing approaches (item restriction-based, item addition-based, item obfuscation-based, and rule restriction-based) and pattern-sharing approaches; sanitizing techniques divide into heuristic-based techniques (data distortion and data blocking), border-based, exact, reconstruction-based, and cryptography-based techniques.]

Sanitizing algorithms

Data-sharing: In the data-sharing technique, data are communicated between parties without analysis or any statistical techniques. In this approach, the algorithms change the database by producing distorted data in the database [6,7,8].

Pattern-sharing: In the pattern-sharing technique, the algorithm tries to sanitize the rules which are mined from the data set [6,8,9].

Sanitizing techniques

Heuristic-Based: Heuristic-based techniques resolve how to select the appropriate data sets for data modification. Since optimal selective data modification or sanitization is an NP-hard problem, heuristics are used to address the complexity issues. The methods of heuristic-based modification include perturbation, which is accomplished by the alteration of an attribute value to a new value (i.e., changing a 1-value to a 0-value, or adding noise), and blocking, which is the replacement of an existing attribute value with a "?" [10,11,12]. Some of the approaches used are as follows.

M. Atallah et al. [13] tried to deal with the problem of limiting the disclosure of sensitive rules. They attempt to selectively hide some frequent itemsets from large databases with as little impact as possible on other, non-sensitive frequent itemsets. They tried to hide sensitive rules by modifying the given database so that the support of a given set of sensitive rules, mined from the database, decreases below the minimum support value.

N. Radadiya [14] proposed an algorithm called ADSRRC which tries to improve the DSRRC algorithm. DSRRC could not hide association rules with multiple items in the antecedent (L.H.S.) and consequent (R.H.S.), so ADSRRC uses a count of items in the consequent of the sensitive rules and also modifies the minimum number of transactions, in order to hide the maximum number of sensitive rules while maintaining data quality.

Y. Guo [15] proposed a framework with three phases: mining the frequent set, performing a sanitization algorithm over the frequent itemsets, and generating the released database by using FP-tree-based inverse frequent set mining.

Border-Based: In this approach, using the concept of borders, the algorithm tries to preprocess the sensitive rules so that the minimum number of them will be censored. Afterward, database quality is maintained while side effects are minimized [14,9]. One of the approaches used is as follows.

Y. Jain et al. [16] proposed two algorithms called ISL (Increase Support of Left hand side) and DSR (Decrease Support of Right hand side) to hide useful association rules from transaction data. In the ISL method, the confidence of a rule is decreased by increasing the support value of the Left Hand Side (L.H.S.) of the rule, so the items from the L.H.S. of a rule are chosen for modification. In the DSR method, the confidence of a rule is decreased by decreasing the support value of the Right Hand Side (R.H.S.) of a rule, so items from the R.H.S. of a rule are chosen for modification. Their algorithm prunes the number of hidden rules with the same number of transactions scanned, less CPU time and fewer modifications.
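The DSR idea described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the choice of the "victim" R.H.S. item and the toy database are invented, and a real algorithm would pick transactions and items more carefully to minimize side effects.

```python
def dsr_hide(transactions, lhs, rhs, min_conf):
    """Sketch of the DSR heuristic: delete an R.H.S. item from transactions
    that support the sensitive rule until conf(lhs -> rhs) drops below
    min_conf. Transactions are mutated in place."""
    lhs, rhs = set(lhs), set(rhs)
    victim = next(iter(rhs))  # R.H.S. item whose support will be decreased

    def conf():
        n_l = sum(lhs <= t for t in transactions)
        n_lr = sum((lhs | rhs) <= t for t in transactions)
        return n_lr / n_l if n_l else 0.0

    for t in transactions:
        if conf() < min_conf:
            break  # the sensitive rule is now hidden
        if (lhs | rhs) <= t:   # this transaction supports the sensitive rule
            t.discard(victim)  # lowers supp(lhs U rhs) and hence the confidence
    return transactions

D = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
dsr_hide(D, {"a"}, {"b"}, min_conf=0.5)
# conf(a -> b) starts at 3/4 and is driven below the 0.5 threshold
```

Note that each deletion can also lower the support of non-sensitive rules involving the victim item, which is exactly the misses-cost side effect discussed in Section 3.2.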



Exact: In this approach, the hiding problem is formulated as a constraint satisfaction problem (CSP). The solution of the CSP provides the minimum number of transactions that have to be sanitized from the original database. The CSP is then solved with the help of binary integer programming (BIP) solvers such as ILOG CPLEX, GNU GLPK or XPRESS-MP [14,9]. Although this approach presents a better solution than the other approaches, the high time complexity of solving the CSP is a major problem. Gkoulalas-Divanis and Verykios proposed an approach for finding an optimal solution to rule hiding problems [17].

Reconstruction-Based: A number of recently proposed techniques address the issue of privacy preservation by perturbing the data and reconstructing the distributions at an aggregate level in order to perform the association rule mining. That is, these algorithms are implemented by perturbing the data first and then reconstructing the distributions. Depending on the method of reconstructing the distributions and on the data types, the corresponding algorithms differ. Some of the approaches used are as follows.

Agrawal et al. [18] used a Bayesian algorithm for distribution reconstruction in numerical data. Then, Agrawal et al. [19] proposed a uniform randomization approach on reconstruction-based association rules to deal with categorical data. Before sending a transaction to the server, the client takes each item and, with probability p, replaces it by a new item not originally present in this transaction. This process is called uniform randomization; it generalizes Warner's "randomized response" method. The authors of [20] improved the work over the Bayesian-based reconstruction procedure by using an EM algorithm for distribution reconstruction.

Chen et al. [21] first proposed a Constraint-based Inverse Itemset Lattice Mining procedure (CIILM) for hiding sensitive frequent itemsets. Their data reconstruction is based on the itemset lattice. Another emerging privacy-preserving data sharing method related to inverse frequent itemset mining is inferring the original data from the given frequent itemsets. This idea was first proposed by Mielikainen [22], who showed that finding a dataset compatible with a given collection of frequent itemsets is NP-complete.

An FP-tree based method is presented in [23] for inverse frequent set mining, which is based on a reconstruction technique. The whole approach is divided into three phases: the first phase uses a frequent itemset mining algorithm to generate all frequent itemsets, with their supports and support counts, from the original database D; the second phase runs a sanitization algorithm over the frequent itemsets FS and obtains the sanitized frequent itemsets FS'; the third phase generates the released database D' from FS' by using an inverse frequent set mining algorithm. But this algorithm is very complex, as it involves the generation of a modified dataset from the frequent set.

Cryptography-Based: In many cases, multiple parties may wish to share aggregate private data without leaking any sensitive information at their end. This requires secure cryptographic protocols for sharing the information across the different parties [24,25,26,27]. One of the approaches used is as follows.

The paper by Assaf Schuster et al. [28] presents a cryptographic privacy-preserving association rule mining algorithm in which all of the cryptographic primitives involve only pairs of participants. The advantage of this algorithm is its scalability; the disadvantage is that a rule cannot be confirmed correct before the algorithm gathers information from k resources. Thus, candidate generation occurs more slowly, and hence there is a delay in the convergence of the recall. The amount of manager consultation messages is also high.

5. Conclusion

We have presented a classification and an extended description of various privacy-preserving algorithms for association rule mining. The work presented here indicates the ever increasing interest of researchers in the area of securing sensitive data and knowledge from malicious users. At present, privacy preservation is still at the stage of development. Many privacy-preserving algorithms for association rule mining have been proposed; however, privacy-preserving technology needs to be researched further because of the complexity of the privacy problem.

References

[1] S.R.M. Oliveira, O.R. Zaiane, Y. Saygin, "Secure association rule sharing", in: Advances in Knowledge Discovery and Data Mining, Proceedings of the 8th Pacific-Asia Conference (PAKDD 2004), Sydney, Australia, 2004, pp. 74–85.
[2] Elena Dasseni, Vassilios S. Verykios, Ahmed K. Elmagarmid, and Elisa Bertino, "Hiding Association Rules by using Confidence and Support", in: Proceedings of the 4th Information Hiding Workshop, 2001, pp. 369–383.



[3] Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., and Dasseni, E., "Association rule hiding", IEEE Transactions on Knowledge and Data Engineering, 2004, 16(4):434–447.
[4] Oliveira, S.R.M., Zaiane, O.R., "Privacy preserving frequent itemset mining", in: IEEE ICDM Workshop on Privacy, Security and Data Mining, vol. 14, 2002, pp. 43–54.
[5] Oliveira, S.R.M., Zaiane, O.R., "A unified framework for protecting sensitive association rules in business collaboration", International Journal of Business Intelligence and Data Mining, 2006, 1:247–287.
[6] HajYasien, A., "Preserving privacy in association rule mining", Ph.D. Thesis, Griffith University, 2007.
[7] Oliveira, S.R.M., Zaiane, O.R., Saygin, Y., "Secure association rule sharing", in: Advances in Knowledge Discovery and Data Mining, Springer, 2004, pp. 74–85.
[8] Verykios, V.S., Gkoulalas-Divanis, A., "Chapter 11: A Survey of Association Rule Hiding Methods for Privacy", in: Privacy-Preserving Data Mining, 2008, pp. 267–289.
[9] Gkoulalas-Divanis, A., Verykios, V.S., Association Rule Hiding for Data Mining, Springer, 2010.
[10] Oliveira, S.R.M., Zaiane, O.R., "Privacy preserving frequent itemset mining", in: Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, vol. 14, 2002, pp. 43–54.
[11] Verykios, V.S., Pontikakis, E.D., Theodoridis, Y., Chang, L., "Efficient algorithms for distortion and blocking techniques in association rule hiding", Distributed and Parallel Databases, 2007, 22:85–104. doi:10.1007/s10619-007-7013-0
[12] Saygin, Y., Verykios, V.S., Clifton, C., "Using unknowns to prevent discovery of association rules", ACM SIGMOD Record, 2001, 30:45–54.
[13] Atallah, M., Bertino, E., Elmagarmid, A., et al., "Disclosure limitation of sensitive rules", in: Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange, 1999.
[14] Radadiya, N.R., Prajapati, N.B., Shah, K.H., "Privacy Preserving in Association Rule Mining", 2013, 2:208–213.
[15] Guo, Y., "Reconstruction-based association rule hiding", in: Proceedings of the SIGMOD 2007 Ph.D. Workshop on Innovative Database Research, 2007, pp. 51–56.
[16] Jain, Y.K., Yadav, V.K., Panday, G.S., "An Efficient Association Rule Hiding Algorithm for Privacy Preserving Data Mining", International Journal of Computer Science and Engineering, 2011, 3:2792–2798.
[17] Gkoulalas-Divanis, A., Verykios, V.S., "An integer programming approach for frequent itemset hiding", in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ACM Press, New York, USA, 2006, pp. 748–757.
[18] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, and Michael Y. Zhu, "Tools for privacy preserving distributed data mining", SIGKDD Explorations, vol. 4, no. 2, 2002.
[19] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke, "Privacy Preserving Mining of Association Rules", SIGKDD 2002, Edmonton, Alberta, Canada, 2002.
[20] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms", in: Proceedings of the 20th Symposium on Principles of Database Systems, Santa Barbara, California, USA, May 2001.
[21] Chen, X., Orlowska, M., and Li, X., "A new framework for privacy preserving data sharing", in: Proceedings of the 4th IEEE ICDM Workshop: Privacy and Security Aspects of Data Mining, IEEE Computer Society, 2004, pp. 47–56.
[22] Mielikainen, T., "On inverse frequent set mining", in: Proceedings of the 3rd IEEE ICDM Workshop on Privacy Preserving Data Mining, IEEE Computer Society, 2003, pp. 18–23.
[23] ZongBo Shang, Hamerlinck, J.D., "Secure Logistic Regression of Horizontally and Vertically Partitioned Distributed Databases", in: Data Mining Workshops (ICDM Workshops 2007), Seventh IEEE International Conference, 28-31 Oct. 2007, pp. 723–728.
[24] Du, W., Atallah, M., "Secure Multi-party Computation: A Review and Open Problems", CERIAS Tech. Report 2001-51, Purdue University, 2001.
[25] Ioannidis, I., Grama, A., Atallah, M., "A secure protocol for computing dot-products in clustered and distributed environments", in: Proceedings of the International Conference on Parallel Processing, 18-21 Aug. 2002, pp. 379–384.
[26] A. Sanil, A. Karr, X. Lin, and J. Reiter, "Privacy preserving analysis of vertically partitioned data using secure matrix products", Journal of Official Statistics, 2007.
[27] M. Kantarcioglu, C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data", in: The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), Madison, Wisconsin, 2002, pp. 24–31.
[28] Assaf Schuster, Ran Wolff, Bobi Gilburd, "Privacy-Preserving Association Rule Mining in Large-Scale Distributed Systems", in: Fourth IEEE International Symposium on Cluster Computing and the Grid, 2004.

6

Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

ACASYA: a knowledge-based system to aid in the storage, classification, assessment and generation of accident scenarios. Application to the safety of rail transport systems

Dr. Habib HADJ-MABROUK1, Dr. Hinda MEJRI2

1 Accredited research supervisor, French Institute of Science and Technology for Transport, Land and Networks (IFSTTAR), France
[email protected]

2 Assistant Professor, Higher Institute of Transport and Logistics, University of Sousse, Tunisia
[email protected]

Abstract

Various researches in artificial intelligence are conducted to understand the problem of the transfer of expertise. Today we perceive two major independent research activities: knowledge acquisition, which aims to define methods inspired notably by software engineering and cognitive psychology to better understand the transfer of expertise, and machine learning, which proposes the implementation of inductive, deductive or abductive techniques in order to equip systems with learning abilities. The development of a knowledge-based support system such as ACASYA for the safety analysis of guided transport systems led us to use both approaches jointly and in a complementary way. The purpose of this tool is, first, to evaluate the completeness and consistency of accident scenarios and, secondly, to contribute to the generation of new scenarios that could help experts to conclude on the safe character of a new system. ACASYA consists of three learning modules: CLASCA, EVALSCA and GENESCA, dedicated respectively to the classification, evaluation and generation of accident scenarios.

Key-words: Transport system, Safety, Accident scenario, Acquisition, Assessment, Artificial intelligence, Expert system.

1. Regulatory context of research

Railway safety, formerly within the competence of the Member States alone and long neglected by the European Union, will gradually become an almost exclusive field of Community policy, in particular by means of the interoperability project. The European interest thus appears through the creation of Community institutions such as the European Railway Agency (ERA), with which France will have to collaborate, but also through the installation of safety checking and evaluation tools such as the statistical statement of rail transport or the common safety targets and methods. These measures will be imposed on France, as was the case for the introduction of the railway infrastructure manager and for the national safety authorities (NSA).

Parallel to this European dash, one also notes an awakening in France since decree 2000-286 of 30/03/00 relative to railway safety, which replaces the decree of 22/03/42, hitherto the only legal reference on the matter. France is also setting up new mechanisms, contained in laws and regulations, in order to improve the safety level. We note the introduction of independent bodies or technical services (ITS) in charge of certification and of the technical organization of investigation, or even the decree related to the physical and professional ability conditions of staff. Concerning the aptitude of staff, it is necessary to stress that the next challenge for Europe is the necessary harmonization of working conditions, which is at the same time a requirement for safety and for interoperability.

This study thus shows that railway safety, from a theoretical and legal perspective, undergoes and will undergo many changes. We notice in particular the presence of a multiplicity of actors who support and share the responsibility for railway safety in France and Europe.



Whether they are public or private, they all have obligations to respect and are partly subject to the control of independent bodies.

2. Introduction

As part of its missions of expertise and technical assistance, IFSTTAR evaluates the safety files of guided transportation systems. These files include several hierarchical safety analyses such as the preliminary risk analysis (PRA), the functional safety analysis (FSA), the analysis of failure modes, their effects and their criticality (FMECA), or the analysis of the impact of software errors [2] and [3]. These analyses are carried out by the manufacturers. It is advisable to examine them with the greatest care, so much does their quality condition, in the end, the safety of the users of the transport systems. Independently of the manufacturer, the experts of IFSTTAR carry out complementary safety analyses. They are led to imagine new scenarios of potential accidents in order to perfect the exhaustiveness of the safety studies. In this process, one of the difficulties consists in finding the abnormal scenarios capable of leading to a particular potential accident. This is the fundamental point which justified this work.

The ACASYA tool [4], which is the subject of this paper, provides assistance in particular during the phase in which the completeness of the functional safety analysis (FSA) is evaluated. Generally, the aim of the FSA is to ensure that all safety measures have been considered in order to cover the hazards identified in the preliminary hazard analyses and, therefore, to ensure that all safety measures are taken into account to cover potential accidents. These analyses provide safety criteria for system design and for the implementation of hardware and software safety. They also impose safety criteria related to the sizing, operation and maintenance of the system. They can bring out adverse safety scenarios that require amending the specification.

3. Approach used to develop the "ACASYA" system

The modes of reasoning which are used in the context of safety analysis (inductive, deductive, analogical, etc.) and the very nature of safety knowledge (incomplete, evolving, empirical, qualitative, etc.) mean that a conventional computing solution is unsuitable and that the utilization of artificial intelligence techniques would seem more appropriate. The aim of artificial intelligence is to study and simulate human intellectual activities. It attempts to create machines capable of performing intellectual tasks and has the ambition of giving computers some of the functions of the human mind: learning, recognition, reasoning or linguistic expression. Our research has involved three specific aspects of artificial intelligence: knowledge acquisition, machine learning and knowledge-based systems (KBS).

The development of the knowledge base of a KBS requires the use of knowledge acquisition techniques and methods in order to collect, structure and formalize knowledge. It has not been possible with knowledge acquisition alone to extract effectively some types of expert knowledge needed to analyse and evaluate safety. Therefore, the use of knowledge acquisition in combination with machine learning appears to be a very promising solution. The approach which was adopted in order to design and implement the ACASYA tool involved the following two main activities:
 Extracting, formalizing and storing hazardous situations to produce a library of standard cases which covers the entire problem. This is called a historical scenario knowledge base. This process entailed the use of knowledge acquisition techniques;
 Exploiting the stored historical knowledge in order to develop safety analysis know-how which can assist experts to judge the thoroughness of the manufacturer's suggested safety analysis. This second activity involves the use of machine learning techniques.

While cognitive psychology and software engineering have generated support methods and tools for knowledge acquisition, the exploitation of these methods remains limited in a complex industrial context. We estimate that, located downstream, machine learning can advantageously contribute to complete and strengthen conventional knowledge acquisition.

The application of knowledge acquisition means, described further in [5], led primarily to the development of a generic model of accident scenario representation and to the establishment of a historical knowledge base of scenarios that includes about sixty scenarios for the risk of collision. Knowledge acquisition is however faced with the difficulty of extracting the expertise evoked in each step of the safety evaluation process. This difficulty emanates from the complexity of the expertise, which encourages the experts, naturally, to express their know-how through significant examples or accident scenarios experienced on automated transport systems already certified or approved. Consequently, the transfer of expertise must be done from examples. Machine learning [6] and [7] makes it possible to facilitate the transfer of knowledge, in particular from experimental examples. It contributes to the development



of KBS knowledge bases while reducing the intervention of the knowledge engineer.

Indeed, the experts generally consider that it is simpler to describe experimental examples or cases than to clarify their decision-making processes. The introduction of machine learning systems operating on examples allows the generation of new knowledge that can help the expert to solve a particular problem. The expertise of a field is not only held by the experts but is also, implicitly, distributed and stored in a mass of historical data that the human mind finds difficult to synthesize. To extract from this mass of information relevant knowledge for an explanatory or decisional aim constitutes one of the objectives of machine learning.

Learning from examples is however insufficient to acquire all the know-how of experts, and requires the application of knowledge acquisition to identify the problem to solve and to extract and formalize the knowledge accessible by the usual means of acquisition. In this sense, each of the two approaches can fill the weaknesses of the other. To improve the process of transferring expertise, it is thus interesting to reconcile these two approaches. Our approach is to exploit, by learning, the base of example scenarios, in order to produce knowledge that can help the experts in their mission of system safety evaluation.

4. The "ACASYA" system of aid to safety analysis

The ACASYA system [1] and [4] is based on the combined utilization of knowledge acquisition techniques and machine learning. This tool has two main characteristics. The first is the consideration of the incremental aspect, which is essential to achieve a gradual improvement of the knowledge learned by the system. The second characteristic is the man/machine co-operation, which allows experts to correct and supplement the initial knowledge produced by the system. Unlike the majority of decision-making aid systems, which are intended for a non-expert user, this tool is designed to co-operate with experts in order to assist them in their decision making. The ACASYA organization is such that it reproduces as much as possible the strategy adopted by experts. Summarized briefly, safety analysis involves an initial recognition phase during which the scenario in question is assimilated to a family of scenarios known to the expert. This phase requires a definition of scenario classes. In a second phase, the expert evaluates the scenario in an attempt to evolve unsafe situations which have not been considered by the manufacturer. These situations provide a stimulus to the expert in formulating new accident scenarios.

4.1. Functional organization of the "ACASYA" system

As is shown in figure 1, this organization consists of four main modules. The first, a formalization module, deals with the acquisition and representation of a scenario and is part of the knowledge acquisition phase. The three other modules, CLASCA, EVALSCA and GENESCA, under the general principle presented previously, cover the problems of classification, evaluation and generation.

Fig. 1: Functional organization of the ACASYA system [1]

4.2. Functional architecture of the "CLASCA" system mock-up

CLASCA [8] is a learning system which uses examples in order to find classification procedures. It is inductive, incremental and dedicated to the classification of accident scenarios. In CLASCA, the learning process is, on the one hand, nonmonotonic, so that it is able to deal with incomplete accident scenario data, and, on the other hand, interactive (supervised), so that the knowledge produced by the system can be checked and so as to assist the expert in formulating his expertise. CLASCA incrementally develops disjunctive descriptions of the classes of historical scenarios, with the dual purpose of characterizing a set of unsafe situations and of recognizing and identifying a new scenario which is submitted to the experts for evaluation. CLASCA contains five main modules (figure 2):
1. A scenario input module;
2. A predesign module which is used to assign values to the parameters and learning constraints which are



required by the system. These parameters mainly affect the relevance and quality of the learned knowledge and the convergence speed of the system;
3. An induction module for learning descriptions of scenario classes;
4. A classification module, which aims to deduce the membership class of a new scenario from the class descriptions induced previously, by referring to an adequacy rate;
5. A dialogue module for the justification of the system's reasoning and for the decision of the experts. For justification, the system keeps a trace of the deduction phase in order to construct its explanation. Following this justification phase of the classification decisions, the expert decides either to accept the proposed classification (in which case CLASCA will learn the scenario) or to reject it. In the second case it is the expert who decides what subsequent action should be taken. He may, for example, modify the learning parameters, create a new class, edit the description of the scenario or put the scenario to one side for later inspection.

Fig. 2: Architecture of the CLASCA system mock-up

4.3. Functional architecture of the "EVALSCA" system mock-up

The objective of the EVALSCA module [1] and [4] is to confront the list of the summarized failures (sf) proposed in the scenario to evaluate with the list of archived historical summarized failures, in order to stimulate the formulation of unsafe situations not considered by the manufacturer. A sf is a generic failure, resulting from the combination of a set of elementary failures having the same effect on the system behavior. This evaluation approach makes it possible to draw the attention of the expert to possible failures not taken into account during the design phase which can endanger the safety of the transportation system. In this sense, it can promote the generation of new accident scenarios.

The second level of processing considers the class deduced by CLASCA in order to evaluate the scenario consistency. The evaluation approach is centered on the summarized failures which are involved in the new scenario to evaluate. The evaluation of this type of scenario involves the two modules below [4] (figure 3):
 A mechanism for learning CHARADE rules [9], which makes it possible to deduce sf recognition functions and thus to generate a base of evaluation rules;
 An inference engine, which exploits the above base of rules in order to deduce which sfs are to be considered in the new scenario to assess.
These two steps are detailed hereafter.

Fig. 3: Architecture of the EVALSCA system mock-up [3]

4.3.1. Learning of summarized failure recognition functions

This phase of learning attempts, using the base of examples formed previously, to generate a system of rules reflecting the recognition functions of summarized failures. The purpose of this stage is to generate a recognition function for each sf associated with a given class. The sf recognition function is a production rule which establishes a link between a set of facts (the parameters, or descriptors, which describe a scenario) and the sf fact. There is a logical dependency relationship, which can be expressed in the following form:

If Principle of cantonment (PC)
and Potential risks or accidents (PR)
and Functions related to the risk (FRR)
and Geographical zones (GZ)
and Actors involved (AI)
and Incident functions (IF)
Then Summarized failures (SF)

A base of evaluation rules can be generated for each class of scenarios. Any generated rule must contain the SF descriptor in its conclusion. It has proved necessary to use a learning method which allows production rules to be generated from a set of historical examples (or scenarios). The specification of the properties required of the learning system and the analysis of existing systems led us to choose the CHARADE mechanism [9]. Its capacity to generate automatically a system of rules, rather than isolated rules, and its ability to produce rules suitable for developing sf recognition functions give CHARADE an undeniable interest. A sample of rules generated by CHARADE is given below; they relate to the initialization sequence class.

If Actors involved = operator_itinerant,
and Incident_functions = instructions,
and Elements_involved = operator_in_cc
Then Summarized failures = SF11 (invisible element on the zone of completely automatic driving),
Actors involved = AD_with_redundancy,
Functions related to the risk = train_localization,
Geographical_zones = terminus

If Principle of cantonment = fixed_cantonment,
and Functions related to the risk = initialization,
and Incident_functions = instructions
Then Summarized failures = SF10 (erroneous re-establishment of safety frequency/high voltage permission),
Functions related to the risk = alarm_management,
Functions related to the risk = train_localization.

4.3.2. Deduction of the summarized failures which are to be considered in the scenario to evaluate

During the previous step, the CHARADE module created a system of rules from the current base of learning examples relative to the class Ck suggested by the CLASCA system. The sf deduction stage first requires a transfer phase, in which the rules which have been generated are transferred to an expert system in order to construct a scenario evaluation knowledge base. This base contains (figure 3):
 The base of rules, which is split into two parts: a current base of rules, which contains the rules which CHARADE has generated in relation to a class which CLASCA has suggested at the instant t, and a store base of rules, which is composed of the list of historical bases of rules. Once a scenario has been evaluated, a current base of rules becomes a store base of rules;
 The base of facts, which contains the parameters which describe the manufacturer's scenario to evaluate and which is enriched, over the inferences, with deduced facts or descriptors.

This scenario evaluation knowledge base (base of facts and base of rules), exploited by forward chaining by an inference engine, generates the summarized failures which must be involved in the description of the scenario to evaluate.

The plausible sfs deduced by the expert system are analyzed and compared with the sfs which have actually been considered in the scenario to evaluate. This confrontation can reveal one or more sfs not taken into account in the design of the protective equipment and likely to affect the safety of the transport system. This suggestion may assist in generating unsafe situations which have not been foreseen by the manufacturer during the specification and design phases of the system.

4.4. Functional architecture of the "GENESCA" system mock-up

In complement to the two previous levels of processing, which involve the static description of the scenario (descriptive parameters), the third level [10] involves in particular the dynamic description of the scenario (the Petri net model) as well as three mechanisms of reasoning: induction, deduction and abduction. The aid in the generation of a new scenario is based on the injection of a sf, declared possible by the previous level, into a particular sequencing of the Petri net marking evolution.
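The rule-based deduction of summarized failures described in section 4.3.2 amounts to forward chaining over a base of facts and a base of rules until no new fact can be deduced. The sketch below illustrates only the principle; the rules and descriptor values are invented for the example, not taken from the actual CHARADE-generated knowledge base.

```python
# Minimal forward-chaining sketch of the sf deduction step (section 4.3.2).
# Rules and descriptor values are ILLUSTRATIVE ONLY.

# A rule fires when all its premise facts are present in the fact base,
# and then adds its conclusion facts.
RULES = [
    ({"actors_involved=operator_itinerant", "incident_functions=instructions"},
     {"summarized_failures=SF11"}),
    ({"principle_of_cantonment=fixed", "functions_related_to_risk=initialization"},
     {"summarized_failures=SF10"}),
]

def forward_chain(facts, rules):
    """Saturate the base of facts: fire every enabled rule until a fixpoint
    is reached, as a forward-chaining inference engine would."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusions in rules:
            if premises <= facts and not conclusions <= facts:
                facts |= conclusions
                changed = True
    return facts

# Static description of a hypothetical new scenario to evaluate.
scenario = {"actors_involved=operator_itinerant",
            "incident_functions=instructions"}
deduced = forward_chain(scenario, RULES)

# The plausible sfs, to be confronted with those the scenario declares.
plausible_sfs = {f for f in deduced if f.startswith("summarized_failures=")}
print(sorted(plausible_sfs))  # ['summarized_failures=SF11']
```

An sf deduced here but absent from the manufacturer's scenario would be flagged to the expert as a possibly overlooked failure.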


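The injection of a sf into the marking evolution of a Petri net, on which the generation aid relies, can be illustrated with a deliberately tiny net. The places, transitions and injected failure below are invented for the example, and markings are modelled as sets of places (i.e. a 1-bounded, condition/event-style net), not as the actual GENESCA model.

```python
# Toy Petri net sketch: a summarized failure is injected as an extra
# transition, and the marking evolution is simulated to look for a
# hazardous combination of places. All names are hypothetical.

# transitions: name -> (input places consumed, output places produced)
TRANSITIONS = {
    "depart":        ({"train_at_station"}, {"train_in_block"}),
    "arrive":        ({"train_in_block"},   {"train_at_station_2"}),
    # injected sf: the occupied block is wrongly reported as free
    "sf_block_free": ({"train_in_block"},
                      {"train_in_block", "block_reported_free"}),
}

def enabled(marking, name):
    needed, _ = TRANSITIONS[name]
    return needed <= marking          # all input places marked

def fire(marking, name):
    needed, produced = TRANSITIONS[name]
    return (marking - needed) | produced

# simulate one firing sequence, skipping transitions that are not enabled
m = {"train_at_station"}
for t in ["depart", "sf_block_free", "arrive"]:
    if enabled(m, t):
        m = fire(m, t)

# the kind of unsafe situation an expert would look for in the trace
hazard = {"train_at_station_2", "block_reported_free"} <= m
print(sorted(m), hazard)  # ['block_reported_free', 'train_at_station_2'] True
```

Exploring such firing sequences with and without the injected sf is one way to surface embryos of accident scenarios for the expert to examine.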

This generation approach includes two distinct processes: static generation and dynamic generation (figure 4). The static approach seeks to derive new static descriptions of scenarios from the evaluation of a new scenario. It exploits, by machine learning, the whole set of historical scenarios in order to give an opinion on the static description of a new scenario.

If the purpose of the static approach is to reveal the static elements which describe the general context in which the new scenario proceeds, the dynamic approach is concerned with creating dynamics in this context in order to suggest sequences of events that could lead to a potential accident. The method consists, initially, in characterizing by learning the knowledge implied in the dynamic descriptions of the historical scenarios of the same class as the scenario to evaluate, and in representing it by a "generic" model. The next step is to animate this generic model by simulation in order to discover scenarios that could eventually lead to one or more adverse safety situations.

More precisely, the dynamic approach involves two principal phases (figure 4):
 A modeling phase, which must make it possible to work out a generic model of a class of scenarios. The modeling attempts to transform a set of Petri nets into rules written in propositional logic;
 A simulation phase, which exploits the previous model to generate possible dynamic descriptions of scenarios.

Fig. 4: Approach to aid the generation of embryos of accident scenarios

During the development of the GENESCA model, we met methodological difficulties. The model produced does not yet make it possible to generate new relevant and exploitable scenarios systematically, but only embryos of scenarios which will stimulate the imagination of the experts in the formulation of accident scenarios. Taking into account the absence of prior work in this field and the originality and complexity of the problem, this difficulty was predictable and solutions are under investigation.

5. Conclusion

The ACASYA system, created to assist safety analysis for automated terrestrial transit systems, satisfies classification, evaluation and generation objectives for accident scenarios. It demonstrates that machine learning and knowledge acquisition techniques are able to complement each other in the transfer of knowledge. Unlike diagnostic aid systems, ACASYA is presented as a tool to aid in the prevention of design defects. When designing a new system, the manufacturer undertakes to comply with the safety objectives. He must demonstrate that the system is designed so that all accidents are covered. Conversely, the certification experts aim to show that the system is not safe and, in that case, to identify the causes of insecurity. Built on this second approach, ACASYA is a tool that evaluates the completeness of the analysis proposed by the manufacturer. ACASYA is at the stage of a mock-up whose first validation demonstrates the interest of this method of aid to safety analysis, and which requires some improvements and extensions.

References

[1] Hadj-Mabrouk H. "Apport des techniques d'intelligence artificielle à l'analyse de la sécurité des systèmes de transport guidés", Revue Recherche Transports Sécurité, no 40, INRETS, France, 1993.

[2] Hadj-Mabrouk H. "Méthodes et outils d'aide aux analyses de sécurité dans le domaine des transports terrestres guidés", Revue Routes et Transports, Montréal-Québec, vol. 26, no 2, pp. 22-32, été 1996.

[3] Hadj-Mabrouk H. "Capitalisation et évaluation des analyses de sécurité des automatismes des systèmes de transport guidés", Revue Transport Environnement Circulation, Paris, TEC no 134, pp. 22-29, janvier-février 1996.

[4] Hadj-Mabrouk H. "ACASYA: a learning system for functional safety analysis", Revue Recherche Transports Sécurité, no 10, France, September 1994, pp. 9-21.



[5] Angele J., Sure Y. "Evaluation of ontology-based tools workshop", 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2002), Sigüenza, Spain, September 30, 2002, pp. 63-73.

[6] Cornuéjols A., Miclet L., Kodratoff Y. "Apprentissage artificiel : Concepts et algorithmes", Eyrolles, août 2002.

[7] Ganascia J.-G "L’intelligence artificielle", Cavalier Bleu Eds, Mai 2007.

[8] Hadj-Mabrouk H. "CLASCA, un système d'apprentissage automatique dédié à la classification des scénarios d'accidents", 9ème colloque international de fiabilité & maintenabilité. La Baule, France, 30 Mai-3 Juin 1994, p 1183 - 1188.

[9] Ganascia J.-G. "AGAPE et CHARADE : deux mécanismes d'apprentissage symbolique appliqués à la construction de bases de connaissances", Thèse d'Etat, Université Paris-sud, mai 1987.

[10] Mejri L. "Une démarche basée sur l'apprentissage automatique pour l'aide à l'évaluation et à la génération de scénarios d'accidents. Application à l'analyse de sécurité des systèmes de transport automatisés", Thèse de doctorat, Université de Valenciennes, 6 décembre 1995, 210 p.



Overview of routing algorithms in WBAN

Maryam Asgari 1, Mehdi Sayemir 2, Mohammad Shahverdy 3

1 Computer Engineering Faculty, Islamic Azad University, Tafresh, Iran
[email protected]

2 Safahan Institute of Higher Education, Esfahan, Iran
[email protected]

3 Computer Engineering Faculty, Islamic Azad University, Tafresh, Iran
[email protected]

Abstract

The development of wireless computer networks and advances in the fabrication of integrated electronic circuits, one of the key elements in making miniature sensors, make it possible to use wireless sensor networks for monitoring in and around the bodies of living beings. This field of research is called the wireless body area network (WBAN), and the IEEE has assigned two standards to it: IEEE 802.15.6 and IEEE 802.15.4. WBANs aim to facilitate, accelerate and improve the accuracy and reliability of medical care, and because of the wide range of challenges they raise, many studies have been devoted to this field. According to IEEE 802.15.6, the topology in a WBAN is a star, and one-hop and two-hop communications are supported; but due to changes in body position and the different postures that the human body takes (for example walking, running, sitting, ...), connecting nodes in one-hop or two-hop mode to the sink or PDA is not always possible. The possibility of using multi-hop communication, and as a result the existence of multiple routes between source and destination, raises the question of by which route and via which neighbour the transmitter sends its data to the receiver. So far, many routing algorithms have been proposed to address this question; in this article we evaluate them.

Keywords: Routing Algorithms, WBAN

1. Introduction

According to the latest census and statistical analyses, the population of the world is increasing; on the other hand, with the development of medical technologies and social security, increased life expectancy, and therefore the aging of the population, is inevitable [1]. The aging of the population, however, causes problems such as the need for medical care for the elderly, and thus leads to increased medical costs. Research on this subject shows that by the year 2022 medical expenses will reach 20% of America's GDP, which is in itself a major problem for the government. As another proof of this claim, we can mention the growth of medical costs in America from $250 billion in 1980 to $1.85 trillion in 2004. This is despite the fact that 45 million people in America are without health insurance. Examining these statistics brings only one thing to the researcher's mind: the need for a change in health systems so that the cost of treatment is lowered and health care in the form of prevention is raised [2, 3, 4, 5, 6 and 7].

WBANs have come to increase the speed and accuracy of health care and to provide quality of life while providing cost savings. The sensors in a WBAN may be placed inside or on the body. In both cases, nodes need to communicate wirelessly with the sink, and as a result they emit radiation that can increase the temperature of the nodes and their surrounding areas over long periods and consequently be harmful for the body and bring serious injury to surrounding tissues [1]. Broadly speaking, any proposal to reduce the amount of damage to the tissues is based on one of the following two rules:

1. Reducing the power of the signals sent by the transmitter to the sink;

2. Using multi-hop communication instead of one-hop communication.

It is clear that the lower the power of the sent signals, the smaller the damaged area surrounding the node; but by lowering the power of the sent signals, communications between transmitter and sink are more likely to



disconnect; in other words, link reliability is reduced. Because the uses of WBANs are so sensitive, keeping connections alive is essential, and providing guaranteed, highly reliable links is a high priority. All of these challenges make it inevitable to replace single-hop connections with multi-hop connections.

As mentioned in the IEEE 802.15.6 standard, WBAN topologies are star-shaped, so the connection between the nodes and the sink (hub) is one or two hops. Because the human body goes through different motions within short periods of time (running, walking, sitting, sleeping, ...), there is always a chance that the connection between a node and the sink breaks and the network becomes partitioned [8, 9, 10]. One solution to this problem is for nodes to increase their signal power, but, as mentioned, this causes the temperature of the nodes to rise and injures the tissues surrounding them; using multi-hop connections is therefore inevitable [11, 12, 13, 14].

Thus, for any posture and motion of the wireless nodes, replacing the single-hop link to the sink with multi-hop communication is a useful step. The ability to use multi-hop communication, and consequently the existence of multiple paths between source and destination, raises the question of along which path, and with which relay, the transmitter should send its data to the receiver. So far, many routing algorithms have been proposed to address this question, and in this article we evaluate them.

The rest of this article is organized as follows: Section 2 is devoted to the use of WBANs in the medical field. Section 3 describes the routing challenges in WBANs, and Section 4 is devoted to analyzing some well-known routing algorithms. A comparison and assessment are provided in Section 5, and the conclusion is given in Section 6.

2. Usage of WBAN in medical field

Owing to the growth of technology, the use of medical care services is undergoing a massive transformation in the health field. It is expected that WBANs will significantly change health-care systems, enabling doctors to detect illnesses with more speed and accuracy and to act with more initiative in times of crisis [2, 3]. Statistics show that more than 30% of deaths in developed countries are due to cardiovascular problems; if monitoring technology is used, this number can be greatly reduced. With a WBAN, for example, a patient's vital signs such as blood pressure, body temperature and heart rate can be monitored continuously. After measurement, the WBAN sensors can send these vital signs via an Internet-connected device, for example a cell phone, to the doctor or medical team, who can then decide what needs to be done.

Medical uses of WBANs can be divided into three categories:

1. Wearable WBANs: this clothing, or, more formally, wearable equipment, typically carries cameras, sensors for checking vital signs, sensors for communicating with a central unit, and sensors for monitoring the person's surroundings. In military use, for example, soldiers equipped with such clothing can be tracked and have their activity, fatigue, and even vital signs measured; if athletes use such clothing, they can check their medical symptoms online at will, which lowers the possibility of injuries. As another example, a patient who is allergic to certain substances or gases can be alerted before a dangerous disorder occurs and can leave the area in time.

2. WBANs placed inside the body: statistics show that in 2012, 4.6 percent of the world's population, nearly 285 million people, suffered from diabetes, and this figure is expected to reach 438 million by 2030. Research also shows that, in the absence of control of the disease, many problems, such as loss of vision, threaten the patient. Sensors and actuators embedded in the body, such as a syringe that injects a suitable dose of insulin into the patient's body when necessary, can greatly facilitate the process of controlling diabetes. Moreover, as mentioned, one of the leading causes of death worldwide is cancer, and it is predicted that in 2020 more than 15 million people will die from this disease. If implanted WBANs are used, the growth of cancer cells can be monitored, so controlling tumor growth and reducing the death toll become easily accessible.


3. Remote control of tools and medical equipment: the ability of WBAN sensors to connect to the Internet, and to form a network among tools and medical equipment, provides acceptable control of the equipment from a distance; this is called ambient assisted living (AAL). In addition to saving time, costs are greatly reduced.

3. Routing challenges in WBAN

So far, numerous routing algorithms have been presented for ad hoc networks and wireless sensor networks. WBANs are very similar to MANETs in the position and motion of the nodes; of course, the movement of the nodes in a WBAN is usually grouped. This means that all network nodes move while keeping their positions relative to one another, whereas in a MANET each node moves independently of the others. In addition, energy consumption is more restricted in WBANs, because inserting a node or replacing its battery, especially when the node is placed inside the patient's body, is much harder than replacing a node in a traditional sensor network: surgery is usually required. Hence, longevity is more important in WBANs; also, the rate of topology change and the speed of WBAN nodes are far greater than in sensor networks. Based on the above, routing protocols designed for MANETs and WSNs are not usable in WBANs. The challenges raised in WBAN networks are summarized below:

1. Body movements: the movement of nodes caused by the posture of the human body creates serious problems for providing service in a WBAN, because the quality of the communication channel between nodes is not stable, being a function of time and of changes in body posture. As a result, an appropriate routing algorithm must be able to adapt itself to a variety of changes in the network topology.

2. Temperature and interference: the temperature of a node usually increases due to its computing activities or its communication with other nodes, and this increase may damage the human body. A good routing algorithm must manage the data-sending schedule so that a particular node is not always chosen as the relay node.

3. Reduced energy consumption: a good routing algorithm must be able to use intermediate nodes as relay nodes, instead of sending data directly to a remote destination, so that it spreads the power usage across different nodes and thereby prevents the early death of any one of them.

4. Increased longevity: a good routing algorithm must select data transfer paths so that the total lifetime of the network increases.

5. Efficient communication radius: a good routing algorithm should consider an efficient communication radius for the nodes. The larger the communication range of a node, the higher its energy usage; but if the communication range of a node is very small, there is a chance that the node will lose contact with the other nodes and the network will split into several pieces. Also, if the radius of communication is very small, the number of candidate routes to the destination is reduced; this leads a node to reuse the same path, which raises the temperature of the neighboring node and increases its energy usage.

6. Finite number of hops: as mentioned before, the number of hops in the WBAN standard must be one or two. Using higher-quality channels can increase the reliability of packet delivery, but at the same time the number of hops usually increases. However, despite the restriction on the number of hops intended in the IEEE 802.15.6 standard, routing algorithms usually pay no attention to this limitation.

7. Usage in heterogeneous environments: WBANs usually consist of different sensors with different data transfer rates. Routing algorithms must be able to provide quality of service across a variety of different applications.

4. The routing algorithms in WBAN

So far, numerous routing algorithms for WBANs have been proposed, each trying to resolve a variety of the basic challenges posed in the previous section.

4.1. OFR & DOR routing algorithms

The OFR routing algorithm is the flooding algorithm used in other types of networks. As the name suggests, in this algorithm no route selection is done for sending a package between transmitter and receiver; instead, the transmitter sends a copy of the package to its neighbors, and each neighbor (basically every node in the network) forwards the package to its own neighbors after receiving it. With this method,


multiple copies of a package arrive at the receiver. The receiver keeps the first copy, which has the least delay, and discards the others. The OFR method has useful properties: for example, it has high reliability (because it uses all the potential of the network) and a small delay. But because it uses too many resources, its energy usage and the temperature it creates rise, and it has low throughput.

On the opposite side of OFR is the DOR method, whose behavior is the complete opposite. In the DOR routing algorithm, the sender transmits its data to the receiver only when a direct communication link is established between them; if the link is unavailable, the transmitter holds its data in a buffer until the link is established. In other words, there is no routing in DOR. Unlike OFR, the DOR algorithm uses few resources, but because it does not benefit from multi-hop communication it suffers long, sometimes unacceptable, delays; for this reason its only use is in networks that are not sensitive to delay. Moreover, as the distance between transmitter and receiver grows, the sender may not even be able to deliver data to the receiver at all.

The DOR and OFR algorithms are basically not practical, but thanks to their low processing overhead and other convenient features they are usually used as baselines for comparing other algorithms.

4.2. PRPLC routing algorithm

In this algorithm [15] a metric known as the link likelihood factor (LLF) is defined. Each node has the duty of calculating the LLF for its links to the sink and to other nodes and of giving this information to the other nodes. This factor indicates the quality of the channel between the transmitter and the other nodes: higher values for a link show that it is more likely to be up in the next period of time.

As you know, there is always the possibility that the quality of the channel between two nodes drops temporarily due to changes in the body and reverts to normal after a few moments. For example, assume that the communication between two nodes, one on the wrist and the other on the chest, is fine in the normal posture; when the person puts his hand behind his back, this channel is effectively disrupted. To ignore such instantaneous channel changes, the PRPLC algorithm uses a time window in calculating the LLF; in other words, the LLF considers not only the current state of the channel but also the state of the channel during the T_w time units before the current state. Obviously, the larger the value of T_w, the smaller the impact of instantaneous channel quality; by choosing T_w large enough, real-time changes in channel quality will hardly affect the LLF at all.

The quality factor of the link between nodes i and j at time slot t is denoted L_{i,j}(t); it always lies between zero (no connection) and one (full connection), and after each time slot it is updated via relation (1):

  L_{i,j}(t) = L_{i,j}(t-1) + (1 - L_{i,j}(t-1)) · w_{i,j}(t)   if the link between i and j is up in slot t
  L_{i,j}(t) = L_{i,j}(t-1) · w_{i,j}(t)                        otherwise                                  (1)

In each slot in which the link between nodes i and j is up, L_{i,j} increases at a rate set by w. As mentioned, the choice of w has a great impact on the performance of the PRPLC algorithm: the lower w is, the more slowly L_{i,j} rises while the link is connected, and the faster it decreases when the channel loses its connection. It is expected that w is set so that, for channels that have been connected for a long time, L_{i,j} decreases slowly and increases fast, and vice versa: for channels that have been cut for a long time and have poor quality, L_{i,j} increases slowly and decreases fast. In other words, w must be updated at every time slot; relation (2) shows how w is updated:

  w_{i,j}(t) = (1 / T_w) · Σ_{r = t - T_w + 1}^{t} c_{i,j}(r)                                              (2)

In this relation T_w is the length of the time window, and c_{i,j}(r) is 1 if the channel between i and j is connected in time slot r, and 0 otherwise. When node i wants to send data to node d and node j is in the neighborhood of node i, node i sends its data to node j if L_{j,d}(t) > L_{i,d}(t); in other words, since the LLF between node j and the destination is better, node i prefers to send its data to the destination via node j.
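The PRPLC bookkeeping just described can be sketched in Python: each link keeps an LLF that rises toward one while the link is up and decays while it is down, with a weight w equal to the fraction of connected slots in a sliding window, and a node forwards to the neighbor whose LLF toward the destination beats its own. The window length and the connectivity trace below are illustrative assumptions, not values from [15].

```python
from collections import deque

T_W = 8  # sliding-window length in slots (illustrative, not a value from [15])

def update_omega(history):
    """w = fraction of the last T_W slots in which the link was up."""
    return sum(history) / len(history)

def update_llf(llf, omega, connected):
    """One LLF step: rise toward 1 while connected, decay while disconnected."""
    if connected:
        return llf + (1.0 - llf) * omega
    return llf * omega

# One link that is up for 6 slots, then down for 4 (a postural disconnection).
history = deque([0] * T_W, maxlen=T_W)
llf = 0.0
trace = []
for connected in [1] * 6 + [0] * 4:
    history.append(connected)
    omega = update_omega(history)
    llf = update_llf(llf, omega, bool(connected))
    trace.append(round(llf, 3))
print(trace)  # rises while the link is up, decays once it goes down

def choose_relay(llf_to_dest, self_id, neighbors):
    """Node i forwards to neighbor j only if LLF(j, d) > LLF(i, d)."""
    best = max(neighbors, key=lambda j: llf_to_dest[j])
    return best if llf_to_dest[best] > llf_to_dest[self_id] else self_id

print(choose_relay({"i": 0.2, "j": 0.7, "k": 0.5}, "i", ["j", "k"]))  # prints j
```

With a long window (large T_W), a brief disconnection barely moves w, so the LLF of a historically good link degrades slowly, which is exactly the smoothing behavior attributed to PRPLC above.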


4.3. ETPA Routing algorithm

In [17], an energy-aware routing algorithm that simultaneously considers measured temperature and transmission power has been presented under the name ETPA. This multi-hop algorithm uses a cost function for choosing the best neighbor: the cost of each neighbor is a function of its temperature, its energy level, and the signal strength received from it. In this algorithm, to reduce interference and to eliminate the time spent listening to a channel, the TDMA method is used; in other words, each time frame is divided into N slots, where N is the number of network nodes, and each node sends in its own time slice. At the beginning of each period (which comprises four time frames), each node, say node j, sends a hello message to its neighbor nodes; each node then measures the signal power received from each neighbor and records it in a table. After the hello messages have been exchanged, each node is able to calculate the cost of sending via each neighbor, and the data is then sent through the cheapest neighbor.

Equation (3) shows how to calculate the cost of sending from node j to node i:

  C_{j,i}(t) = a · (T_i(t) / T_max) + b · (1 - E_i(t) / E_max) + c · (1 - P_{j,i} / P_max)    (3)

In this equation a, b and c are non-negative weighting factors, P_{j,i} is the power of the signal received at node i, P_max is the highest received power, E_max is the highest energy of a node (its starting energy), and T_max is the highest temperature permitted in a node. When sending, each node chooses the lowest-cost neighbor and sends the package to that node. If a suitable neighbor is not found, the transmitter saves the package in its buffer and evaluates the possibility of sending again in the next time frame. ETPA suggests that packages that remain in the buffer for more than two time frames are discarded. Simulation results show that this algorithm performs well.

4.4. BAPR routing algorithm

As we saw, the PRPLC algorithm tries to minimize the effect of instantaneous fluctuations of channel quality on its channel-quality estimation function. This way of viewing the channel has a big problem: topology changes that occur due to changes in body position do not quickly affect the channel quality measurement. In other words, although events such as a momentary blocking do not affect the channel quality factor in the PRPLC algorithm, an event that persists for a long time affects the LLF only slowly. Considering that in situations such as walking or running the posture of the body is constantly changing, the PRPLC algorithm basically loses its efficiency, because the inertia of the moving body plays no role in calculating the LLF. Inertial measurement sensors can easily collect data on the acceleration and direction of motion of the body; in addition, these sensors can detect sudden changes in body movement that can lead to sudden changes in the quality of the links.

In this algorithm it is assumed that the momentum vector of the body can be measured via inertial measurement sensors. The BAPR algorithm [16] is, in summary, a routing method that combines the information used by the existing relay-node selection algorithms (such as ETPA and PRPLC) with inertial motion data. In this algorithm, each node has a routing table whose records consist of three parts: the first part is the destination node ID, the second part is the ID of the relay node, and the third part is the connection cost, meaning the cost of the connection between the transmitter node and the relay node. Unlike the routing tables of MANET routing algorithms, a routing table in BAPR can have several records for one destination. In BAPR the relay node is chosen via the communication cost in such a way that the node with the highest cost has priority for selection. The reason for this kind of choice is that, based on the cost calculation method in BAPR, a link with a higher cost has higher reliability; since BAPR wants to improve the chance of delivering the package successfully, it chooses the relay node with the highest cost.

In BAPR, information about motion inertia and local topology is considered in calculating the cost of a connection. The cost function of this algorithm combines motion inertia data, to cover immediate changes in the network topology, with the topology history, to cover long-term changes. That is why in BAPR, when topology changes are quick, information about the movements of the body is more valuable; otherwise, the history of the link is more important.
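The ETPA neighbor-cost rule described in Section 4.3 can be made concrete with a toy sketch in Python: hotter, more depleted, and more weakly heard neighbors cost more, and the cheapest neighbor is chosen as relay. All quantities are pre-normalized to [0, 1] (i.e. T/T_max, E/E_max, P/P_max), and the equal weights and example values are illustrative assumptions rather than parameters from [17].

```python
def etpa_cost(temp, energy, rssi, a=1.0, b=1.0, c=1.0):
    """ETPA-style cost: temp, energy, rssi are normalized to [0, 1]
    (i.e. T/T_max, E/E_max, P/P_max); a, b, c are non-negative weights."""
    return a * temp + b * (1.0 - energy) + c * (1.0 - rssi)

def pick_neighbor(neighbors):
    """Return the cheapest neighbor, or None (buffer the packet) when no
    candidate exists, as ETPA does when no suitable neighbor is found."""
    if not neighbors:
        return None
    return min(neighbors,
               key=lambda n: etpa_cost(n["temp"], n["energy"], n["rssi"]))

neighbors = [
    {"id": "A", "temp": 0.9, "energy": 0.9, "rssi": 0.8},  # running hot
    {"id": "B", "temp": 0.3, "energy": 0.8, "rssi": 0.9},  # cool, charged, loud
    {"id": "C", "temp": 0.3, "energy": 0.9, "rssi": 0.5},  # weak signal
]
print(pick_neighbor(neighbors)["id"])  # prints B
```

Raising the weight a makes the algorithm more temperature-averse, which is how ETPA trades delivery cost against tissue heating.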


5. Comparison and Analysis

In this part of the article we review, analyze and evaluate the routing algorithms described in the previous section. The most important criteria for evaluating the performance of a routing algorithm in WBAN networks are longevity and energy efficiency, reliability, successful delivery rate, and packet delay. We therefore appraise the BAPR, ETPA and PRPLC algorithms in terms of these criteria, and use the OFR and DOR algorithms as baselines to measure the performance of these algorithms.

5.1. Average rate of successful delivery

As Figure 1 shows, the rate of successful message delivery in the OFR algorithm is higher than in every other algorithm, and the BAPR algorithm is in second place, a small distance behind OFR. As can be seen, the delivery rate of BAPR is 30 percent higher than that of PRPLC, which is a significant improvement.

Fig. 1. The average rate of successful delivery

5.2. Average end-to-end delay

The delay is about the same in every algorithm except DOR. Since the OFR algorithm uses the flooding method, its delay is a lower bound for routing algorithms; in other words, no routing algorithm can have less delay than OFR. On this basis, the delay of all three algorithms PRPLC, ETPA and BAPR is acceptable.

Figure 2: The average end-to-end delay

5.3. The average number of hops

The number of hops of a message, in a sense, represents the amount of resource usage. Thus, as expected, the number of hops in the OFR routing algorithm is higher than in the other algorithms, while the DOR algorithm has the minimum number of hops among them (exactly one hop). The number of hops in the BAPR algorithm is better than in PRPLC and ETPA, although there is not a large difference between BAPR and ETPA. Of course, we need to mention that the number of hops is calculated only for packages that have been delivered successfully; thus the dropping of packages in PRPLC and ETPA prevents the hop count of those algorithms from increasing.

Figure 3: Average number of hops

5.4. Other parameters

A class of routing algorithms pays no attention to the temperature generated by the nodes, which in some cases can even damage the body tissues of the patient, while ETPA pays special attention to this issue, as the temperature of the relay nodes is placed in its cost estimation function. On the other side, the BAPR routing algorithm, unlike the other named algorithms, needs equipment such as inertial measurement sensors. Although the OFR algorithm has acceptable performance most of the time, it is never used in practice, because it consumes too many network resources. Moreover, the processing overhead of ETPA and BAPR is high compared to OFR and DOR, while the PRPLC algorithm has a medium processing overhead compared to the other algorithms.

6. Conclusions

Due to the growing population and increasing life expectancy, traditional methods of treatment will no longer be efficient, because they impose heavy costs on the economy of a country. Given that prevention and care are among the simplest ways to reduce deaths and medical costs, WBAN networks have been developed for monitoring a patient's vital parameters and injecting materials needed by the patient's body at specific times. The standard created for WBANs, known as IEEE 802.15.6, suggests a star topology and one- and two-hop communication for sending data from the nodes to the sink. However, due to the changes in body position during the day, a one-hop connection from the nodes to the sink will not stay continuously connected; to solve this problem, the use of multi-hop communication has been proposed. Multi-hop communication has always gone hand in hand with the concept of routing, and for this reason much research has been done on routing algorithms in WBANs. In this article, we reviewed some of the routing algorithms proposed for WBANs, discussed their strengths and weaknesses, and finally compared them with each other.

References

[1] A. Milenkovic, C. Otto, and E. Jovanov, "Wireless sensor networks for personal health monitoring: Issues and an implementation," Computer Communications (Special Issue: Wireless Sensor Networks: Performance, Reliability, Security, and Beyond), vol. 29, pp. 2521-2533, 2006.
[2] C. Otto, A. Milenkovic, C. Sanders, and E. Jovanov, "System architecture of a wireless body area sensor network for ubiquitous health monitoring," J. Mob. Multimed., vol. 1, pp. 307-326, Jan. 2005.
[3] S. Ullah, P. Khan, N. Ullah, S. Saleem, H. Higgins, and K. Kwak, "A review of wireless body area networks for medical applications," arXiv preprint arXiv:1001.0831, 2010.
[4] M. Chen, S. Gonzalez, A. Vasilakos, H. Cao, and V. Leung, "Body area networks: A survey," Mobile Networks and Applications, vol. 16, pp. 171-193, 2011.
[5] K. Kwak, S. Ullah, and N. Ullah, "An overview of IEEE 802.15.6 standard," in 3rd International Symposium on Applied Sciences in Biomedical and Communication Technologies (ISABEL), pp. 1-6, Nov. 2010.
[6] S. Ullah, H. Higgin, M. A. Siddiqui, and K. S. Kwak, "A study of implanted and wearable body sensor networks," in Proceedings of the 2nd KES International Conference on Agent and Multi-Agent Systems: Technologies and Applications, Berlin, Heidelberg: Springer-Verlag, pp. 464-473, 2008.
[7] E. Dishman, "Inventing wellness systems for aging in place," Computer, vol. 37, pp. 34-41, May 2004.
[8] J. Xing and Y. Zhu, "A survey on body area network," in 5th International Conference on Wireless Communications, Networking and Mobile Computing (WiCom '09), pp. 1-4, Sept. 2009.
[9] S. Wang and J.-T. Park, "Modeling and analysis of multi-type failures in wireless body area networks with semi-Markov model," IEEE Communications Letters, vol. 14, pp. 6-8, Jan. 2010.
[10] K. Y. Yazdandoost and K. Sayrafian-Pour, "Channel model for body area network (BAN)," p. 91, 2009.
[11] M. Shahverdy, M. Behnami, and M. Fathy, "A new paradigm for load balancing in WMNs," International Journal of Computer Networks (IJCN), vol. 3, issue 4, p. 239, 2011.
[12] "IEEE P802.15.6/D0 draft standard for body area network," IEEE Draft, 2010.
[13] D. Lewis, "IEEE P802.15.6/D0 draft standard for body area network," 15-10-0245-06-0006, May 2010.
[14] "IEEE P802.15-10 wireless personal area networks," July 2011.
[15] M. Quwaider and S. Biswas, "Probabilistic routing in on-body sensor networks with postural disconnections," in Proceedings of the 7th ACM International Symposium on Mobility Management and Wireless Access (MobiWAC), pp. 149-158, 2009.
[16] S. Yang, J. L. Lu, F. Yang, L. Kong, W. Shu, and M. Y. Wu, "Behavior-aware probabilistic routing for wireless body area sensor networks," in Proceedings of the IEEE Global Communications Conference (GLOBECOM), Atlanta, GA, pp. 444-4449, Dec. 2013.
[17] S. Movassaghi, M. Abolhasan, and J. Lipman, "Energy efficient thermal and power aware (ETPA) routing in body area networks," in 23rd IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Sept. 2012.


An Efficient Blind Signature Scheme Based on Error Correcting Codes

Junyao Ye1, 2, Fang Ren3 , Dong Zheng3 and Kefei Chen4

1 Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200240, China [email protected]

2 School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen 333403,China [email protected]

3 National Engineering Laboratory for Wireless Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China [email protected]

4 School of Science, Hangzhou Normal University, Hangzhou 310000, China [email protected]

Abstract
Cryptography based on the theory of error correcting codes and on lattices has received wide attention in recent years. Shor's algorithm showed that, in a world where quantum computers are assumed to exist, number-theoretic cryptosystems are insecure. Therefore, it is important to design suitable, provably secure post-quantum signature schemes. Code-based public key cryptography has the property of resisting attacks by post-quantum computers. We propose a blind signature scheme based on the Niederreiter PKC in which the signature is blind to the signer. Our scheme has the same security as the Niederreiter PKC. Performance analysis shows that the blind signature scheme is correct and that it has the properties of blindness, unforgeability and non-repudiation; in addition, its efficiency is higher than that of the signature scheme based on RSA. In the near future, we will focus our research on group signatures and threshold ring signatures based on error correcting codes.
Keywords: Code-based PKC, Blind Signature, Unforgeability, Non-repudiation, Error Correcting Codes.

1. Introduction

Digital signature algorithms are among the most useful and recurring cryptographic schemes. Cryptography based on the theory of error correcting codes and on lattices has received wide attention in recent years. This is not only because of the interesting mathematical background but also because of Shor's algorithm [1], which showed that, in a world where quantum computers are assumed to exist, number-theoretic cryptosystems are insecure. Therefore, it is of utmost importance to ensure that suitable, provably secure post-quantum signature schemes are available for deployment, should quantum computers become a technological reality.

The concept of the blind signature was first proposed by Chaum et al. [2] at CRYPTO '82. In a blind signature mechanism, the user can get a valid signature without revealing the message or any relevant information to the signer; what is more, the signer is later unable to link the signature to the corresponding signing process. In 1992, Okamoto proposed a blind signature scheme [3] based on the Schnorr signature [4]. In 2001, Chien et al. [5] proposed a partial blind signature scheme based on the RSA public key cryptosystem. In 2007, Zheng Cheng et al. [6] proposed a blind signature scheme based on elliptic curves. A variety of mature blind signature schemes are used in electronic cash schemes [7]. A hash function [8] can compress a message of arbitrary length to a fixed length; a secure hash function has the properties of one-wayness and collision resistance, and is widely used in digital signatures.

There are many blind signature schemes at present, but the development of post-quantum computers poses a huge threat to them. Code-based public key cryptography can resist attacks by post-quantum algorithms. Until now, only one work [9] has been related to blind signatures based on error correcting codes. In [9], the authors proposed a conversion from signature schemes connected to coding theory into blind signature schemes, and then gave formal security reductions to combinatorial problems not connected to number theory; this was the first blind signature scheme that cannot be broken by quantum computers cryptanalyzing the underlying signature scheme with Shor's algorithms [1]. In this paper, we propose a blind signature scheme based on the Niederreiter [10] public key cryptosystem. Our scheme realizes the blindness, unforgeability and non-repudiation of a blind signature scheme; lastly, we analyze the security of our scheme.
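The hash-function properties mentioned above (arbitrary-length input compressed to a fixed-length, one-way digest) are easy to see concretely. A small illustration using SHA-256 from Python's standard library, as a stand-in for the unspecified hash function of [8]:

```python
import hashlib

# Fixed output length regardless of input length.
short = hashlib.sha256(b"m").hexdigest()
long_ = hashlib.sha256(b"m" * 1_000_000).hexdigest()
print(len(short), len(long_))  # 64 64 (hex characters, i.e. 256 bits each)

# Avalanche behavior: one flipped input bit ('e' -> 'a' differs in one bit)
# changes about half of the 256 output bits, which is what makes searching
# for preimages or collisions infeasible.
a = int(hashlib.sha256(b"message").hexdigest(), 16)
b = int(hashlib.sha256(b"massage").hexdigest(), 16)
diff_bits = bin(a ^ b).count("1")
print(diff_bits)  # typically near 128
```

This fixed-length compression is the reason hash functions are routinely applied to the message before signing in digital signature schemes.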


The remainder of this paper is organized as follows. the vector subspace , i.e. it holds that, * Section 2 discusses theoretical preliminaries for the + . presentation. Section 3 describes the digital signature, blind signature and RSA blind scheme. Section 4 describes 2.2 SDP and GDP the proposed blind signature based on Niederreiter PKC. Section 5 formally analyses the proposal scheme and A binary linear error-correcting code of length and proves that the scheme is secure and efficient. We dimension , denoted , - -code for short, is a linear

conclude in Section 6. subspace of having dimension . If its minimum distance is , it is called an , --code. An , --code

is specified by either a generator matrix or by 2. Preliminaries ( ) parity-check matrix as

We now recapitulate some essential concepts from coding * | + * | +. theory and security notions for signature schemes. The syndrome decoding problem(SDP), as well as the closely related general decoding problem(GDP), are 2.1 Coding Theory classical in coding theory and known to be NP- complete[11]. The idea is to add redundancy to the message in order to Definition 5(Syndrome decoding problem). Let r,n, and w be able to detect and correct the errors. We use an be integers, and let (H,w,s) be a triple consisting of a

encoding algorithm to add this redundancy and a decoding matrix , an integer w

algorithm to reconstruct the initial message, as is showed Does there exist a vector of weight wt(e) such in Fig1, a message of length is transformed in a message that ? of length with . Definition 6(General decoding problem). Let k,n, and w Noise be integers, and let (G,w,c) be a triple consisting of a

e matrix , an integer , and a vector .

Does there exist a vector such that ( ) c= m r Channel y=c+e ?

Fig. 1. Encoding Process: a message m is encoded as c = mG and transmitted over a noisy channel, which outputs y = c + e.

Definition 1 (Linear Code). An (n, k)-code C over GF(q) is a k-dimensional linear subspace of the linear space GF(q)^n. Elements of GF(q)^n are called words, and elements of C are called codewords. We call n the length and k the dimension of C.

Definition 2 (Hamming Distance, Weight). The Hamming distance d(x, y) between two words x, y is the number of positions in which x and y differ. That is, d(x, y) = |{i : x_i != y_i}|, where x = (x_1, ..., x_n) and y = (y_1, ..., y_n). Here, we use |S| to denote the number of elements, or cardinality, of a set S. In particular, d(x, 0) is called the Hamming weight of x, where 0 is the vector containing n 0's. The minimum distance of a linear code is the minimum Hamming distance between any two distinct codewords.

Definition 3 (Generator Matrix). A generator matrix of an (n, k)-linear code C is a k x n matrix G whose rows form a basis for the vector subspace C. We call a code systematic if it can be characterized by a generator matrix G of the form (I_k | A), where I_k is the k x k identity matrix and A is a k x (n-k) matrix.

Definition 4 (Parity-check Matrix). A parity-check matrix of an (n, k)-linear code C is an (n-k) x n matrix H whose rows form a basis of the orthogonal complement of C.

2.3 Niederreiter Public Key Cryptosystem

A dual encryption scheme is the Niederreiter [10] cryptosystem, which is equivalent in terms of security to the McEliece cryptosystem [12]. The main difference between the McEliece and Niederreiter cryptosystems lies in the description of the codes: the Niederreiter encryption scheme describes codes through parity-check matrices. Both schemes, however, have to hide any structure through a scrambling transformation and a permutation transformation. The Niederreiter cryptosystem includes three algorithms.

Key generation:
1. Choose n, k and t according to the security parameter;
2. Randomly pick a parity-check matrix H of an [n, k, 2t+1] binary Goppa code;
3. Randomly pick an n x n permutation matrix P;
4. Randomly pick an (n-k) x (n-k) invertible matrix S;
5. Calculate H' = SHP;
6. Output the public key (H', t) and the private key (S, H, P, gamma), where gamma is an efficient syndrome decoding algorithm.

A mapping algorithm phi maps bit strings to words of length n and constant weight t.

Encryption:
1. Calculate c = H' * phi(m)^T;
2. Output c.
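To make Definitions 2 and 4 and the encryption step concrete, the following sketch computes Hamming weight/distance and a syndrome H * x^T over GF(2). The [7,4] Hamming-code parity-check matrix is a standard textbook stand-in, not the Goppa code the scheme actually uses, and all names are illustrative.

```python
def hamming_weight(x):
    """wt(x) = d(x, 0): number of nonzero positions (Definition 2)."""
    return sum(1 for xi in x if xi != 0)

def hamming_distance(x, y):
    """d(x, y): number of positions in which x and y differ (Definition 2)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def syndrome(H, x):
    """H * x^T over GF(2); Niederreiter encryption computes c = H' * phi(m)^T."""
    return [sum(hij * xj for hij, xj in zip(row, x)) % 2 for row in H]

# Parity-check matrix of the [7,4] Hamming code (minimum distance 3, so t = 1)
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

e = [0, 0, 0, 0, 1, 0, 0]   # a weight-1 vector, playing the role of phi(m)
print(hamming_weight(e))    # 1
print(syndrome(H, e))       # [1, 0, 1] -- the 5th column of H
```

The syndrome is exactly the ciphertext in this toy setting: it reveals a column pattern of H, and recovering the low-weight vector from it is the syndrome decoding problem.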


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

Decryption:
1. Calculate S^(-1)c;
2. Recover phi(m) by applying the syndrome decoding algorithm gamma to S^(-1)c and then the inverse permutation P^(-1);
3. Output m = phi^(-1)(phi(m)).

The security of the Niederreiter PKC and the McEliece PKC is equivalent: an attacker who can break one is able to break the other, and vice versa [12]. In the following, by "Niederreiter PKC" we refer to the dual variant of the McEliece PKC, and by "GRS Niederreiter PKC" to the proposal by Niederreiter to use GRS codes.
The advantage of this dual variant is the smaller public key size, since it is sufficient to store the redundant part of the matrix H'. The disadvantage is that the mapping algorithm phi slows down encryption and decryption. In a setting where we only want to send random strings, this disadvantage disappears, as we can take h(e) as the random string, where e is the weight-t vector and h is a secure hash function.

3. Digital Signatures and Blind Signatures

3.1 Digital Signature

When data are transmitted through the Internet, it is better that the data are protected by a cryptosystem beforehand to prevent them from being tampered with by an illegal third party. Basically, an encrypted document is sent, and it is impossible for an unlawful party to get the contents of the message unless he obtains the sender's private key to decrypt the message. Under a mutual protocol between senders and receivers, each sender holds a private key to sign the messages he sends out, and a public key is used by the receiver to verify his sent-out messages. When the two message digests are verified to be identical, the recipient can trust the received message. Thus, the security of data transmission can be ensured.

Under a protocol among all related parties, digital signatures are used in private communication. All messages are capable of being encrypted and decrypted so as to ensure their integrity and non-repudiation. The concept of digital signatures originally comes from cryptography, and it is defined as a method in which a sender's messages are processed via a hash function to keep the messages secure when transmitted. In particular, when a one-way hash function is applied to a message, the value from which its related digital signature is generated is called a message digest. A one-way hash function is a mathematical algorithm that takes a message of any length as input but produces output of a fixed length. Because of its one-way property, it is impossible for a third party to invert it. The two phases of the digital signature process are described in the following.
1. Signing Phase: A sender first makes his message or data the input of a one-way hash function and produces the corresponding message digest as output. Second, the message digest is encrypted with the private key of the sender. Thus, the digital signature of the message is done. Finally, the sender sends his message or data along with its related digital signature to a receiver.
2. Verification Phase: Once the receiver has the message as well as the digital signature, he repeats the same process the sender performed, feeding the message into the one-way hash function to get the first message digest as output. Then he decrypts the digital signature with the sender's public key so as to get the second message digest. Finally, he verifies whether these two message digests are identical.

3.2 Blind Signature

The signer signs the requester's message and knows nothing about it; moreover, no one knows about the correspondence of the message-signature pair except the requester. A short illustration of a blind signature is described in the following.
1. Blinding Phase: A requester first chooses a random number, called a blind factor, to mask his message so that the signer will be blind to the message.
2. Signing Phase: When the signer gets the blinded message, he directly encrypts the blinded message with his private key and then sends the blind signature back to the requester.
3. Unblinding Phase: The requester uses his blind factor to recover the signer's digital signature from the blind signature.
4. Signature Verification Phase: Anyone can use the signer's public key to verify whether the signature is valid.

3.3 RSA Blind System

The first blind signature protocol, proposed by Chaum, is based on the RSA system [2]. Each requester has to randomly choose a blind factor r first and supply the blinded message alpha to the signer, where alpha = r^e * h(m) mod N. Note that N is the product of two large secret primes p and q, and e is the public key of the signer, with the corresponding secret key d such that ed = 1 mod (p-1)(q-1). The integer r is called a blind factor because the signer will be blind to the message after the computation of alpha. On receiving alpha, the signer makes a signature on it directly, computing t = alpha^d mod N, and then returns the signed message t to the requester. The requester strips the signature to yield an untraceable signature s, where s = t * r^(-1) mod N, and announces the pair (m, s). Finally, anyone can use the signer's public key to verify whether the signature is valid by checking that the formula s^e = h(m) mod N holds.
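The Chaum blind-signature flow just described can be exercised end to end. This is a sketch with insecurely small toy primes and a stand-in hash (a real deployment needs large primes and a full-domain hash); the concrete values of p, q, e, r and the hash construction are our own choices, not from the paper.

```python
import hashlib
from math import gcd

# Toy RSA key (insecurely small primes, for illustration only)
p, q = 61, 53
N = p * q                   # 3233
phi = (p - 1) * (q - 1)     # 3120
e = 17                      # public exponent, gcd(e, phi) = 1
d = pow(e, -1, phi)         # secret exponent: e*d = 1 (mod phi)

def h(message: bytes) -> int:
    """Stand-in hash reduced mod N (a real scheme needs a full-domain hash)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

m = b"blind me"
r = 2025                    # blind factor, must satisfy gcd(r, N) = 1
assert gcd(r, N) == 1

alpha = (pow(r, e, N) * h(m)) % N   # 1. blinding:   alpha = r^e * h(m) mod N
t = pow(alpha, d, N)                # 2. signing:    t = alpha^d mod N (signer sees only alpha)
s = (t * pow(r, -1, N)) % N         # 3. unblinding: s = t * r^(-1) mod N
assert pow(s, e, N) == h(m)         # 4. verifying:  s^e = h(m) (mod N)
print("signature verifies")
```

The unblinding works because t = (r^e * h(m))^d = r * h(m)^d mod N, so dividing by r leaves exactly the plain RSA signature h(m)^d that the signer never saw.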


4. Proposed Blind Signature Scheme

4.1 Initialization Phase

We randomly choose an irreducible polynomial g(x) of degree t over the finite field GF(2^m), and we get an irreducible Goppa code Gamma(L, g). The generator matrix G of the Goppa code is of order k x n, and the corresponding parity-check matrix H is of order (n-k) x n. We then choose an invertible matrix S of order (n-k) x (n-k) and a permutation matrix P of order n x n. Let H' = SHP. The private key is (S, H, P), and the public key is (H', t).

4.2 Proposed Blind Signature Scheme

There are two parties in the proposed blind signature scheme: the requester, who wants the signature of a message, and the signer, who can sign the message into a signature. Before the message is signed, the requester has to hash it in order to hide its information.
1. Hash Phase: Assume the message m is an n-dimensional sequence, denoted m = (m_1, ..., m_n). We use a selected secure hash function h, for example MD5, to obtain the message digest h(m).
2. Blinding Phase: The requester randomly chooses an invertible matrix B as the blinding factor and computes the blinded message B(m) from h(m) and B. He then sends the blinded message B(m) to the signer.
3. Signing Phase: After the signer has received the blinded message B(m), he computes the signature s' on it with his private key (S, H, P), then sends the signature s' to the requester.
4. Unblinding Phase: After the requester has received the signature s', he uses the invertible matrix B to recover the signature, applying B^(-1) to s'. The result s is the real signature of the message digest h(m).
5. Verification Phase: Anyone can verify whether the signature is valid by computing with the public key H' of the signer: if the public-key operation applied to s yields h(m), then s is the valid blind signature of the message m; otherwise, reject.

5. Performance Analysis

5.1 Correctness

If the requester and the signer execute the process according to the above protocol, then the signature is the correct signature of message m signed by the signer, and anyone can verify the correctness: substituting H' = SHP and the unblinding step into the verification equation reduces the verification value to h(m), so s is the valid signature of the message m.

5.2 Security Analysis

The blind signature scheme is based on the Niederreiter PKC, so the security of the proposed signature scheme reduces to the security of the Niederreiter PKC. Several methods have been proposed for attacking McEliece's system, [13], [14], etc. Among them, the best attack with the least complexity is to repeatedly select k bits at random from the n-bit ciphertext vector c to form c_k, in the hope that none of the selected k bits is in error. If there is no error in them, then c_k * G_k^(-1) is equal to m, where G_k is the matrix obtained by choosing k columns of G according to the same selection as for c_k. If anyone could decompose the public key H', he would obtain S, H and P, and the blind signature scheme would therefore be broken. However, there are too many ways of decomposing H': the numbers of possible matrices S, H and P each grow combinatorially with the parameters [15]. When n and t are large, it is impossible to enumerate them, so the decomposition attack is infeasible.
At present, the most efficient method of attacking the Niederreiter PKC is solving linear equations. Under such an attack, the work factor grows exponentially with the code parameters; for the parameter sizes considered here, the work factor of the Niederreiter PKC is prohibitively large, so we consider the Niederreiter PKC secure enough. That is to say, the blind signature scheme is secure because it is based on the Niederreiter PKC; they have the same security.

5.3 Blindness

The blind factor B is chosen randomly by the requester; only the requester knows B, and others cannot obtain B in any other way. The blinding step computes B(m) from h(m) and B. Because of the privacy and randomness of B, the message underlying the blinded message B(m) is unknown to the signer.

5.4 Unforgeability

From the signature process, we can see that no one else can forge the signer's signature. If someone wants to forge the signer's signature, he must first get the exact blinded message B(m) from the requester and then forge a signature on it.
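Both the blindness and unforgeability arguments rest on the blind factor B being invertible over GF(2): multiplying the digest by B hides it from the signer, and multiplying by B^(-1) recovers it exactly. The following self-contained sketch demonstrates only this blind/unblind pair; the 4-bit digest and the matrix B are toy values of our own, and the signing step itself (which needs the Goppa-code private key) is omitted.

```python
def gf2_inverse(B):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(B)
    # Augment [B | I]
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(B)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])  # raises if singular
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gf2_matvec(v, M):
    """Row vector v times matrix M over GF(2)."""
    return [sum(vi * M[i][j] for i, vi in enumerate(v)) % 2 for j in range(len(M[0]))]

digest = [1, 0, 1, 1]            # toy h(m)
B = [[1, 1, 0, 0],               # toy invertible blind factor over GF(2)
     [0, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]

blinded = gf2_matvec(digest, B)  # what the signer sees
recovered = gf2_matvec(blinded, gf2_inverse(B))
assert recovered == digest
print(blinded, recovered)
```

Without B (kept secret by the requester), the signer sees only a uniformly scrambled digest, which is the blindness property argued above.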


In order to forge a signature, the adversary encounters two handicaps. One is the blind factor B, which is random and secret; only the requester knows B. The other problem is that even if the adversary obtains the blinded message B(m), he does not know the private key (S, H, P) of the signer, so it is impossible for him to forge a signature satisfying the verification equation. The requester himself cannot forge the signer's signature either: in the first step we use the hash function to hash the message and get h(m), and the hash function is not invertible.

5.5 Non-repudiation

The signature of the signer is produced with his private key (S, H, P). No one else can obtain his private key, so, at any time, the signer cannot deny his signature.

5.6 Untraceability

After the signature-message pair (m, s) is published, even though the signer holds the signing transcripts, he cannot link the published pair to any particular blinded message B(m) or blind signature s'; that is to say, he cannot trace the original message m.

5.7 Compared with RSA

We compare the blind signature time between RSA and our scheme, as shown in Fig. 2 (Signature Time). We compare four different situations, in which the length of the plaintext is 128 bits, 256 bits, 512 bits and 1024 bits. From Fig. 2 we can draw the conclusion that the signature time of our scheme is smaller than the signature time of the RSA scheme. So, our blind signature scheme based on the Niederreiter PKC is very efficient.

6. Conclusions

We propose a blind signature scheme based on the Niederreiter PKC. First, we use a hash function to hash the message and get the message digest h(m); then we randomly select an invertible matrix B as the blind factor to blind h(m) and obtain the blinded message B(m). After the signer has received B(m), he signs B(m) with his private key. The requester then unblinds what he receives and gets the signature. By constructing the invertible matrix cleverly, we can ensure that the signature is correct and verifiable. Through the performance analysis, the blind signature scheme is correct, and it has the characteristics of blindness, unforgeability and non-repudiation. The security of our scheme is the same as the security of the Niederreiter PKC; in addition, its efficiency is higher than that of the signature scheme based on RSA. Code-based cryptography can resist attacks by post-quantum computers, so the scheme is very applicable. In the near future, we will focus our research on group signatures and threshold ring signatures based on error-correcting codes.

Acknowledgments

We are grateful to the anonymous referees for their invaluable suggestions. This work is supported by the National Natural Science Foundation of China (No. 61472472). This work is also supported by the JiangXi Education Department (Nos. GJJ14650 and GJJ14642).

References
[1] P. W. Shor. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Comput., 26(5):1484-1509, 1997.
[2] D. Chaum. Blind Signatures for Untraceable Payments. Advances in Cryptology: Proceedings of Crypto 1982, Heidelberg: Springer-Verlag, 1982: 199-203.
[3] T. Okamoto. Provably Secure and Practical Identification Schemes and Corresponding Signature Schemes. CRYPTO '92, 1992: 31-52.
[4] C. P. Schnorr. Efficient Identification and Signatures for Smart Cards. In Advances in Cryptology - CRYPTO '89, LNCS, pages 239-252. Springer, 1989.
[5] H. Y. Chien, J. K. Jan, and Y. M. Tseng. RSA-Based Partially Blind Signature with Low Computation. IEEE 8th International Conference on Parallel and Distributed Systems. Kyongju: IEEE Computer Society, 2001: 385-389.
[6] Zheng Cheng, Guiming Wei, Haiyan Sun. Design of Blind Signature Based on Elliptic Curve. Chongqing University of Posts and Telecommunications, 2007, (1): 234-239.
[7] T. Okamoto. An Efficient Divisible Electronic Cash Scheme. In CRYPTO, pages 438-451, 1995.
[8] I. Damgard. A Design Principle for Hash Functions. CRYPTO '89, LNCS 435, 416-427.
[9] R. Overbeck. A Step Towards QC Blind Signatures. IACR Cryptology ePrint Archive 2009: 102 (2009).


[10] H. Niederreiter. Knapsack-Type Cryptosystems and Algebraic Coding Theory. Problems of Control and Information Theory, 1986, 15(2): 159-166.
[11] E. Berlekamp, R. McEliece, and H. van Tilborg. On the Inherent Intractability of Certain Coding Problems. IEEE Transactions on Information Theory, IT-24(3), 1978.
[12] Y. Li, R. Deng, and X. Wang. On the Equivalence of McEliece's and Niederreiter's Public-Key Cryptosystems. IEEE Transactions on Information Theory, Vol. 40, pp. 271-273, 1994.
[13] T. R. N. Rao and K.-H. Nam. Private-Key Algebraic-Coded Cryptosystems. Proc. Crypto '86, pp. 35-48, Aug. 1986.
[14] C. M. Adams and H. Meijer. Security-Related Comments Regarding McEliece's Public-Key Cryptosystem. Proc. Crypto '87, Aug. 1987.
[15] P. J. Lee and E. F. Brickell. An Observation on the Security of McEliece's Public-Key Cryptosystem. Lecture Notes in Computer Science, 330: 275-280, 1988.

Junyao Ye is a Ph.D. student in Department of Computer Science and Engineering, Shanghai JiaoTong University, China. His research interests include information security and code-based cryptography.

Fang Ren received his M.S. degree in mathematics from Northwest University, Xi’an, China, in 2007. He received his Ph.D. degree in cryptography from Xidian University, Xi’an, China, in 2012. His research interests include Cryptography, Information Security, Space Information Networks and Internet of Things.

Dong Zheng received his Ph.D. degree in 1999. From 1999 to 2012, he was a professor in Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. Currently, he is a Distinguished Professor in National Engineering Laboratory for Wireless Security, Xi’an University of Posts and Telecommunications, China. His research interests include subliminal channel, LFSR, code-based systems and other new cryptographic technology.

Kefei Chen received his Ph.D. degree from Justus Liebig University Giessen, Germany, in 1994. From 1996 to 2013, he was a professor in Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. Currently, he is a Distinguished Professor in School of Science, Hangzhou Normal university, China. His research interests include cryptography and network security.


Multi-lingual and -modal Applications in the Semantic Web: the example of Ambient Assisted Living

Dimitra Anastasiou

Media Informatics and Multimedia Systems, Department of Computing Science, University of Oldenburg, 26121 Oldenburg, Germany [email protected]

Abstract
Applications of the Semantic Web (SW) are often related only to written text, neglecting other interaction modalities and a large portion of multimedia content that is available on the Web today. Processing and analysis of speech, hand and body gestures, gaze, and haptics have been the focus of research in human-human interactions and have started to gain ground in human-computer interaction in the last years. Web 4.0 or the Intelligent Web, which follows Web 3.0, takes these modalities into account. This paper examines challenges that we currently face in developing multi-lingual and -modal applications and focuses on some current and future Web application domains, particularly on Ambient Assisted Living.
Keywords: Ambient Assisted Living, Multimodality, Multilinguality, Ontologies, Semantic Web.

1. Introduction

Ambient Assisted Living (AAL) promotes intelligent assistant systems for a better, healthier, and safer life in the preferred living environments through the use of Information and Communication Technologies (ICT). AAL systems aim to support elderly users in their everyday life using mobile, wearable, and pervasive technologies. However, a general problem of AAL is the digital divide: many senior citizens and people with physical and cognitive disabilities are not familiar with computers and accordingly the Web. In order to meet the needs of its target group, AAL systems require natural interaction through multilingual and multimodal applications. Already in 1991, Krüger [1] said that "natural interaction" means voice and gesture. Another current issue in bringing AAL systems to market is interoperability, i.e. integrating heterogeneous components from different vendors into assistance services. In this article, we will show that Semantic Web (SW) technologies can go beyond written text and can be applied to design intelligent smart devices or objects for AAL, like a TV or a wardrobe.
More than 10 years ago, Lu et al. [2] provided a review about web services, agent-based distributed computing, semantics-based web search engines, and semantics-based digital libraries. The challenges of the SW at that time were the development of ontologies, formal semantics of SW languages, and trust and proof models. Zhong et al. [3] were in search of the "Wisdom Web" and Web Intelligence, where "the next-generation Web will help people achieve better ways of living, working, playing, and learning." The challenges described in [2] have now been sufficiently addressed, whereas the vision presented in [3] has not yet gained ground. d'Aquin et al. [4] presented the long-term goal of developing the SW into a large-scale enabling infrastructure for both data integration and a new generation of applications with intelligent behavior. They added that some of the requirements of an application with large-scale semantics are to exploit heterogeneous knowledge sources and to combine ontologies and resources. In our opinion, multimedia data belong to such heterogeneous sources. The intelligent behavior of next-generation applications can already be found in some new research fields, such as AAL and the Internet of Things.
Many Web applications nowadays offer user interaction in different modalities (haptics, eye gaze, hand, arm and finger gestures, body posture, voice tone); a few examples are presented here. Wachs et al. [5] developed GESTIX, a hand gesture tool for browsing medical images in an operating room. As for gesture recognition, Wachs et al. [6] pointed out that no single method for automatic hand gesture recognition is suitable for every application; each algorithm depends on the user's cultural background, application domain, and environment. For example, an entertainment system does not need the gesture-recognition accuracy required of a surgical system. An application based on eye gaze and head pose in an e-learning environment was developed by Asteriadis et al. [7]. Their system extracts the degree of interest and engagement of students reading documents on a computer screen. Asteriadis et al. [7] stated that eye gaze can also be used as an indicator of selection, e.g. of a particular exhibit in a museum, or a dress in a shop window, and may assist or replace mouse and keyboard interfaces in the presence of severe handicaps.
This survey paper presents related work on multilingual and multimodal applications within the field of the Semantic


Web. We discuss challenges of developing such applications, such as Web accessibility for senior people. This paper is laid out as follows: in Sect. 2 we present how the multilingual and multimodal Web of Data is envisioned. Sect. 3 presents the challenges of developing multi-lingual and -modal applications. In Sect. 4 we look at some current innovative applications, including Wearable Computing, the Internet of Things, and Pervasive Computing. The domain of AAL and its connection with the SW and Web 4.0 is presented in detail along with some scenarios in Sect. 5. Finally, we summarize the paper in Sect. 6.

2. Multi-linguality and -modality in the Semantic Web

Most SW applications are based on ontologies; regarding multilingual support in ontologies, W3C recommends in the OWL Web Ontology Language Use Cases and Requirements [8] that the language should support the use of multilingual character sets. The envisioned impact of the Multilingual Semantic Web (MSW) is a multilingual "data network" where users can access information regardless of the natural language they speak or the natural language the information was originally published in (Gracia et al. [9]). Gracia et al. [9] envision the multilingual Web of Data as a layer of services and resources on top of the existing Linked Data infrastructure adding multilinguality in:
i) linguistic information for data and vocabularies in different languages (meaning labels in multiple languages and morphological information);
ii) mappings between data with labels in different languages (semantic relationships or translation between lexical entries);
iii) services to generate, localize, link, and access Linked Data in different languages.
Other principles, methods, and applications towards the MSW are presented by Buitelaar and Cimiano [10].
As far as multimodality is concerned, with the development of digital photography and social networks, it has become a standard practice to create and share multimedia digital content. Lu et al. [5] stated that this trend towards multimedia digital libraries requires interdisciplinary research in the areas of image processing, computer vision, information retrieval, and database management. Traditional content-based multimedia retrieval techniques often describe images/videos based on low-level features (such as color, texture, and shape), but their retrieval is not satisfactory. Here the so-called Semantic Gap becomes relevant, defined by Smeulders et al. [11] as a "lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data has for a user in a given situation." Besides, multimodality may refer to multimodal devices, like PC, mobile phone, or PDA. In this paper, though, by "multimodality" we refer to multimodal input/output:
i) Multimodal input by human users (in)to Web applications, including modalities like speech, body gestures, touch, eye gaze, etc.; for processing purposes, this input involves recognition of these modalities;
ii) Multimodal output by Web applications to human users; this involves face tracking, speech synthesis, and gesture generation. Multimodal output can be found in browser-based applications, e.g. gestures performed by virtual animated agents, but it is even more realistic when performed by pervasive applications, such as robots.

2.1 Breaking the digital divide: heterogeneous target group

Apart from the so-called "computer-literate" people, there are people who do not have the skills, the abilities, or the knowledge to use computers and accordingly the Web. The term "computer literacy" came into use in the mid-1970s and usually refers to basic keyboard skills, plus a working knowledge of how computer systems operate and of the general ways in which computers can be used [12]. The senior population was largely bypassed by the first wave of computer technology; however, they find it more and more necessary to be able to use computers (Seals et al. [13]). In addition to people with physical or cognitive disabilities, people with temporary impairments (e.g. a broken arm) or young children often cannot use computers efficiently. All the above groups benefit from interaction with multimodal systems, where recognition of gesture, voice, eye gaze or a combination of modalities is implemented. For "computer-literate" people, multimodality brings additional advantages, like naturalness, intuitiveness, and user-friendliness. To give some examples, senior people with Parkinson's disease have difficulties controlling the mouse, so they prefer speech; deaf-mute people are dependent on gesture, specifically sign language. Sign language, as with any natural language, is based on a fully systematic and conventionalized language system. Moreover, the selection of the modality, e.g. speech or gesture, can also be context-dependent. In a domestic environment, when a person has a tray in their hands, (s)he might use speech to open the door. Thus, as the target group of the Web is very heterogeneous, current and future applications should be context-sensitive, personalized, and adaptive to the target's skills and preferences.


2.2 Multimodal applications in the Semantic Web

Historically, the first multimodal system was the "Put that there" technique developed by Bolt [14], which allowed the user to manipulate objects through speech and manual pointing. Oviatt et al. [15] stated that real multimodal applications range from map-based and virtual reality systems for simulation and training, over field medic systems for mobile use in noisy environments, through to Web-based transactions and standard text-editing applications. One type of multimodal application is the multimodal dialogue system. Such systems are applicable both in desktop and Web applications, but also in pervasive systems, such as in the car or at home (see the AAL scenarios in 5.2.2). Smartkom [16] is such a system; it features speech input with prosodic analysis, gesture input via infrared camera, and recognition of facial expressions and emotional states. On the output side, the system features a gesturing and speaking life-like character together with displayed generated text and multimedia graphical output. Smartkom provides full "symmetric multimodality", defined by Wahlster [17] as the possibility that all input modes are also available for output, and vice versa. Another multimodal dialogue system is VoiceApp, developed by Griol et al. [18]. All applications in this system can be accessed multimodally using traditional GUIs and/or by means of voice commands. Thus, the results are accessible to motor-handicapped and visually impaired users and are easier to access by any user on small hand-held devices, where GUIs are in some cases difficult to employ.
He et al. [19] developed a dialogue system called Semantic Restaurant Finder that is both multimodal and semantically rich. Users can interact through speech, typing, or mouse clicking and drawing to query restaurant information. SW services are used, so that restaurant information in different cities/countries/languages can be constructed, as ontologies allow the information to be sharable.
Apart from dialogue systems, many web-based systems are multimodal. In the assistive domain, a portal that offers access to products is EASTIN [20]. It has a multilingual (users can forward information requests, and receive results, in their native language) and multimodal (offering a speech channel) front-end for end-users. Thurmair [20] tested the usability of the portal and found that most people preferred to use free text search.

3. Challenges in developing multi-lingual and -modal applications

In this section we discuss some challenges for multi-lingual and -modal applications from a development perspective. A basic challenge and requirement of the future Web is to provide Web accessibility to everybody, bearing in mind the heterogeneous target group. Web accessibility means making the content of a website available to everyone, including the elderly and people with physical or cognitive disabilities. According to a United Nations report [21], 97% of websites fail to meet the most basic requirements for accessibility, e.g. by using units of measurement (such as pixels instead of percentages) which restrict the flexibility of the page layout, the font size, or both. Today, worldwide, 650 million people have a disability, and approximately 46 million of these are located in the EU. By 2015, 20% of the EU will be over 65 years of age; the number of people aged 60 or over will double in the next 30 years, and the number aged 80 or over will increase by 10% by 2050. These statistics highlight the timeliness and importance of the need to make the Web accessible to more senior or impaired people. W3C has published a literature review [22] related to the use of the Web by older people, looking for intersections and differences between the accessibility guidelines and recommendations for web design and development issues that will improve accessibility for older people. W3C has a Web Accessibility Initiative [23], which has released accessibility guidelines, categorized into:
i) Web Content: predictable and navigable content;
ii) User Agents: access to all content, user control of how content is rendered, and standard programming interfaces, to enable interaction with assistive technologies;
iii) Authoring Tools: HTML/XML editors, tools that produce multimedia, and blogs.
Benjamins et al. [24] stated that the major challenges of SW applications in general concern: (i) the availability of content, (ii) ontology availability, development and evolution, (iii) scalability, (iv) multilinguality, (v) visualization to reduce information overload, and (vi) stability of SW languages. As far as multilinguality is concerned, they state that any SW approach should provide facilities to access information in several languages, allowing the creation of and access to SW content independently of the native language of content providers and users. Multilinguality plays an important role at various levels [24]:
i) Ontologies: WordNet, EuroWordNet, etc., might be explored to support multilinguality;

1 www.eastin.eu, 10/09/14


ii) Annotations: proper support is needed that allows providers to annotate content in their native language;
iii) User interface: internationalization and localization techniques should make Web content accessible in several languages.

As far as the challenges related to multimodal applications are concerned, He et al. [19] pointed out that existing multimodal systems are highly domain-specific and do not allow information to be shared across different providers. In relation to the SW, Avrithis et al. [25] stated that there is a lot of literature on multimodality in the domains of entertainment, security, teaching, or technical documentation; however, the understanding of the semantics of such data sources is very limited. Regarding the combination of modalities, Potamianos & Perakakis [26], among other authors, stated that multimodal interfaces pose two fundamental challenges: the combination of multiple input modalities, known as the fusion problem, and the combination of multiple presentation media, known as the fission problem. Atrey et al. [27] provided a survey of multimodal fusion for multimedia analysis. They made several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). Another challenge of multimodal systems is low recognition accuracy. Oviatt & Cohen [28], comparing GUIs with multimodal systems, stated that, whereas input to GUIs is atomic and certain, machine perception of human input, such as speech and gesture, is uncertain, so any recognition-based system's interpretations are probabilistic. This means that events such as object selection, which were formerly basic events in a GUI (pointing at an object by touching it), are subject to misinterpretation in multimodal systems. They see the challenge for system developers in creating robust new time-sensitive architectures that support human communication patterns and performance, including processing users' parallel input and managing the uncertainty of recognition-based technologies.

Apart from the above challenges, an additional challenge is twofold: i) developing multi-lingual and -modal applications in parallel, and ii) tying them to a language-enhanced SW. Today there are not many applications that combine multiple modalities as input and/or output and support many natural languages at the same time. Cross [29] states that current multimodal applications typically provide user interaction in only a single language. When a software architect wants to provide user interaction in more than one language, they often write a separate multimodal application for each language and provide a menu interface that lets the user select the language they prefer. The drawback is that having multiple versions of the same multimodal application in various languages increases complexity, which leads to an increased error rate and additional costs.

4. Current domain applications

In recent years the usage of the Web has shifted from desktop applications and home offices to smart devices at home, in entertainment, in the car, or in the medical domain. Some of the latest computing paradigms are the following:

• Wearable computing is concerned with miniature electronic devices that are worn on the body or woven into clothing and access the Web, resulting in intelligent clothing. A commercial product is the MYO armband by Thalmic Labs 1, with which users can control presentations, video content, and games, browse the Web, create music, edit videos, etc. MYO detects gestures and movements in two ways: 1) muscle activity and 2) motion sensing. The recent Apple Watch 2 is designed around simple gestures, such as zooming and panning, but also senses force (Force Touch). Moreover, a heart rate sensor in the Apple Watch can help improve overall calorie tracking.

1 https://www.thalmic.com/myo/, 08/06/15
2 https://www.apple.com/watch/, 08/06/15

• The Internet of Things (IoT) refers to uniquely identifiable objects and their virtual representations in an Internet structure. Atzori et al. [30] stressed that the IoT shall be the result of the convergence of three visions: things-oriented, Internet-oriented, and semantic-oriented. Smart semantic middleware, reasoning over data, and semantic execution environments belong to the semantic-oriented vision. A recent survey of the IoT from an industrial perspective was published by Perera et al. [31], who stated that "despite the advances in HCI, most of the IoT solutions have only employed traditional computer screen-based techniques. Only a few IoT solutions really allow voice or object-based direct communications." They also see a trend in smart home products towards increasingly touch-based interactions.

• Pervasive context-aware systems: pervasive/ubiquitous computing means that information processing is integrated into everyday objects and activities. Henricksen et al. [32] explored the characteristics of context in pervasive systems: it exhibits temporal characteristics, has many alternative representations, and is highly interrelated. Chen et al. [33] developed the Context Broker Architecture, a broker agent that maintains a shared model of context for all computing entities in the space and enforces the privacy policies defined by the users when sharing their contextual information. They believe that a requirement for realizing context-aware systems is the ability to understand their situational conditions, which requires contextual information to be represented in ways that are adequate for machine processing and reasoning.



Chen et al. [33] believe that SW languages are well suited for this purpose, for the following reasons: i) RDF and OWL have rich expressive power that is adequate for modeling various types of contextual information; ii) context ontologies have explicit representations of semantics, so systems with the ability to reason about context can detect inconsistent context knowledge (resulting from imperfect sensing); iii) SW languages can be used as meta-languages to define other special-purpose languages, such as communication languages for knowledge sharing.

• Location-based services and positioning systems: positioning systems have a mechanism for determining the location of an object in space, with sub-millimeter to meter accuracy. Coronato et al. [34] developed a service to locate mobile entities (people/devices) at any time in order to provide sets of services and information with different modalities of presentation and interaction.

• Semantic Sensor Web: Sheth et al. [35] proposed the Semantic Sensor Web (SSW), where sensor data are annotated with semantic metadata that increase interoperability and provide contextual information essential for situational knowledge. The SSW is an answer to the lack of integration and communication between networks, which often isolates important data streams [35].

• Ambient Assisted Living (AAL): AAL is a research domain that promotes intelligent assistive systems for a better, healthier, and safer life in the preferred living environment through the use of Information and Communication Technologies (ICT). More information on AAL is provided in the next section.

5. Ambient Assisted Living

The aging population phenomenon is the primary motivation of AAL research. From a commercial perspective, AAL is rich in terms of technology (from tele-health systems to robotics) but also in terms of stakeholders (from service providers to policy makers, including core technology or platform developers) (Jacquet et al. [36]). The programme AAL JP [37] is a funding initiative that aims to create a better quality of life for older adults and to strengthen industrial opportunities in Europe through the use of ICT. In the next sections we discuss the connection between AAL and the SW and Web 4.0, the reason why ontologies play a role in AAL, and the way AAL is realized, along with two scenarios.

5.1 Ambient Assisted Living and Web 3.0 – Web 4.0

According to Eichelberg & Lipprandt [38], the typical features of AAL systems, as standard interactive systems, are adaptivity, individualization, self-configuration, and learning aptitude. These features have traditionally been achieved with methods developed within the field of Artificial Intelligence (AI). However, they believe that the Internet has been the driving force for the further development of those methods, and they mention the problems of Web services: i) integration and evaluation of data sources, like ambient technical devices; ii) cooperation between services of different kinds, such as device services; and iii) interoperability of the above-mentioned services. These problems are very similar to the interoperability issues arising between AAL system components; hence Eichelberg & Lipprandt [38] state that the success of AAL systems is tightly coupled with the progress of Semantic Web technologies. The goal of using SW technologies in AAL is to create interoperability between heterogeneous devices (products from different vendors) and/or IT services, to promote cooperation between AAL systems and the emergence of innovative business models [38].

Web 4.0, the so-called Intelligent/Symbiotic Web, follows Web 2.0 (the Social Web) and Web 3.0 (the Semantic Web) and is about knowledge-based linking of services and devices. It is about intelligent objects and environments, intelligent services, and intelligent products. Thus there is a tight connection between Web 4.0 and AAL, since AAL is realized in smart and intelligent environments. In such environments, intelligent things and services are available, such as sensors that monitor the well-being of users and transfer the data to caregivers, robots that drive users to their preferred destination, TVs that can be controlled through gestures, etc.

Aghaei et al. [39] point out that Web 4.0 will be a linked Web that communicates with humans in a similar manner to how humans communicate with each other, for example by taking the role of a personal assistant. They believe that it will be possible to build more powerful interfaces, such as mind-controlled interfaces. Murugesan [40] stated that Web 4.0 will harness the power of human and machine intelligence on a ubiquitous Web in which both people and computers not only interact but also reason and assist each other in smart ways. Moreover, Web 4.0 is characterized by so-called ambient findability. Google allows users to search the Web and their desktop, and this concept can be extended to the physical world: physical objects, such as wallets and documents, but even people or animals, can be tagged with the mobile phone. Users can then use Google to see which objects have been tagged, and Google can also locate the objects for the user. For this concept, RFID-like technology, GPS, and mobile phone "tricorders" are needed. Here, too, the connection between findability and AAL is present, as smart objects with RFID are an important component of AAL and the IoT (see the scenarios in 5.2.2).



To sum up, Web 4.0 can support AAL by linking intelligent things and services through Web technology keyed to sensors, like RFID and GPS. However, according to Eichelberg & Lipprandt [38], significant progress in semantic technologies is still needed before AAL systems are integrated into the era of Web 4.0: for instance, tools for the collaborative development of formal semantic knowledge representations; the integration of domain experts and standardization; ontology matching, reasoning, and evaluation; as well as semantic sensor networks and "Semantic Enterprise" methods for the migration of IT processes into linked systems. Information on how ontologies are related to AAL is given in 5.2.1. An example project which combines sensor networks is SHIP (Semantic Heterogeneous Integration of Processes) [41]. It combines separate devices, components, and sensors to yield one coherent, intelligent, and complete system. The key concept of SHIP is a semantic model, which brings together the data of the physical environment and the separate devices to be integrated. One application domain of SHIP is the Bremen Ambient Assisted Living Lab 1, where heterogeneous services and devices are combined into integrated assistants.

1 www.baall.org, 05/09/13

5.2 Realization and evaluation of AAL

AAL is primarily realized in domestic environments, i.e. the houses of senior people. Homes equipped with AAL technology are called smart homes. Moreover, AAL systems can be applied in hospitals and nursing homes. Generally speaking, the objective of AAL systems is to increase the quality of life of the elderly and to maintain their well-being and independence. However, achieving these outcomes requires the involvement of third parties (e.g. caregivers, family) through remote AAL services. Nehmer et al. [42] distinguished three types of remote AAL services: emergency treatment, autonomy enhancement, and comfort services.

The projects funded by the AAL JP programme cover solutions for the prevention and management of chronic conditions of the elderly, the advancement of social interaction, participation in the self-serve society, and the advancement of mobility and (self-)management of daily life activities of the elderly at home. Thus AAL is multifaceted, with specific sub-objectives depending on the kind of application to be developed.

Regarding the involvement of end users in AAL, the project A2E2 [43] involves users in several phases of the project, including focus groups, pilots, and an effectiveness study. Three groups are used: elderly clients, care professionals, and care researchers. Users are interviewed to find out which requirements they have for the particular interface to be developed. In the project CARE [44], which develops a fall detector, more than 200 end users in Austria, Finland, Germany, and Hungary were questioned regarding the need for a fall detector; they answered that the current fall detectors (wearable systems) are not satisfactory and do not have high acceptance in the independent-living context. Generally speaking, then, end users are involved in current AAL-related projects either through answering questionnaires or through participating in user studies. Their involvement covers the analysis of technical achievements/requirements of the developed product, the acceptance and usability of the prototypes, and often also ergonomic, cognitive, and psychological aspects.

As for the adoption of AAL systems by end users, this depends on various aspects, such as an application's obtrusiveness and the willingness of users. In many AAL systems, bio-sensors (activity, blood pressure, and weight sensors) are employed to monitor the health conditions of the users. Sensors/cameras are placed in the home, so that the seniors' activities are monitored and shared with informal carers, families, and friends. The assisted have to decide whether their well-being should be monitored in order to avoid undesired situations; at the same time, the technology should be kept as unobtrusive as possible, so that it preserves dignity and maintains privacy and confidentiality. Weber [45] stated that an adequate legal framework must take the technology of the IoT into account and would be established by an international legislator, supplemented by the private sector according to specific needs.

Sun et al. [46] referred to some other challenges of AAL systems: i) the dynamics of service availability and ii) service mapping. The Service-Oriented Architecture, which supports the connection of various services, tackles the dynamicity problem. For service mapping, ontology libraries are required to describe the services precisely. There should be a so-called "mutual assistance community", where a smart home is managed by a local coordinator to build up a safe environment around the assisted people, so that the elderly find themselves with a more active living attitude [46].

5.2.1 Ontologies and AAL

AAL applications are trans-disciplinary, because they mix automatic control with modeling of user behavior. Thus the ability to reuse knowledge and to integrate several knowledge domains is particularly important [36]. Furthermore, AAL is a very open and changing field, so extensibility is key. In addition, an AAL environment requires a standard way of exchanging knowledge between software and hardware devices. Therefore Jacquet et al. [36] believe that ontologies are well adapted to these needs:
i) they are extensible to take into account new applications;
ii) they provide a standard infrastructure for sharing knowledge;



iii) semantic relationships, such as equivalence, may be expressed between various knowledge sources, thus permitting easy integration.

Jacquet et al. [36] presented a framework in which ontologies enable the expression of users' preferences in order to personalize the system behavior: it stores preferences and contains application-specific modules. Another ontology-centered design is used in the SOPRANO Ambient Middleware (SAM) [47]. SAM receives user commands and sensor inputs, enriches them semantically, and triggers appropriate reactions via actuators in a smart home. The ontology is used as a blueprint for the internal data models of the components, for communication between components, and for communication between the technical system and the typically non-technical user.

In AAL there is often a problem of disambiguation between complex situations and simple sensor events. For example, if the person does not react to a doorbell ring, it may indicate that they have a serious problem, or alternatively it may indicate that they are unavailable, e.g. taking a bath [48]. Therefore, Muñoz et al. [48] proposed an AAL system based on a multi-agent architecture responsible for analyzing the data produced by different types of sensors and inferring which contexts can be associated with the monitored person. SW ontologies are adopted to model sensor events and the person's context. The agents use rules defined on these ontologies to infer information about the current context. In case the agents discover inconsistent contexts, argumentation techniques are used to disambiguate the situation by comparing the arguments that each agent creates. In their ontology, concepts represent rooms, home elements, and sensors, along with relationships among them.

Furthermore, Hois [49] designed different modularized spatial ontologies applied to an AAL application. This application has different types of information to define: (1) architectural building elements (walls), (2) functional information of room types (kitchen) and assistive devices (temperature sensors), (3) types of user actions (cooking), (4) types of furniture or devices inside the apartment and their conditions (whether the stove is in use), and (5) requirements and constraints of the AAL system (temperature regulations). Hois [49] designed different, but related, ontologies to manage this heterogeneous information. Their interactions determine the system's characteristics and the way it identifies potentially abnormal situations, implemented as ontological query answering in order to monitor the situation in concrete contexts.

Last but not least, in the project OASIS one of the challenges was to achieve interoperability spanning complex services in the areas of Independent Living, Autonomous Mobility, and Homes and Workplaces, including AAL. Due to the diversity of types of services, Bateman et al. [50] suggested the support of cross-domain networked ontologies. They developed the OASIS Common Ontological Framework, a knowledge representation paradigm that provides: (i) methodological principles for developing interoperable ontologies, (ii) a "hyper-ontology" that facilitates formal semantic interoperability across ontologies, and (iii) an appropriate software infrastructure for supporting heterogeneous ontologies.

5.2.2 AAL Scenarios

Two AAL scenarios will now be presented that demonstrate how multi-lingual and -modal applications, coupled with the SW and Web 4.0, can improve the quality of life of senior citizens:

• Scenario 1: John is 70 years old and lives alone in a smart home equipped with intelligent, height-adaptable devices. He has just woken up and wants to put on his clothes. His wardrobe suggests that he wear brown trousers and a blue pullover. Then he goes to the supermarket for his daily shopping. He comes back and puts his purchased products into the fridge. The fridge registers the products. Then he wants to take a rest and watch TV. He lies on the bed; the bed is set to his favourite position, with the headrest and footrest slightly raised. While he was at the supermarket, his daughter called him. The TV informs him about this missed call. Then he wants to cook his favourite meal; he goes to the kitchen, and the kitchen reminds him of the recipe, going through all the steps. The next day, when he goes to the supermarket again, his mobile reminds him that he has to buy milk.

• Scenario 2: Svetlana is from Ukraine and lives together with Maria, 85 years old and from England, in Maria's smart home. Svetlana is caring staff, i.e. she cooks, cleans, helps with the shopping, etc. Svetlana does not speak English very well; thus she speaks Ukrainian to Maria, but also to the electronic devices (TV, oven, etc.), and Maria hears it back in English. As an alternative to voice commands, they can control the devices through a GUI, or through haptics on the devices where this is available.

The above scenarios have many hardware and software requirements, some of which are currently under development in the project SyncReal 1 at the University of Bremen; we will study these scenarios in more detail in the subsequent paragraphs.

1 http://www.syncreal.de, 25/06/2013

• Intelligent devices: these are the wardrobe, fridge, cupboards, bed, and TV. The clothes in the wardrobe are marked with RFID tags (IoT – Web 4.0), and the wardrobe can suggest to the user what to wear through



these RFID tags and motion sensors (Beins [51]). This is useful, among other benefits, for people with memory deficits or visual impairments. It can also remind people to wash clothes if there are not many clothes left in the wardrobe. Similarly, the fridge and the cupboards register all products that are placed in and taken out, by storing them in a database. This information is then transferred to other devices, such as mobile phones, so that John can see the next day that there is no more milk in the fridge (Voigt [52]). The bed is set automatically to a specific height position every time he wants to watch TV (context-sensitive).

• Semantic integration of ambient assistance: John can see the missed call of his daughter on the TV owing to formal semantic modeling and open standards; this semantic interoperability allows the integration of the telephone with the TV.

• Speech-to-speech dialogue system: the language barrier between Svetlana and Maria is not a problem thanks to the speech-to-speech technology implemented in the home system. It includes three technologies: i) speech recognition, ii) machine translation, and iii) speech synthesis; the advantages and drawbacks of all three technologies have to be balanced. The dialogue system is also multimodal, giving the possibility to interact through a GUI, voice commands, or haptics. It can be applied not only in electronic appliances but also in robots. Information about speech-to-speech translation in AAL can be found in Anastasiou [53].

6. Summary and Conclusion

In this paper we focused on multimodal applications of the SW and presented some challenges involved in the development of multi-lingual and -modal applications. We provided some examples of current and future application domains, focusing on AAL. As there are large individual differences in people's abilities and preferences with regard to different interaction modes, multi-lingual and -modal interfaces will increase the accessibility of information through ICT for users of different ages, skill levels, cognitive styles, sensory and motor impairments, or native languages. ICT and SW applications are rapidly gaining ground in everyday life and are available to a broader range of everyday users and usage contexts. Thus the needs and preferences of many users should be taken into account in the development of future applications. High customization and personalization of applications is needed, both because the limitations of challenged people can vary significantly and change constantly, and in order to minimize the learning effort and cognitive load.

AAL can efficiently combine multimodality and SW applications in the future to increase the quality of life of the elderly and challenged people. The AAL market is changing and is expected to boom in the next few years as a result of demographic developments and R&D investment by industries and stakeholders. Currently the ICT for AAL is very expensive; projects test AAL prototypes in living labs so that they can be applied in domestic environments in the future. The technology is still often obtrusive (motion sensors), although researchers are working towards the goal of invisible technology. In addition, the data are often "noisy", as they are based on fuzzy techniques, probabilistic systems, or Markov-based models. Generally speaking, with regard to the future of intelligent ambient technologies, not only intelligent devices (Web 4.0) but also semantic interoperability between devices and IT services (Web 3.0) is necessary. In our opinion, as emphasized by the term "semantic", the SW should be context-sensitive, situation-adaptive, negotiating, clarifying, meaningful, and action-triggering. All these aspects are important both for SW-based dialogue systems and for multimodal interfaces with various input and output modalities. We share the opinion of O'Grady et al. [54], in their vision of evolutionary AAL systems, on the necessity for an adaptive (robust and adapting in real time), open (not proprietary), scalable (integrating additional hardware sensors), and intuitive (supporting many interaction modalities) software platform that incorporates autonomic and intelligent techniques.

References
[1] Krueger, M.W., Artificial Reality, 2nd Ed., Addison-Wesley, Redwood City, CA, 1991.
[2] Lu, S., Dong, M., Fotouhi, F., "The Semantic Web: opportunities and challenges for next-generation Web applications", Information Research, 2002, 7 (4).
[3] Zhong, N., Liu, J., Yao, Y., "In search of the wisdom web", Computer, 2002, 35 (11), pp. 27-31.
[4] D'Aquin, M., Motta, E., Sabou, M., Angeletou, S., Gridinoc, L., Lopez, V., Guidi, D., "Toward a New Generation of Semantic Web Applications", IEEE Intelligent Systems, 2008, pp. 20-28.
[5] Wachs, J., Stern, H., Edan, Y., Gillam, M., Feied, C., Smith, M., Handler, J., "A hand-gesture sterile tool for browsing MRI images in the OR", Journal of the American Medical Informatics Association, 2008, 15, pp. 321-323.
[6] Wachs, J.P., Kölsch, M., Stern, H., Edan, Y., "Vision-based hand-gesture applications", Communications of the ACM, 2011, 54 (2), pp. 60-71.
[7] Asteriadis, S., Tzouveli, P., Karpouzis, K., Kollias, S., "Estimation of behavioral user state based on eye gaze and head pose: application in an e-learning environment", Multimedia Tools and Applications, 2009, 41, pp. 469-493.
[8] OWL Web Ontology Language Use Cases and Requirements: http://www.w3.org/TR/webont-req/. Accessed June 2015
[9] Gracia, J., Montiel-Ponsoda, E., Cimiano, P. et al., "Challenges for the multilingual Web of Data", Journal of Web Semantics, 2012, 11, pp. 63-71.



[10] Buitelaar, P., Cimiano, P. (Eds.), Towards the Multilingual Semantic Web: Principles, Methods and Applications, Springer, 2014.
[11] Smeulders, A., Worring, M., Gupta, A., Jain, R., "Content-Based Image Retrieval at the End of the Early Years", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22 (12), pp. 1349-1380.
[12] Computerized Manufacturing Automation: Employment, Education, and the Workplace, Washington, D.C., U.S. Congress, Office of Technology Assessment, OTA-CIT-235, 1984.
[13] Seals, C.D., Clanton, K., Agarwal, R., Doswell, F., Thomas, C.M., "Lifelong Learning: Becoming Computer Savvy at a Later Age", Educational Gerontology, 2008, 34 (12), pp. 1055-1069.
[14] Bolt, R.A., "Put-that-there: Voice and gesture at the graphics interface", ACM Computer Graphics, 1980, 14 (3), pp. 262-270.
[15] Oviatt, S.L., Cohen, P.R., Wu, L. et al., "Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions for 2000 and beyond", in Carroll, J. (Ed.), Human-Computer Interaction in the New Millennium, 2000, 15 (4), pp. 263-322.
[16] Wahlster, W., Reithinger, N., Blocher, A., "SmartKom: Multimodal communication with a life-like character", in Proceedings of the 7th European Conference on Speech Communication and Technology, 2001, pp. 1547-1550.
[17] Wahlster, W., "Towards Symmetric Multimodality: Fusion and Fission of Speech, Gesture and Facial Expression", in Günter, A., Kruse, R., Neumann, B. (Eds.), KI 2003: Advances in Artificial Intelligence. Proceedings of the 26th German Conference on Artificial Intelligence, 2003, pp. 1-18.
[18] Griol, D., Molina, J.M., Corrales, V., "The VoiceApp System: Speech Technologies to Access the Semantic Web", in Advances in Artificial Intelligence, 2011, pp. 393-402.
[19] He, Y., Quan, T., Hui, S.C., "A multimodal restaurant finder for semantic web", in Proceedings of the 4th International Conference on Computing and Telecommunication Technologies, 2007.
[20] Thurmair, G., "Searching with ontologies – searching in ontologies: Multilingual search in the Assistive Technology domain", in Towards the Multilingual Semantic Web, 2013.
[21] United Nations Open Audit of Web Accessibility: http://www.un.org/esa/socdev/enable/documents/fnomensarep.pdf
[22] Web Accessibility for Older Users: A Literature Review: http://www.w3.org/TR/wai-age-literature/. Accessed 25 Aug 2013
[23] W3C Web Accessibility Initiative (WAI): http://www.w3.org/WAI/
[24] Benjamins, V.R., Contreras, J., Corcho, O., Gómez-Pérez, A., "Six Challenges for the Semantic Web", in KR2002 Semantic Web Workshop, 2002.
[25] Avrithis, Y., O'Connor, N.E., Staab, S., Troncy, R., "Introduction to the special issue on semantic multimedia", Multimedia Tools and Applications, 2008, 39, pp. 143-147.
[26] Potamianos, A., Perakakis, M., "Human-computer interfaces to multimedia content: a review", in Maragos, P., Potamianos, A., Gros, P. (Eds.), Multimodal Processing and Interaction: Audio, Video, Text, 2008, pp. 49-90.
[27] Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S., "Multimodal fusion for multimedia analysis: a survey", Multimedia Systems, 2010, 16, pp. 345-379.
[28] Oviatt, S., Cohen, P., "Multimodal interfaces that process what comes naturally", Communications of the ACM, 2000, 43 (3), pp. 45-52.
[29] Cross, C.W., Supporting multi-lingual user interaction with a multimodal application, Patent Application Publication, United States, Pub. No. US/2008/0235027, 2008.
[30] Atzori, L., Iera, A., Morabito, G., "The Internet of Things: A survey", Computer Networks, 2010, 54, pp. 2787-2805.
[31] Perera, C., Liu, C.H., Jayawardena, S., Chen, M., "A Survey on Internet of Things From Industrial Market Perspective", IEEE Access, 2015, 2, pp. 1660-1679.
[32] Henricksen, K., Indulska, J., Rakotonirainy, A., "Modeling context information in pervasive computing systems", in Proceedings of the 1st International Conference on Pervasive Computing, 2002, pp. 167-180.
[33] Chen, H., Finin, T., Joshi, A., "Semantic Web in the Context Broker Architecture", in Proceedings of the Second IEEE Annual Conference on Pervasive Computing and Communications, 2004, pp. 277-286.
[34] Coronato, A., Esposito, M., De Pietro, G., "A multimodal semantic location service for intelligent environments: an application for Smart Hospitals", Personal and Ubiquitous Computing, 2009, 13 (7), pp. 527-538.
[35] Sheth, A., Henson, C., Sahoo, S., "Semantic Sensor Web", IEEE Internet Computing, 2008, pp. 78-83.
[36] Jacquet, C., Mohamed, A., Mateos, M. et al., "An Ambient Assisted Living Framework Supporting Personalization Based on Ontologies", in Proceedings of the 2nd International Conference on Ambient Computing, Applications, Services and Technologies, 2012, pp. 12-18.
[37] Ambient Assisted Living Joint Programme (AAL JP): http://www.aal-europe.eu/. Accessed 5 May 2015
[38] Eichelberg, M., Lipprandt, M. (Eds.), Leitfaden interoperable Assistenzsysteme – vom Szenario zur Anforderung [Guide to interoperable assistance systems: from scenario to requirement], Part 2 of the publication series "Interoperabilität von AAL-Systemkomponenten" [Interoperability of AAL system components], VDE-Verlag, 2013.
[39] Aghaei, S., Nematbakhsh, M.A., Farsani, H.K., "Evolution of the World Wide Web: From Web 1.0 to Web 4.0", International Journal of Web & Semantic Technology, 2012, 3 (1), pp. 1-10.
[40] Murugesan, S., "Web X.0: A Road Map", in Handbook of Research on Web 2.0, 3.0, and X.0: Technologies, Business, and Social Applications, Information Science Reference, 2010, pp. 1-11.
[41] Autexier, S., Hutter, D., Stahl, C., "An Implementation, Execution and Simulation Platform for Processes in Heterogeneous Smart Environments", in Proceedings of the 4th International Joint Conference on Ambient Intelligence, 2013.
[42] Nehmer, J., Becker, M., Karshmer, A., Lamm, R., "Living assistance systems: an ambient intelligence approach", in Proceedings of the 28th International Conference on Software Engineering, ACM, New York, NY, USA, 2006, pp. 43-50.


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

Dr. Dimitra Anastasiou finished her PhD in 2010, within five years, on the topic of "Machine Translation" at Saarland University, Germany. She then worked for two years as a post-doc in the project "Centre for Next Generation Localisation" at the University of Limerick, Ireland. There she designed guidelines for localisation and internationalisation as well as file formats for metadata, led the CNGL metadata group, and was a member of the XML Localisation Interchange File Format (XLIFF) Technical Committee. In the following two years she continued with the project "SFB/TR8 Spatial Cognition" at the University of Bremen, Germany. Her research focused on multimodal and multilingual assistive environments and the improvement of dialogue systems in relation to assisted living environments. She ran user studies in the Bremen Ambient Assisted Living Lab (BAALL) with participants interacting with intelligent devices and a wheelchair/robot, and did a comparative analysis of cross-lingual spatial spoken and gesture commands. Currently she is working at the University of Oldenburg, Germany, in the DFG project SOCIAL, which aims at facilitating spontaneous and informal communication in spatially distributed groups by exploiting smart environments and ambient intelligence. In 2015 she was awarded a Marie Curie Individual Fellowship grant on the topic of Tangible User Interfaces. In recent years she has supervised numerous BA, MA and PhD students. In total she has published a book (PhD version), 17 journal/magazine papers, and 32 papers in conference and workshop proceedings, and she is editor of 6 workshop proceedings. In addition, she is a member of 13 programme committees for conferences and journals (such as the Journal of Information Science and the Computer Standards and Interfaces journal). She has 6 years of teaching experience, mainly in the field of Computational Linguistics.


An Empirical Method to Derive Principles, Categories, and Evaluation Criteria of Differentiated Services in an Enterprise

Vikas S Shah1

1Wipro Technologies, Connected Enterprise Services East Brunswick, NJ 08816, USA [email protected]

Abstract
Enterprises are leveraging the flexibility as well as the consistency offered by traditional service oriented architecture (SOA). The primary reason to apply SOA is its ability to standardize the way separation of concerns is formulated and combined to meet the requirements of business processes (BPs). Many accredited research efforts have proven the advantages of separating concerns along one or more functional architectures such as application, data, platform, and infrastructure. However, not much attention has been paid to streamlining the approach when differentiating composite services derived from the granular services identified for functional architectures. The purpose of this effort is to provide an empirical method to rationalize differentiated services (DSs) in an enterprise. The preliminary contribution is to provide abstract principles and categories of DS compositions. Furthermore, the paper presents an approach to evaluate the velocity of an enterprise and a corresponding index formulation to continuously monitor the maintainability of DSs.
Keywords: Business Process (BP) Activities, Differentiated Services (DSs), Enterprise Entities, Maintainability, Requirements, and Velocity of an Enterprise.

1. Introduction

Traditionally, services of SOA are composed to associate enterprise entities and their corresponding operations with business process (BP) activities. The concept of DSs is fairly novel: it introduces the level of variation necessary to accommodate all the potential scenarios that are required to be included within the diversified business processes [4] and [7]. DSs are services with similar functional characteristics, but with additional capabilities, different service quality, different interaction paths, or different outcomes [5]. DSs provide the ability to capture precise interconnectivity and subsequently the integration between BPs and the operations of enterprise entities [12].

Typically, BP association with enterprise entities begins with assessments of the goals and objectives of the events required to accomplish the BP requirements. After modeling, BPs are implemented and consequently deployed to the platform of choice in an enterprise. DSs have the built-in ability to considerably amend the BPs' associations to their activities and to reorganize based on either changes to existing BP requirements or new ones [5] and [19]. This allows accommodating the desired level of alteration and the respective associations in the BPs across the enterprise by combining the capabilities of more granular services or nested operations.

DSs deliver the framework to place and update BPs as well as other important capabilities for monitoring and managing an enterprise. They offer enterprises accelerated time-to-market, increased productivity and quality, reduced risk and project costs, and improved visibility. Enterprises often underestimate the amount of change required to adopt the concept of DSs. [15], [16], and [17] indicate that DSs are usually architected, updated, and built based on ongoing changes in the enterprise. For example, a newly introduced digital electric meter product will be added to the product database, and the service to "capture the meter data remotely" gets updated explicitly and in composition with the data service to formalize the capabilities of the new product. Primary concerns, such as the update to the data service and the behavior of the digital electric meter during an outage, are not addressed, or are realized only at later stages when the specific event occurs pertaining to the specific BP.

Consequently, the entire purpose of DSs and their association with the enterprise entities is misled. It indicates that feasibility analysis and navigation of the complex cross-functional changes of BPs associated with the enterprise entities are essential before updating DSs. The analysis presented in this paper identifies core characteristics of DSs and their association with the modeled BPs of an enterprise. The paper presents an approach to rationalize the relationship between the DSs and the desired variability in BP activities. The goal is to streamline and evaluate the association between the BP requirements and baseline criteria to incorporate them into DSs. It sets the principles, categories, and evaluation criteria for DSs to retain the contexts and characteristics of DSs in an enterprise during various levels of updates.
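The definition above — services sharing functional characteristics but differing in capabilities, service quality, or outcomes — can be sketched in code. This is an illustrative sketch only: the class, the service names, and the SLA field are assumptions, not artifacts of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceProfile:
    """Hypothetical descriptor for a service in a DS composition."""
    name: str
    operation: str                      # shared functional characteristic
    capabilities: set = field(default_factory=set)
    response_sla_seconds: float = 5.0   # proxy for service quality

def differentiates(a: ServiceProfile, b: ServiceProfile) -> bool:
    """Two services are differentiated when they share a functional
    operation but differ in capabilities or quality (SLA)."""
    return a.operation == b.operation and (
        a.capabilities != b.capabilities
        or a.response_sla_seconds != b.response_sla_seconds
    )

# Same "generate_invoice" operation, different capability set and SLA.
basic = ServiceProfile("invoice_basic", "generate_invoice",
                       {"itemized_total"}, response_sla_seconds=5.0)
premium = ServiceProfile("invoice_premium", "generate_invoice",
                         {"itemized_total", "discounting"},
                         response_sla_seconds=3.0)

print(differentiates(basic, premium))  # True
```

The "generate invoice" operation reappears later in the paper's own examples, which is why it is used here as the shared functional characteristic.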


In Section 2, the primary concerns of DSs and the corresponding review of past research efforts are presented. Section 3 provides a methodology to institute DSs in an enterprise and derives preliminary principles. Identified meta-level categories of DSs are enumerated in Section 4; the classification of DSs is based on their characteristics as well as their anticipated behavior. Section 5 presents the evaluation method for the velocity of change in an enterprise considering 7 different BPs. Section 6 proposes and derives practical criteria to indicate the maintainability of DSs depending on their classification. Section 7 presents conclusions and future work.

2. Literature Reviews and Primary Concerns of Introducing DSs

DSs deployed to BPs assist businesses in making decisions to manage the enterprise. Using a combination of BP activities, associated metrics, and benchmarks, organizations can identify the enterprise entities that are most in need of improvement. There has been an increasing adaptation of BPs to derive granular-level principles for an enterprise in recent years [2], [18], [22] and [29]. The Open Group Architectural Framework (TOGAF) [31] reserves business architecture as one of the initial phases to define BPs. The Supply Chain Council's Supply Chain Operations Reference model (SCOR), the TeleManagement Forum's Enhanced Telecom Operations Map (eTOM), and the Value Chain Group's Value Reference Model (VRM) are prominent examples of specifying BPs.

However, the widely accepted enterprise architecture (EA) and other frameworks [27] and [2] do not address the complexities of implementing the desired variability in BPs and the corresponding BP activities. They are highly deficient in specifying the synergies of DSs to BPs in an enterprise. BP management suite providers also offer either inherent SOA and EA capabilities or third-party integration adapters [8]. As specified in [3], [6], and [11], this is primarily to eliminate the friction between BPM, anticipated variations in services, and enterprise architecture modeling. The most prevalent examples are the Oracle SOA suite [24], the Red Hat JBoss BPM and Fuse products, the OpenText BPM suite, the IBM BPM suite [21], and Tibco software, as indicated in [8]. BP management suites are still struggling to achieve their enterprise potential as best practices to implement and update DSs.

BP requirements are usually grouped to formulate the future state of an enterprise. These requirements drive the vision and guide the decisions to introduce DSs. Various research efforts [20], [23], and [33] indicate that decisions are, in one way or another, related to the following criteria.

 Existing product or service offerings and their enhancements, support, and maintenance. For example, DSs associated with the online payment BP have to consider the product subscribed to or in use by the customer.
 New products or services that will enhance revenue or gain new market share in the current or near-term timeframe. The most prominent DS example is replacing the electric meter with the smart meter for a specific set of customers.
 Innovation related to future trends and competition: product and service offerings that require immediate development but will not contribute to revenue until outlying years. Prospect search and surveys to investigate interest in advanced smart grid products are examples.
 Exit strategies for existing product or service offerings: proactively determining the end of life of products or services. In many cases, previous products and services either need to be discontinued or advanced significantly. The foremost example is the videocassette recorder.

The result of the decision process is a set of principles and key value propositions that provides differentiation and competitive advantages. Various attempts have been made either in specific use cases [34] or in abstract framework standardization [32] and [25]. Rationalized principles have a much longer life span. These principles are a direct or indirect reflection of how the uncertainties of an enterprise are attended to. The principles should consider all the levels as well as categories of uncertainty identified or evaluated during the BP activities. In [14], three types of uncertainty are illustrated with examples.

 State uncertainty relates to the unpredictability of whether or when a certain change may occur. An example of state uncertainty is the initiation of the outage process (by the utility corporation providing outage-to-restoration services).
 Effect uncertainty relates to the inability to predict the nature of the impact of a change. During an outage due to unforeseen weather conditions, it is absolutely unpredictable which locations or areas will be impacted.
 Response uncertainty is defined as a lack of knowledge of response options and/or an inability to predict the consequences of a response choice. Generally, a utility provider has guidelines for restoration during outages; however, restoration is unpredictable in situations that have never been faced before, such as undermined breaks in the circuits.


A DS needs to implement these uncertainties either by proactively initiating a change or by reactively responding to the change. The conclusion of various studies [9], [10], [18], and [22] indicates that the first step to consistently implement and update DSs is to define principles. These principles govern maintaining DSs in correlation with the enterprise entities and the advancements of BP activities.

3. Deriving Principles of DSs

The analysis of primary concerns and literature reviews illustrated in Section 2 justifies that the method for deriving principles of DSs should fundamentally focus on the BP requirements, the identified and placed BP activities, and the interdependencies between events of BP activities. The BP requirements have to be reviewed to certify their legitimacy and candidature for diversification to form the DSs' specification. Figure 1 presents the sequence of steps performed to identify principles of DSs in an enterprise and to architect DSs in adherence to BP requirements.

BP Requirements and Initiation: The first step is to validate the alignment of BP requirements with the business and goals of an enterprise. Stage 0 (initiation) is defined to reiterate and evaluate BP requirements at each phase (or step). When an ambiguity is identified in a BP requirement at any step, due to the responsibilities associated with that step, Stage 0 is initiated. Stage ACN is defined to analyze business impact, conflict of interest (if any exists), and notification across the enterprise.

Discovering and Assessing Architecture Artifacts: When an enterprise receives alterations or new BP requirements, it needs to assess the impact in terms of the other architectures associated with the enterprise (BP architecture, integration architecture, and system architecture). The responsibility of this step is to identify the need for introducing or updating architecture artifacts based on the process map (that is, the association of services to the BPs or their activities). Primarily, it is accountable for identifying whether any sublevel BPs (within existing BPs) and any additional BP activities are required to be introduced. The need for additional sublevel BPs or BP activities may be due either to critical-to-major advancements in BP requirements or to changes necessary to other associated architecture artifacts (including the integration and system architectures).

The other major responsibility of this step is to check the availability of services for diversification based on the BP requirements. It is also liable for specifying the desired level of updates and the interdependencies with the enterprise entities associated with the services (DSs or other types).

Defining and Evolving Service Architecture: This is the primary step to define, update, version, and deploy DSs. The DS gets evolved and advanced, accommodating the desired level of diversification identified in the previous step. The responsibilities of this step also include evaluating the potential uncertainties and the alternate paths that need to be derived in adherence to the identified uncertainties.

The decision whether to introduce an additional DS, an additional operation to existing DSs, or changes to the operations of existing DSs has to be achieved during this step. Modeling to map DSs to BP activities and streamlining their implementation are also part of this phase of a DSs-enabled enterprise.

[Figure 1 appears here: a flowchart of the method's stages (Stage 0, Stage ACN), approval and iteration decision points, and architect roles (business process, integration, and systems architects), organized in four swimlanes: BP Requirements and Initiation; Discovering and Assessing Architecture Artifacts; Defining and Evolving Service Architecture; and Associating Service Administration Paradigms.]

Fig. 1 Steps to identify principles of DSs and architect DSs in an enterprise.


Associating Service Administration Paradigms: Specifying and resolving the interdependencies of DSs with the participant enterprise entities are the responsibilities of this step. It needs to ensure that DSs adhere to the availability of the enterprise resources and their defined Service Level Agreements (SLAs). Configuration, monitoring, and supporting DSs in association with enterprise entities (including any failure condition or resolution to uncertainties) are also the accountability of this step, in order to derive principles of DSs in an enterprise and provide informed architecture decisions for DSs.

The following principles are derived to identify, specify, develop, and deploy DSs in an enterprise based on the steps necessary to achieve BP requirements. Each step identified in Figure 1 reveals and constitutes the foundation for deriving the principles of DSs in relationship with BP requirements.

 Specification of the DS's operation into information that can be utilized in BPs in the context of concrete activities. The most prominent example: the BP activity "generate invoice" needs a DS that retrieves and combines the information of the purchased products and their current pricing.
 Deterministic specification of the relationship between BP activities and enterprise entities in the DS. In the "generate invoice" example, if any discount has to be applied, then it needs to be in correlation with the pricing of the product.
 Precisely define the BP activity's events that can be emulated, monitored, and optimized through the DS. The BP activity "generate invoice" request needs to be validated before retrieving the other related information.
 Impact of people, processes, and product (or service) offerings as metadata associated with the DS. The BP activity "generate invoice" can only be initiated by a specific role associated with an employee (for example, manager) or triggered by another activity such as "completed order".
 Specify and govern the SLAs of a DS in the context of the associated BP activity. "The invoice should be generated within 3 seconds of completing the order" is an example of an SLA.
 Regularly place and evaluate governance paradigms for the DS in association with the BP activity to address uncertainties. The BP activities "cancel order" or "return an item (or product)" can occur after the invoice has been generated. If those activities are not defined, and the capabilities for updating, canceling, or revising invoices are not defined, then they need to be introduced.

4. Identified Categories of DSs

Due to the increasing availability and development of SOA and BP platforms [26] and [28], services are being characterized in numerous different aspects. The foremost utilized classification methodology is by functional architecture type, such as platform services, data services, application services, and infrastructure services. Another approach is to classify industry-segment-specific services, such as healthcare services, utility services, and payment services. Certain enterprises are also inclined to introduce custom classifications of services due to the unavailability of standards as well as for rationalization.

The identified principles of DSs indicate that DSs are required to react to the set of events associated with BP activities. DSs are independently built or composited utilizing one or more types of services placed in an enterprise. DSs need to be categorized such that each type can be streamlined based on its characteristics and governed based on the type of SLAs associated with it. The following is the list of identified categories of DSs based on their characteristics.

Competency Services: DSs that participate to satisfy one or more competencies of the core business offerings are categorized as competency services. Certain features between different versions of the same product line are generic and essential; however, some features need to be distinguished in the DS.

Relationship Services: DSs presenting the external and internal relationships of the enterprise entities with the roles associated with the entities, such as customer, partner, and supplier. An example of such a DS: the relationship of an order with a customer differs from that with a vendor, and the corresponding action needs to differ in the operations of the DS.

Collaboration Services: Any DS offering collaboration among varied enterprise entities and BP activities is considered a participant of the collaborative service category. A calendar request to schedule a product review meeting is the type of collaborative service where participants can be reviewer, moderator, or optional.

Common Services: When an enterprise gains maturity, it needs standardized audit, log, and monitoring capabilities. These standardized DSs fall into the category of common services. They are built to be utilized consistently across multiple sets of BP activities with a specific objective to monitor. Generating the invoice and the amount paid for an order are different BP activities; however, the number of items purchased is the same and is required to be monitored as well as verified between the BP activities.


Framework Services: The framework services are to increase awareness of the enterprise's technology architecture capabilities. A DS built to search metadata associated with application services, data services, platform services, or infrastructure services is an example of a framework service. The DSs differ in terms of what type of metadata can be searched for which kind of service.

Governance Services: DSs deployed to ensure the policies and practices are the governance services. Most diversifications of the security-related services, including role-based entitlement, are participants of governance services.

Organizational Services: Organization culture has various impacts on the BP activities. DSs that offer a common understanding of the organization culture as well as corporate processes are the organizational services. Ordering and utilizing office supplies for different departments is an example of an organizational service. In this example, the DS differs in terms of the accessibility of types of supplies to the particular department.

Strategic Services: DSs that participate in making decisions that impact the strategic direction and corporate goals are categorized as strategic services. Financial-analysis-based selection of marketing segments and budgeting based on available statistics of annual spending are types of strategic services.

Conditional Services: Certain BP activities require special attention and business logic dedicated to a particular condition. The DSs built, updated, and maintained to accommodate such scenarios are subject to this classification. A credit card with special privileges for purchases over the allocated limit is an example of such DSs.

Automation Services: They are the services defined and utilized to introduce the desired level of automation, yielding additional business value for new or existing BP activities. Typically, automation-related services require stronger bonding and maturity at the BP activities. A service to send an email notification for approval versus the service for online approval is the classical example of such DSs.

DSs can be associated with multiple categories. However, an alias to the DS is utilized for the secondary category, such that it can be independently monitored and audited. Optional common header elements (or metadata) of DSs are introduced to capture the runtime metrics for the DSs. The following is the additional information that DSs provide at runtime for further evaluation.

 Instance identification of the DS.
 Category of the DS.
 BP name and activity utilizing the DS.
 Registered consumer group and associated role using the DS.
 Service's probability of failure (recursively identified from the audit logs).

5. Evaluating Velocity of an Enterprise

The experimental evaluation is based on a set of 62 DSs out of 304 services (which include functional architecture type services as well as industry-segment-specific services besides the dedicated DSs). The services are built in the Oracle SOA suite [24], which has internal capabilities to map and generate relationships with BP activities. Four iterations of development, updates, and deployment have been conducted for the following 7 BPs. The BP activities and DSs are derived based on the severity of the BP requirements.

BP# 1: Customer enrollment and registration
BP# 2: Manage customer information, inquiry, and history
BP# 3: Purchase order
BP# 4: Payment processing and account receivables
BP# 5: Invoicing
BP# 6: Notification and acceptance of terms
BP# 7: Account management

The velocity of the enterprise is a representation of the rapid changes and updates necessary to achieve the BP requirements. The changes can be achieved through updating or introducing either DS operations, DSs, BP activities, or sublevel BPs. Correspondingly, the velocity is based on four types of ratios, as specified below. The ratios are a representation of the level of change necessary to achieve the goals of a BP requirement.

 DSs' Ratio (DSR) = (Additional composite services / Total number of services)
 DS Operations' Ratio (OPR) = (Additional accumulative number of DS operations / Accumulative number of DS operations)
 BP Activities' Ratio (AR) = (Additional BP activities / Total number of BP activities)
 Sublevel BPs' Ratio (SBR) = (Additional sublevel BPs / Total number of sublevel BPs)

The velocity evaluation presented in Eq. (1) also introduces an impact factor corresponding to each ratio, that is, c (critical), h (high), m (medium), and l (low). The assigned values for the impact factors are c = 10, h = 7, m = 4, and l = 2, to indicate a finite value for the severity of an update. There is absolutely no constraint to revisit the allocation of


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

severity to update impact factors during subsequent iterations of updates to BP requirements and the corresponding deployment cycle. It should be based on findings as well as the severity of the BP requirements in consideration.

In Eq. (1), #BPs represents the total number of participant BPs that form the DSs enabled enterprise (7 in this case). When there is a need to introduce or update a sublevel BP due to a BP requirement, it is considered a critical (c) change to the enterprise, whereas an update to or introduction of a DS operation is considered the lowest category of change, that is, low (l).

VELOCITY = [ Σ_{BP=1}^{#BPs} (m·DSR + l·OPR + h·AR + c·SBR) ] / #BPs    (1)

Table 1 provides an implementation based analysis and the computed velocity of the 4th deployment iteration of BP requirements corresponding to the 7 BPs (as described above). Following are the acronyms utilized in Table 1.
• #DSs: total number of participant DSs for the BP.
• #OPs: accumulative number of DSs' operations involved.
• #As: total number of BP activities for the BP.
• #SBPs: total number of sublevel BPs of the BP.
• #A-CSs: sum of new and updated DSs to the BP in iteration 4.
• #A-OPs: sum of new and updated number of DSs' operations introduced to the BP in iteration 4.
• #A-A: sum of new and updated BP activities introduced to the BP in iteration 4.
• #A-SBPs: sum of new and updated sublevel BPs introduced to the BP in iteration 4.
DSs' operations, DSs, BP activities, and sublevel BPs that are being reused across multiple BPs are counted at each and every instance for the purpose of accuracy to evaluate velocity.

Table 1: Velocity of the enterprise in iteration 4
BP#   #DSs (#A-CSs)   #OPs (#A-OPs)   #As (#A-A)   #SBPs (#A-SBPs)
1     7 (0)           20 (2)          8 (1)        2 (0)
2     12 (3)          28 (7)          15 (0)       3 (0)
3     18 (4)          42 (7)          22 (2)       5 (1)
4     8 (2)           15 (3)          15 (2)       3 (0)
5     5 (1)           12 (2)          10 (1)       3 (0)
6     3 (0)           8 (1)           7 (0)        1 (0)
7     9 (2)           16 (3)          14 (1)       2 (0)
VELOCITY (of Iteration 4) = 1.52

As such, there is no maximum limit set for the velocity; however, the present deployment iteration's velocity score can be considered as the baseline for subsequent iterations. The progressive values of velocity are indicated in Figure 2 for each iteration (1 through 4) pertaining to the 7 BPs in consideration.

6. Formulating DSs Maintainability Index (DSMI)

There is no obvious solution to evaluate the maintainability of DSs. The primary reason is the little to no effort spent on defining a maturity model and standardization for DSs. SOA maturity models and governance are implied at the more operational aspects of the functional architecture type services [30] and [13]. The other types of metrics, presented in [1] and [32], measure agility irrespective of the maintainability concerns of DSs. The DSMI is an effort to compute and continuously monitor the maintainability of DSs. Oracle SOA suite capabilities are utilized to monitor and log DSs. Service registry features are embraced to define, govern, and monitor SLAs as well as the metadata associated with the DSs.

6.1 Paradigms to Derive Inverted DSMI

The paradigms to formulate DSMI are described below for each type of DSs.

Business continuity (BUC): It is to determine whether the introduced or updated DSs are able to continue the day-to-day business activities after the deployment (or iteration). The evaluation criterion for the BUC paradigm is to monitor the number of unique support tickets created for the type of DSs in context. For example, new customer registration is producing errors due to inaccuracies in validation of the customer account number and/or customer identification. The inverted ratio for BUC specific to the set of DSs associated with the DS type is derived below.

iBUC = (# of unique support tickets by the customer / #DSs deployed for the DS type)

Operational risk (ORI): Operational risks are basically to evaluate the DS level continuation of the enterprise operations. Typically, it is traced by the number of failures that occurred for the DSs in the production cycle of the present deployment iteration. A specific example is a change purchase order request DS that failed due to an ambiguous condition occurring within the dedicated DSs. The inverted ratio for ORI specific to the set of DSs associated with the DS type is derived below.
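Read concretely, Eq. (1) averages the four weighted change ratios over all participant BPs. The sketch below is an illustration, not the authors' implementation: the per-BP averaging and the weight-to-ratio mapping are inferred from the surrounding text. Under these assumptions it yields roughly 1.80 rather than the reported 1.52, so the published computation may normalize the ratios somewhat differently.

```python
# Illustrative sketch (not the authors' implementation) of Eq. (1) over
# the Table 1 data. Impact factors follow the text: c = 10 (sublevel
# BPs), h = 7 (BP activities), m = 4 (DSs), l = 2 (DS operations).

C, H, M, L = 10, 7, 4, 2

# Per-BP tuples from Table 1:
# (#DSs, #A-CSs, #OPs, #A-OPs, #As, #A-A, #SBPs, #A-SBPs)
TABLE_1 = [
    (7, 0, 20, 2, 8, 1, 2, 0),
    (12, 3, 28, 7, 15, 0, 3, 0),
    (18, 4, 42, 7, 22, 2, 5, 1),
    (8, 2, 15, 3, 15, 2, 3, 0),
    (5, 1, 12, 2, 10, 1, 3, 0),
    (3, 0, 8, 1, 7, 0, 1, 0),
    (9, 2, 16, 3, 14, 1, 2, 0),
]

def velocity(rows):
    """Average weighted change ratio across all participant BPs."""
    total = 0.0
    for ds, a_cs, ops, a_ops, acts, a_a, sbps, a_sbps in rows:
        dsr = a_cs / ds        # DSs' ratio
        opr = a_ops / ops      # DS operations' ratio
        ar = a_a / acts        # BP activities' ratio
        sbr = a_sbps / sbps    # sublevel BPs' ratio
        total += M * dsr + L * opr + H * ar + C * sbr
    return total / len(rows)

print(round(velocity(TABLE_1), 2))
```

A BP with no additions in the iteration contributes zero, so an iteration touching nothing scores a velocity of zero, which matches the text's reading of velocity as a measure of change.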



iORI = (# of unique operational failures / #DSs deployed for the DS type)

The ORI ratio is generated by comparing the failures with the previous deployment iteration. The DSs' header contains the probability of failures, and it is automated to some extent to gain an indicative operational risk at runtime (as stated in Section 4).

SLA Factorization (SPR): Scalability, reliability, and performance (SPR) are bundled to evaluate SLA factorization. The SLAs defined in consideration of the desired SPR for each type of DSs are configured and monitored. The SPR is identified based on the number of violations by the particular category of DSs in the present deployment iteration. A 4 second delay (when the SLA is set for a maximum of 3 seconds) in sending an order confirmation to a vendor for a specific product due to heavy traffic is an example of an SLA violation. The inverted ratio for SPR specific to the set of DSs associated with the DS type is derived below.

iSPR = (# of unique SPR specific SLA violations / #DSs deployed for the DS type)

Consistency (COS): Consistency can be evaluated from many different aspects. The primary objective of this criterion is to assess the scope of the DS across multiple BP activities. Due to the BP requirements, the specification of the DS needs to incorporate high level interactions with enterprise entities and the underlying events of BP activities. The consistency of a DS is derived based on the number of BP activities utilizing the specific type of DSs in consideration. The most prominent example is that order delivery confirmation and status need to be sent to the customer, vendor, and account receivables. The inverted ratio for COS specific to the set of DSs associated with the DS type is derived below.

iCOS = (# of BP activities utilizing DSs of the DS type / #DSs deployed for the DS type)

Extendibility and Continuous Improvements (ECI): Extensibility and continuous improvement of the DSs are evaluated based on the customization required to accomplish BP requirements. It is computed considering the number of additional custom modeling as well as implementation needed in the context of a BP activity and enterprise entity. The primary objective is whether the respective DSs are able to accommodate these customizations within the dilemma of their dependencies with existing enterprise entities. If the payment is not received within 6 months then it needs to be sent for collection and the vendor also needs to be notified; this is an example of extendibility of DSs associated with the payment processing and account receivable BP. The inverted ratio for ECI specific to the set of DSs associated with the DS type is derived below.

iECI = (# of alternate BP flows accommodated in DSs of the DS type / #DSs deployed for the DS type)

If "n" stands for the number of DS types identified in an enterprise (it is 10 in this case based on Section 4), then the inverted DSMI can be computed based on Eq. (2). #Paradigms (the number of paradigms) impacting the DSMI is 5, as described above.

Inverted DSMI = (1 / DSMI) = [ (Σ_1^n iBUC)/n + (Σ_1^n iORI)/n + (Σ_1^n iSPR)/n + (Σ_1^n iCOS)/n + (Σ_1^n iECI)/n ] / #Paradigms    (2)

Table 2 below presents the DSMI computed in iteration 4 for the identified and deployed 7 BPs (as described in Section 5).

Table 2: DSMI in iteration 4
DS Type (# of DSs)   iBUC   iORI   iSPR   iCOS   iECI
Competency (6)       0.33   0.83   0.5    0.5    0.67
Relationship (12)    0.25   0.67   0.5    1.5    0.5
Collaboration (4)    0.25   0      0.25   0.5    0.25
Common (7)           0.29   0.14   0.42   2      0.29
Framework (8)        0.5    0.25   0.75   0.5    0.38
Governance (6)       0.33   0.5    0.33   1.5    0.83
Organizational (7)   0.29   0.14   0.42   0.71   0.86
Conditional (5)      0.6    0.8    0.4    0.4    0.2
Automation (3)       0.33   0.67   1.67   0.67   2
Actual DSMI (of Iteration 4) = 1.76

6.2 Analysis and Observations of Evaluation

Figure 2 provides the progress of velocity and DSMI through iteration 4 for the 7 BPs deployed, advanced, and monitored. The finite numbers indicate a significant reduction in velocity over the iterations: a 58% reduction in velocity (of deployment iteration 4) compared to iteration 3.
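For illustration, Eq. (2) can be evaluated directly over the Table 2 ratios. The sketch below is not the published implementation: it assumes n is the 9 DS types listed in Table 2 (the text mentions 10 identified types) and yields a DSMI of roughly 1.69 against the reported 1.76, a gap plausibly due to the two-decimal rounding of the table values.

```python
# Illustrative sketch (not the authors' implementation) of Eq. (2):
# inverted DSMI as the mean over the 5 paradigms of the per-type
# average ratio, then DSMI as its reciprocal.

# DS type -> (iBUC, iORI, iSPR, iCOS, iECI), copied from Table 2.
TABLE_2 = {
    "Competency":     (0.33, 0.83, 0.50, 0.50, 0.67),
    "Relationship":   (0.25, 0.67, 0.50, 1.50, 0.50),
    "Collaboration":  (0.25, 0.00, 0.25, 0.50, 0.25),
    "Common":         (0.29, 0.14, 0.42, 2.00, 0.29),
    "Framework":      (0.50, 0.25, 0.75, 0.50, 0.38),
    "Governance":     (0.33, 0.50, 0.33, 1.50, 0.83),
    "Organizational": (0.29, 0.14, 0.42, 0.71, 0.86),
    "Conditional":    (0.60, 0.80, 0.40, 0.40, 0.20),
    "Automation":     (0.33, 0.67, 1.67, 0.67, 2.00),
}

def dsmi(table):
    """DSMI = 1 / (mean over 5 paradigms of the per-type average ratio)."""
    n = len(table)  # number of DS types carrying the ratios
    paradigm_means = [sum(row[p] for row in table.values()) / n
                      for p in range(5)]
    inverted = sum(paradigm_means) / len(paradigm_means)
    return 1.0 / inverted

print(round(dsmi(TABLE_2), 2))
```

Because the paradigm ratios are "inverted" (issues per deployed DS), lower ratios mean fewer tickets, failures, and violations, which drives the inverted DSMI down and the DSMI itself up.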



The graph also indicates an increase in DSMI over the iterations. The DSMI (of deployment iteration 4) is improved by 21% compared to iteration 3. The result directly illustrates that continuous monitoring and improvements, in terms of reducing the number of issues reported by the business users, immediate resolution of the causes of services' failures, accurate modeling of DSs with respect to the BP requirements, and precision in test scenarios, decrease the velocity of the enterprise and stabilize the DSMI.

Fig. 2 Computed velocities and DSMI for all deployment iterations in production.

Essentially, it concludes that a larger number of BP activities utilizing a single DS and a larger number of alternate paths included in a single DS decrease the level of maintainability of DSs; however, they increase the consistency and extendibility of the DSs. Contrarily, introducing a larger number of DSs also increases the additional level of SLA associations and uncertainties; however, it introduces an increased level of flexibility and agility in an enterprise. It is a trade-off that the enterprise has to decide during the assessment of the DSs architecture (2nd step described in Section 3 Figure 2).

7. Conclusions

The perception of SOA is receiving wide acceptance due to its ability to adapt and respond to BP related requirements and changes, providing operational visibility to an enterprise. DSs are the means to accommodate the uncertainties of BPs such that an enterprise may be able to gain an acceptable level of agility and completeness. As such, there is limited to no standardization available to derive and maintain the qualities of DSs. In this paper, we presented the necessity of rationalizing DSs and their principles. The research effort is to propose an empirical method to derive and evolve the principles of identifying and placing DSs. The categorization and corresponding implementation of BP requirements into the DSs are identified and implied. Formulae to evaluate the velocity of the enterprise and assessment criteria to monitor the maintainability of deployed DSs in terms of an index are illustrated with an example implementation and validated in a number of actual deployment iterations.

The rationalization achieved utilizing the methodology to derive and place principles of DSs increases consistency and predictability across multiple units as well as entities of an enterprise. The measurable implications due to changes in BP requirements and assessable maintainability are accomplished due to the classification and evaluation methodologies of DSs. The subsequent step is to determine a more granular level of DS types that can be leveraged in multifaceted BP scenarios. The underlying primary goal remains intact, that is, to evolve, retain, and stabilize the maintainability of DSs.

Acknowledgments

Vikas Shah wishes to recognize Wipro Technologies' Connected Enterprise Services (CES) sales team for supporting the initiative. Special thanks to Wipro Technologies' Oracle practice for providing the opportunity of applying the conceptually identified differentiated principles, types, and measurements in 4 different iterations.

References
[1] A. Qumer and B. Henderson-Sellers, "An Evaluation of the Degree of Agility in Six Agile Methods and its Applicability for Method Engineering," In: Information and Software Technology, Volume 50, Issue 4, pp. 280-295, March 2008.
[2] Alfred Zimmermann, Kurt Sandkuhl, Michael Pretz, Michael Falkenthal, Dierk Jugel, and Matthias Wissotzki, "Towards an Integrated Service-Oriented Reference Enterprise Architecture," In: Proceedings of the 2013 International Workshop on Ecosystem Architectures, pp. 26-30, 2013.
[3] Andrea Malsbender, Jens Poeppelbuss, Ralf Plattfaut, Björn Niehaves, and Jörg Becker, "How to Increase Service Productivity: A BPM Perspective," In: Proceedings of Pacific Asia Conference on Information Systems 2011, July 2011.
[4] Anirban Ganguly, Roshanak Nilchiani, and John V. Farr, "Evaluating Agility in Corporate Enterprise," In: International Journal of Production Economics, Volume 118, Issue 2, pp. 410-423, April 2009.
[5] Aries Tao Tao and Jian Yang, "Context Aware Differentiated Services Development with Configurable Business Processes," In: 11th IEEE International Enterprise Distributed Object Computing Conference 2007, Oct 2007.
[6] Anne Hiemstra, Pascal Ravesteyn, and Johan Versendaal, "An Alignment Model for Business Process Management and Service Oriented Architecture," In: 6th International Conference on Enterprise Systems, Accounting and Logistics (6th ICESAL '09), May 2009.



[7] Bohdana Sherehiy, Waldemar Karwowski, and John K. Layer, "A Review of Enterprise Agility: Concepts, Frameworks, and Attributes," In: International Journal of Industrial Ergonomics 37, pp. 445-460, March 2007.
[8] Clay Richardson and Derek Miers, "How The Top 10 Vendors Stack Up For Next-Generation BPM Suites," In: The Forrester Wave: BPM Suites, Q1 2013, For: Enterprise Architecture Professionals, March 2013.
[9] Daniel Selman, "5 Principles of Agile Enterprise in 2012," In: IBM Operational Decision Manager Blog, Dec 2011.
[10] Dean Leffingwell, Ryan Martens, and Mauricio Zamora, "Principles of Agile Architecture," Leffingwell, LLC. & Rally Software Development Corp., July 2008.
[11] Douglas Paul Thiel, "Preserving IT Investments with BPM + SOA Coordination," Technology Investment Management Library, SenseAgility Group, November 2009.
[12] Florian Wagner, Benjamin Klöpper, and Fuyuki Ishikawa, "Towards Robust Service Compositions in the Context of Functionally Diverse Services," In: Proceedings of the 21st International Conference on World Wide Web, pp. 969-978, April 2012.
[13] Fred A. Cummins, "Chapter 9: Agile Governance," In Book: Building the Agile Enterprise: With SOA, BPM and MBM, Morgan Kaufmann, July 28, 2010.
[14] Haitham Abdel Monem El-Ghareeb, "Aligning Service Oriented Architecture and Business Process Management Systems to Achieve Business Agility," Technical Paper, Department of Information Systems, Mansoura University, Egypt, 2008.
[15] Harry Sneed, Stephan Sneed, and Stefan Schedl, "Linking Legacy Services to the Business Process Model," In: 2012 IEEE 6th International Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems, August 2012.
[16] Imran Sarwar Bajwa, Rafaqut Kazmi, Shahzad Mumtaz, M. Abbas Choudhary, and M. Shahid Naweed, "SOA and BPM Partnership: A paradigm for Dynamic and Flexible Process and I.T. Management," In: International Journal of Humanities and Social Sciences 3(3), pp. 267-273, Jul 2009.
[17] Imran Sarwar Bajwa, "SOA Embedded in BPM: High Level View of Object Oriented Paradigm," In: World Academy of Science, Engineering & Technology, Issue 54, pp. 209-312, May 2011.
[18] Jean-Pierre Vickoff, "Agile Enterprise Architecture PUMA," Teamlog, October 2007.
[19] Leonardo Guerreiro Azevedo, Flávia Santoro, Fernanda Baião, Jairo Souza, Kate Revoredo, Vinícios Pereira, and Isolda Herlain, "A Method for Service Identification from Business Process Models in a SOA Approach," In: Enterprise, Business-Process and Information Systems Modeling, LNCS Volume 29, pp. 99-112, 2009.
[20] Marinela Mircea, "Adapt Business Process to Service Oriented Environment to Achieve Business Agility," In: Journal of Applied Quantitative Methods, Winter 2010, Vol. 5, Issue 4, pp. 679-691, 2010.
[21] Martin Keen, Greg Ackerman, Islam Azaz, Manfred Haas, Richard Johnson, JeeWook Kim, and Paul Robertson, "Patterns: SOA Foundation - Business Process Management Scenario," IBM WebSphere Software Redbook, Aug 2006.
[22] Mendix Technology, "7 Principles of Agile Enterprises: Shaping today's Technology into Tomorrow's Innovation," Presentation, October 2013.
[23] Michael Rosen, Boris Lublinsky, Kevin T. Smith, and Marc J. Balcer, "Overview of SOA Implementation Methodology," In Book: Applied SOA: SOA and Design Strategies, John Wiley & Sons, Web ISBN: 0-470223-65-0, June 2008.
[24] Oracle Inc., Oracle Application Integration Architecture: Business Process Modeling and Analysis, Whitepaper, 2013.
[25] Paul Harmon, "What is Business Architecture," In: Business Process Trends, Vol. 8, Number 19, November 2010.
[26] Petcu, D. and Stankovski, V., "Towards Cloud-enabled Business Process Management Based on Patterns, Rules and Multiple Models," In: 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), July 2012.
[27] Ralph Whittle, "Examining Capabilities as Architecture," In: Business Process Trends, September 2013.
[28] Ravi Khadka, Amir Saeidi, Andrei Idu, Jurriaan Hage, and Slinger Jansen, "Legacy to SOA Evolution: A Systematic Literature Review," Technical Report UU-CS-2012-006, Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands, March 2012.
[29] Razmik Abnous, "Achieving Enterprise Process Agility through BPM and SOA," Whitepaper, Content Management EMC, June 2008.
[30] Scott W. Ambler and Mark Lines, "Disciplined Agile Delivery: A Practitioner's Guide to Agile Software Delivery in the Enterprise," IBM Press, ISBN: 0132810131, 2012.
[31] The Open Group: TOGAF Version 9.1 Standards, December 2011. http://pubs.opengroup.org/architecture/togaf9-doc/arch/
[32] Tsz-Wai Lui and Gabriele Piccoli, "Degree of Agility: Implications for Information Systems Design and Firm Strategy," In Book: Agile Information Systems: Conceptualization, Construction, and Management, Routledge, Oct 19, 2006.
[33] Vishal Dwivedi and Naveen Kulkarni, "A Model Driven Service Identification Approach for Process Centric Systems," In: 2008 IEEE Congress on Services Part II, pp. 65-72, 2008.
[34] Xie Zhengyu, Dong Baotian, and Wang Li, "Research of Service Granularity Base on SOA in Railway Information Sharing Platform," In: Proceedings of the 2009 International Symposium on Information Processing (ISIP'09), pp. 391-395, August 21-23, 2009.

Vikas S. Shah received the Bachelor of Engineering degree in computer engineering from Conceicao Rodrigues College of Engineering, University of Mumbai, India in 1995, and the M.Sc. degree in computer science from Worcester Polytechnic Institute, MA, USA in 1998. Currently he is Lead Architect in the Connected Enterprise Services (CES) group at Wipro Technologies, NJ, USA. He has published several papers on integration architecture, real-time enterprises, architecture methodologies, and management approaches. He has headed multiple enterprise architecture initiatives and research ranging from startups to consulting firms. Besides software architecture research and initiatives, he extensively supports pre-sales solutions, risk management methodologies, and service oriented architecture or cloud strategy assessment as well as planning for multinational customers.



A comparative study and classification on web service security testing approaches

Azadeh Esfandyari

Department of Computer, Gilangharb branch, Islamic Azad University, Gilangharb, Iran [email protected]

Abstract
Web Services testing is essential to achieve the goal of scalable, robust and successful Web Services, especially in a business environment where hundreds of Web Services may be working together. This relatively new way of software development brings out new issues for Web Service testing to ensure the quality of services that are published, bound, invoked and integrated at runtime. Testing services poses new challenges to traditional testing approaches. The dynamic scenario of Service Oriented Architecture (SOA) is also altering the traditional view of security and causes new risks. The great importance of this field has attracted the attention of researchers. In this paper, in addition to presenting a survey and classification of the main existing web service testing approaches, web service security testing researches and their issues are investigated.
Keywords: web service security testing, WSDL

1. Introduction

Web Services are modular, self-described and self-contained applications. With open standards, Web Services enable developers to build applications based on any platform with any modular component and any programming language. More and more corporations now are exposing their information as Web Services and, what's more, it is likely that Web Services are used in mission critical roles, therefore performance matters. Consumers of web services will want assurances that Web Services won't fail to return a response in a certain time period. So Web Services testing is all the more important to meet the consumers' needs. Web Services' testing is different from traditional software testing. In addition, traditional testing processes and tools do not work well for testing Web Services; therefore, testing Web Services is difficult and poses many challenges to traditional testing approaches, due to the above mentioned reason and because Web Services are distributed applications with numerous runtime behaviors.
Generally, there are two kinds of Web Services: the Web Services used in an Intranet and the Web Services used in the Internet. Both of them face security risks since messages could be stolen, lost, or modified. Information protection is the complex of means directed at assuring information safety. In practice it should include maintenance of integrity, availability, and confidentiality of the information and resources used for data input, saving, processing and transferring [1]. To achieve reliable Web services, which can be integrated into compositions or consumed without any risk in an open network like the Internet, more and more software development companies rely on testing activities. In particular, security testing approaches help to detect vulnerabilities in Web services in order to make them trustworthy. The rest of the paper is organized as follows: Section II presents an overview and a classification of web service testing approaches. Section III summarizes web service security testing approaches and issues. Finally, Section IV gives a conclusion of the paper.

2. Overview and a classification of web service testing approaches

The Web Services world is moving fast, producing new specifications all the time and different applications, and hence introducing more challenges to develop more adequate testing schemes. The challenges stem mainly from the fact that Web Services applications are distributed applications with runtime behaviors that differ from more traditional applications. In Web Services, there is a clear separation of roles between the users, the providers, the owners, and the developers of a service and the piece of software behind it. Thus, automated service discovery and ultra-late binding mean that the complete configuration of a system is known only at execution time, and this hinders integration testing [2]. To have an overview of web service testing approaches I use the classification proposed by [2]. But it seems that this classification is not sufficient for categorizing all existing approaches; therefore a new classification is introduced.
In [2] the existing web service testing approaches are mainly classified into 4 classes, excluding the approaches that are based on formal methods and data gathering:
• WSDL-Based Test Case Generation Approaches
• Mutation-Based Test Case Generation Approaches
• Test Modeling Approaches
• XML-Based Approaches
All Mutation-Based test case generation approaches referred to in [2], like [3, 4], are based on WSDL and can be placed in the first class. Also, there are approaches that, in addition to considering the WSDL specification, use other scenarios to cope with the limitations of WSDL specification



based test case generation, so the introduction of a new category seemed necessary. The proposed classification is:
• WSDL-Based Test Case Generation Approaches
• Test Modeling Approaches
• XML-Based Approaches
• Extended Test Case Generation Approaches

2.1 WSDL-Based Test Case Generation Approaches

These approaches essentially present solutions for generating test cases for web services based only on the Web Services Description Language (WSDL). Research activities in this category are really extensive and are not all included in this paper. Two WSDL approaches are introduced in the following.

Hanna and Munro in [5] present a solution for test case generation depending on a model of the XML schema datatypes of the input message parameters that can be found in the WSDL specification of the Web Service under test. They consider the role of application builder and broker in testing web services. This framework uses just boundary value testing techniques.

Mao in [6] proposes a two level testing framework for Web Service-based software. At the service unit level, a combinatorial testing method is used to ensure a single service's reliability through extracting interface information from the WSDL file. At the system level, the BPEL specification is first converted into a state diagram, and then a state transition-based test case generation algorithm is presented.

Obviously, the researches that generate web service test cases from WSDL by using various testing techniques, like black box and random testing techniques and so on, are placed in this category.

2.2 Test Modeling Approaches

Model-based testing is a kind of black-box testing, where the experiments are automatically generated from the formally described interface specification, and subsequently also automatically executed [7].
Frantzen et al. [7] discuss on a running example how coordination protocols may also serve as the input for Model-Based Testing of Web Services. They propose to use Symbolic Transition Systems and the underlying testing theory to approach modelling and testing the coordination.
Feudjio and Schieferdecker in [8] introduced the concept of test patterns as an attempt to apply the design pattern approach, broadly applied in object-oriented software development, to model-driven test development. Pattern driven test design effectively allows tests targeting semantical aspects, which are highly critical for service availability testing (unlike other approaches that focus on syntactical correctness), to be designed at an early stage to drive the product development process and to help uncover failures prior to deployment of services [2].
Tsai et al. [9] present a Web Services testing approach based on a stochastic voting algorithm that votes on the outputs of the Web Service under test. The algorithm uses the idea of k-means clustering to handle the multi-dimensional data with deviations. The heuristic is based on local optimization and may fail to find the globally optimal results. Furthermore, the algorithm assumes that the allowed deviation is known, which may be hard to determine because the deviation is application dependent.

2.3 XML-Based Approaches

Tsai et al. [10] proposed an XML-based object-oriented (OO) testing framework to test Web Services rapidly. They named their approach Coyote. It consists of two parts: a test master and a test engine. The test master allows testers to specify test scenarios and cases as well as various analyses such as dependency analysis, completeness and consistency, and converts WSDL specifications into test scenarios. The test engine interacts with the Web Services under test, and provides tracing information. The test master maps WSDL specifications into test scenarios, performs test scenario and case generation, performs dependency analysis, and completeness and consistency checking. A WSDL file contains the signature specification of all the Web Service methods, including method names and input/output parameters, and the WSDL can be extended so that a variety of test techniques can be used to generate test cases. The test master extracts the interface information from the WSDL file and maps the signatures of Web Services into test scenarios. The test cases are generated from the test scenarios in the XML format, which is interpreted by the test engine in the second stage.
Di Penta et al. [11] proposed an approach to complement service descriptions with a facet providing test cases, in the form of XML-based functional and nonfunctional assertions. A facet is an (XML) document describing a particular property of a service, such as its WSDL interface. Facets to support service regression testing can either be produced manually by the service provider or by the tester, or can be generated from unit test cases of the system exposed as a service.

2.4 Extended Test Case Generation Approaches

Because of the weak support of WSDL for web services' semantical aspects, some approaches don't confine themselves only to WSDL-Based Test Case Generation.
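As a toy illustration of the boundary-value idea used by the WSDL-datatype approaches above (the schema constraint and values below are invented for illustration, not taken from [5]): given an operation parameter declared in the WSDL as an integer restricted to a range, candidate test inputs sit at, just inside, and just outside the bounds.

```python
# Toy sketch of boundary-value test input selection for a WSDL message
# parameter whose XML Schema type is a restricted integer range
# (hypothetical <xs:restriction base="xs:int"> with minInclusive and
# maxInclusive facets; not the framework from [5]).

def boundary_values(min_incl, max_incl):
    """Classic boundary-value selection: both bounds, their neighbours
    just inside and just outside the range, and a mid-range value."""
    return sorted({
        min_incl - 1,               # below range: expect a SOAP fault
        min_incl, min_incl + 1,     # lower bound and its neighbour
        (min_incl + max_incl) // 2, # nominal mid-range value
        max_incl - 1, max_incl,     # upper bound and its neighbour
        max_incl + 1,               # above range: expect a SOAP fault
    })

# Hypothetical parameter restricted to [1, 100]:
print(boundary_values(1, 100))
```

Each value would then be wrapped into a SOAP request for the operation under test; the out-of-range values check that the service rejects invalid input rather than failing silently.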



Damiani et al. [12], in order to guarantee the quality of the given services, propose a collaborative testing framework in which different parties participate. They proposed a novel approach that uses a third party certifier as a trusted entity to perform all the needed tests on behalf of the user and certify that a particular service has been tested successfully to satisfy the user's needs.
The open model scenario is a way to overcome the limitations of WSDL specification based test case generation [12]. Since the service source code is generally not available, the certifier can gain a better understanding of the service behavior starting from its model. The benefit of such a strategy is to allow the certifier to identify the critical areas of the service and therefore design test cases to check them [12].

2.5 Web service testing tools

Many tools have been implemented for testing Web Services. The next subsections briefly describe the three selected tools.
• SoapUI Tool
This tool is a Java based open source tool. It can work under any platform provided with a Java Virtual Machine (JVM). The tool is implemented mainly to test Web Services such as SOAP, REST, HTTP, JMS and other based services. Although SoapUI concentrates on functionality, it also considers performance, interoperability, and regression testing [13].
• PushToTest Tool
One of the objectives of this open source tool is to support reusability and sharing between people who are involved in software development through providing a robust testing environment. PushToTest is primarily implemented for testing Service Oriented Architecture (SOA), Ajax, Web applications, Web Services, and many other applications. This tool adopts a methodology which is used in many reputed companies. The methodology consists of four steps: planning, functional test, load test, and result analysis. PushToTest can determine the performance of Web Services, and report the broken ones. Also, it is able to recommend some solutions to the problems of performance [14].
• WebInject Tool
This tool is used to test Web applications and services. It can report the testing results in real time, and monitor applications efficiently. Furthermore, the tool supports a set of multiple cases, and has the ability to analyze these cases in reasonable time. Practically, the tool is written in Perl, and works with the platforms which have a Perl interpreter. The architecture of the WebInject tool includes the WebInject Engine and a Graphical User Interface (GUI), where the test cases are written in XML files and the results are shown in HTML and XML files [15].

3. Web service security testing overview and related work

Web services play an important role for the future of the Internet, for their flexibility, dynamicity, interoperability, and for the enhanced functionalities they support. The price we pay for such an increased convenience is the introduction of new security risks and threats, and the need for solutions that allow to select and compose services on the basis of their security properties [16]. This dynamic and evolving scenario is changing the traditional view of security and introduces new threats and risks for applications. As a consequence, there is the need of adapting current development, verification, validation, and certification techniques to the SOA vision [17].
To achieve reliable Web services, which can be integrated into compositions or consumed without any risk in an open network like the Internet, more and more software development companies rely on software engineering, on quality processes, and quite obviously on testing activities. In particular, security testing approaches help to detect vulnerabilities in Web services in order to make them trustworthy.
Concerning Web service security testing, few dedicated works have been proposed. In [18], a passive testing method, based on a monitoring technique, aims to filter out the SOAP messages by detecting the malicious ones to improve the Web Service's availability. Mallouli et al. also proposed, in [19], a passive testing method which analyzes SOAP messages with XML sniffers to check whether a system respects a policy. In [20], a security testing method is described to test systems with timed security rules modelled with Nomad. The specification is augmented by means of specific algorithms for basic prohibition and obligation rules only. Then, test cases are generated with the "TestGenIF" tool. A Web Service is illustrated as an example. In [21] a security testing method dedicated to stateful Web Services is proposed. Security rules are defined with the Nomad language and are translated into test purposes. The specification is completed to take into account the SOAP environment while testing. Test cases are generated by means of a synchronous product between the test purposes and the completed specification.


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

consider system-wide certificates to be used at deployment and installation time. By contrast, in a service-based environment, we need a certification solution that can support the dynamic nature of services and can be integrated within the runtime service discovery, selection, and composition processes [22].

To certify that a given security property is held by a service, two main types of certification processes are of interest: test-based certification and model-based certification. According to Damiani et al. [23], test-based certification is a process producing evidence-based proofs that a (white- and/or black-box) test carried out on the software has given a certain result, which in turn shows that a given high-level security property holds for that software. Model-based certification can provide formal proofs based on an abstract model of the service (e.g., a set of logic formulas or a formal computational model such as a finite state automaton).

Anisetti et al. [16] propose a test-based security certification scheme suitable for the service ecosystem. The scheme is based on the formal modeling of the service at different levels of granularity and provides a model-based testing approach used to produce the evidence that a given security property holds for the service. The proposed certification process is carried out collaboratively by three main parties: (i) a service provider that wants to certify its services; (ii) a certification authority managing the overall certification process; and (iii) a Lab accredited by the certification authority that carries out the property evaluation. The service model, generated by the certification authority using the security property and the service specifications, is defined at three levels of granularity: a WSDL-based model, a WSCL-based model, and an implementation-based model. The certification authority sends the service model, together with the service implementation and the requested security property, to the accredited Lab. The accredited Lab generates the evidence needed to certify the service on the basis of the model and the security property, and returns it to the certification authority. If the evidence is sufficient to prove the requested property, the certification authority awards a certificate to the service, which includes the certified property, the service model, and the evidence. They also propose matching and comparison processes that return a ranking of services based on the assurance level provided by the service certificates. Because it supports the dynamic comparison and selection of functionally equivalent services, the solution can be easily integrated within a service-based infrastructure.

4. Conclusions

This paper reviewed the main issues and related work in Web service testing and Web service security testing. Four classes of Web service testing approaches were introduced. Considering this classification where security is concerned, the classes that are based only on WSDL specifications are not useful for security testing, since ignoring WSCL and implementation details does not allow the definition of accurate attack models and test cases. Because of the abstraction of the fourth class, it can include approaches that, by completely modeling the service, are able to produce the fine-grained test cases needed to certify a security property of the service (e.g., Anisetti et al. [16]). Although some security concepts (for instance reliability) are not taken into account in [16], and the complexity of its processes is higher, it seems to be the most comprehensive approach in the Web service security certification area.

References
[1] Li, Y., Li, M., & Yu, J. (2004). Web services testing, the methodology, and the implementation of the automation-testing tool. In Grid and Cooperative Computing (pp. 940-947).
[2] Ladan, M. I. (2010). Web services testing approaches: A survey and a classification. In Networked Digital Technologies (pp. 70-79).
[3] Siblini, R., & Mansour, N. (2005). Testing web services. In Computer Systems and Applications, 2005. The 3rd ACS/IEEE International Conference on (p. 135).
[4] Andre, L., & Regina, S. (2009). V.: Mutation based testing of web services. IEEE Software.
[5] Hanna, S., & Munro, M. (2007, May). An approach for specification-based test case generation for Web services. In Computer Systems and Applications, 2007. AICCSA'07. IEEE/ACS International Conference on (pp. 16-23).
[6] Mao, C. (2009, August). A specification-based testing framework for web service-based software. In Granular Computing, 2009, GRC'09. IEEE International Conference on (pp. 440-443).
[7] Frantzen, L., Tretmans, J., & de Vries, R. (2006, May). Towards model-based testing of web services. In International Workshop on Web Services–Modeling and Testing (WS-MaTe 2006) (p. 67).
[8] Feudjio, A. G. V., & Schieferdecker, I. (2009). Availability testing for web services.
[9] Tsai, W. T., Zhang, D., Paul, R., & Chen, Y. (2005, September). Stochastic voting algorithms for Web services group testing. In Quality Software, 2005 (QSIC 2005). Fifth International Conference on (pp. 99-106).
[10] Tsai, W. T., Paul, R., Song, W., & Cao, Z. (2002). Coyote: An XML-based framework for web services testing. In High Assurance Systems Engineering, 2002. Proceedings. 7th IEEE International Symposium on (pp. 173-174).
[11] Di Penta, M., Bruno, M., Esposito, G., Mazza, V., & Canfora, G. (2007). Web services regression testing. In Test and Analysis of Web Services (pp. 205-234).
[12] Damiani, E., El Ioini, N., Sillitti, A., & Succi, G. (2009, July). WS-certificate. In Services-I, 2009 World Conference on (pp. 637-644).
[13] "SoapUI tool", http://www.SoapUI.org.
[14] "PushToTest tool", http://www.PushToTest.com.



[15] "WebInject", http://www.WebInject.org/.
[16] Anisetti, M., Ardagna, C. A., Damiani, E., & Saonara, F. (2013). A test-based security certification scheme for web services. ACM Transactions on the Web (TWEB), 7(2), 5.
[17] Anisetti, M., Ardagna, C., & Damiani, E. (2011, July). Fine-grained modeling of web services for test-based security certification. In Services Computing (SCC), 2011 IEEE International Conference on (pp. 456-463).
[18] Gruschka, N., & Luttenberger, N. (2006). Protecting web services from DoS attacks by SOAP message validation. In Security and Privacy in Dynamic Environments (pp. 171-182).
[19] Mallouli, W., Bessayah, F., Cavalli, A., & Benameur, A. (2008, November). Security rules specification and analysis based on passive testing. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008 (pp. 1-6).
[20] Mallouli, W., Mammar, A., & Cavalli, A. (2009, December). A formal framework to integrate timed security rules within a TEFSM-based system specification. In Software Engineering Conference, 2009. APSEC'09. Asia-Pacific (pp. 489-496).
[21] Salva, S., Laurençot, P., & Rabhi, I. (2010, August). An approach dedicated for web service security testing. In Software Engineering Advances (ICSEA), 2010 Fifth International Conference on (pp. 494-500).
[22] Damiani, E., & Maña, A. (2009, November). Toward ws-certificate. In Proceedings of the 2009 ACM Workshop on Secure Web Services (pp. 1-2).

[23] Damiani, E., Ardagna, C. A., & El Ioini, N. (2008). Open source systems security certification. Springer Science & Business Media.
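The synchronous product mentioned in Section 3, where test cases are generated by composing a test-purpose automaton with the completed specification [21], can be sketched in a few lines. This is a generic, assumption-laden illustration of the technique: the state names and SOAP-level actions below are invented and do not reproduce the models or the tooling used in [21].

```python
# Minimal sketch of a synchronous product between a specification
# automaton and a test-purpose automaton. Both automata are encoded
# as dicts mapping (state, action) -> next_state; the product only
# advances on actions that both automata accept.

def sync_product(spec, purpose, start):
    """start is a pair (spec_state, purpose_state); returns the
    product transition relation, which serves as a test-case graph."""
    frontier = [start]
    seen, product = {start}, {}
    while frontier:
        s, p = frontier.pop()
        for (st, act), nxt in spec.items():
            if st != s or (p, act) not in purpose:
                continue  # action not shared by both automata
            target = (nxt, purpose[(p, act)])
            product[((s, p), act)] = target
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return product

# Hypothetical behaviours: the specification accepts a login followed
# by a query; the test purpose only wants to observe the login step.
spec = {("s0", "login"): "s1", ("s1", "query"): "s2"}
purpose = {("p0", "login"): "p_accept"}
test_graph = sync_product(spec, purpose, ("s0", "p0"))
# Only the "login" transition survives in the product; test cases are
# then extracted as paths through this graph.
```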



Collaboration between Service and R&D Organizations – Two Cases in Automation Industry

Jukka Kääriäinen1, Susanna Teppola1 and Antti Välimäki2

1 VTT Technical Research Centre of Finland Ltd. Oulu, P.O. Box 1100, 90571, Finland {jukka.kaariainen, susanna.teppola}@vtt.fi

2 Valmet Automation Inc. Tampere, Lentokentänkatu 11, 33900, Finland [email protected]

Abstract
Industrial automation systems are long-lasting, multi-technological systems that need industrial services in order to keep the system up-to-date and running smoothly. The Service organization needs to work jointly, internally with R&D and externally with customers and COTS providers, so as to operate efficiently. This paper focuses on Service/R&D collaboration. It presents a descriptive case study of how the working relationship between the Service and R&D organizations has been established in two example industrial service cases (an upgrade case and an audit case). The article reports the collaboration practices and tools that have been defined for these industrial services. This research provides, for other companies and for research institutes that work with industrial companies, practical real-life cases of how Service and R&D organizations collaborate. Other companies would benefit from studying the contents of the cases presented in this article and applying these practices in their particular context, where applicable.
Keywords: Automation systems, Industrial service, Lifecycle, Transparency, Collaboration.

1. Introduction

Industrial automation systems are used in various industrial segments, such as power generation, water management, and pulp and paper. The systems comprise HW and SW sub-systems that are developed in-house or are COTS (Commercial Off-The-Shelf) components. Since these systems have a long useful life, the automation system providers offer various kinds of lifecycle services for their customers in order to keep their automation systems running smoothly.

Integrated service/product development has been studied quite a bit, e.g. in [1, 2]. However, there is less information available on how, in practice, the needs of the Service organization could be taken into account during product development. What kind of Service/R&D collaboration could improve the quality and lead time of the industrial services? In this article, the objective is not to describe the service development process, but rather to try to understand and collect industrial best practices that increase the collaboration and transparency between the Service and R&D organizations so that customers can be serviced better and more promptly.

This article discusses the collaboration between the Service and R&D organizations using two cases that provide practical examples of the collaboration, i.e. what collaboration and transparency between the Service and R&D organizations mean in a real-life industrial environment. In addition, the paper reports what kinds of solutions the company in the case study uses to effectuate the collaboration.

The paper is organized as follows. In the next section, the background and the need for Service and R&D collaboration are stated. In section 3, the case context and research process are introduced. In section 4, the two industrial service processes are introduced that serve as cases for analyzing Service and R&D collaboration. In section 5, the cases are analyzed from the Service and R&D collaboration viewpoint. Finally, section 6 discusses the results and draws conclusions.

2. Background

In the digital economy, products and services are linked more closely to each other. The slow economic growth of recent years has boosted the development of product-related services even more – they have brought increasing revenue for manufacturing companies in place of traditional product sales [3, 4]. The global market for product and service consumption is constantly growing [5]. In 2012, the overall estimate for service revenues accrued from automation products like DCS, PLC,



SCADA, etc. amounted to nearly $15 billion [6]. Customers are more and more interested in value-added services compared to the basic products themselves. Therefore, companies and business ecosystems need the ability to adapt to the needs of the changing business environment. The shift from products to services has been taking place in the software product industry from 1990 onwards [7]. The importance of the service business has been understood for a while, but more systematic and integrated product and service development processes are needed [8]. In recent years, the focus has shifted towards understanding the customer's needs and early validation of the success of the developed services [9]. Furthermore, the separation of the Service and R&D organizations may cause communication problems that need to be tackled with new practices and organizational units [10].

Technical deterioration (technology, COTS, standards, etc.) of systems that have a long lifetime (such as automation systems) is a problem in industry. The reliability of technical systems will decrease over time if companies ignore industrial services. "For a typical automation/IT system, only 20-40 percent of the investment is actually spent on purchasing the system; the other 60-80 percent goes towards maintaining high availability and adjusting the system to changing needs during its life span" [11]. This is a huge opportunity for vendors to increase their industrial service business. Automation system providers offer their automation systems and related industrial services in order to keep the customer's industrial processes running smoothly. These industrial services need to be performed efficiently. Therefore, there should be systematic and effective service processes, with supporting IT systems, in a global operational environment. Furthermore, there should be collaboration practices between the R&D and Service organizations so that systems can be serviced efficiently and are service-friendly. All of this requires a deeper understanding of how the Service and R&D organizations should operate to enable this collaboration.

3. Case context and research process

This work was carried out within the international research projects VARIES (Variability in Safety-Critical Embedded Systems) [12] and PROMES (Process Models for Engineering of Embedded Systems) [13]. The case company operates in the automation systems industry. It offers automation and information management application networks and systems, intelligent field control solutions, and support and maintenance services. The case focuses on the automation system product sector and includes the upgrade and audit services. Typically, the customer-specific, tailored installation of the system is based on a generic product platform, and a new version of this product platform is released annually. Automation system vendors also use HW and SW COTS components in their systems, for instance third-party operating systems (e.g. Windows). Therefore, automation systems are dependent on, for instance, the technology roadmaps of the operating system providers. The (generic) main sub-systems in an automation system include the Control Room, Engineering Tools, Information Management, and Process Controllers.

Engineering tools are used to configure the automation system to fit the customer's context. This includes, for instance, the development of process applications and related views. Automation systems have a long life, and they need to be analyzed, maintained, and updated when necessary. Therefore, the case company offers, for instance, upgrade and audit services to keep the customers' automation systems up-to-date. Each update is analyzed individually so as to find the optimal solution for the customer based on the customer's business needs. Service operation is highly distributed, since the case company has over 100 sales and customer service units in 38 countries, serving customers in various industrial segments in Europe, Asia, America, Africa, and Australia.

Because of the demands of customer-specific tailoring, there are many customer-specific configurations (i.e. customer-specific variants of the automation system) in the field, containing sub-systems from different platform releases (versions). Therefore, the Service organization (the system provider) needs to track each customer's configuration of the automation system and detect what maintenance, optimization, and upgrades are possible for each customer in order to keep the customer's automation solutions running optimally.

The case company aims at better understanding the collaboration between the Service organization and the R&D organization. For other companies and research institutes, this research provides a descriptive case study of how the collaboration between the Service and R&D organizations has been established in two example service cases (an upgrade case and an audit case). The research approach is therefore bottom-up. These cases were selected for this study because the company personnel who work in this research project have in-depth knowledge about these services. We first studied these two service processes and then analyzed what kinds of activities can be found to enable transparency between the Service and R&D organizations in these cases. We selected this approach because each industrial service seems to have its own needs for collaboration, and therefore the service process itself first needs to be understood. We have



adapted the approach defined by Charalampidou et al. [14] as a frame for the process descriptions. The research was done as follows:

1. The Upgrade-service process description was composed using company interviews and workshops (case 1).
2. The Audit-service process description was composed using company interviews and workshops (case 2).
3. A case analysis was performed that combined cases 1 and 2, and additional interviews/workshops were held to understand the Service/R&D collaboration behind the service processes. Two persons who work at the Service/R&D interface in case 1 and case 2 were interviewed, and the results were discussed.
4. Finally, the results of case 1, case 2, and the case analysis were reviewed and modified by representatives of the case company.

4. Industrial cases

Industrial automation systems are used in various industrial segments, such as power generation, water management, and pulp and paper production. The systems comprise HW and SW sub-systems that are developed in-house or are COTS (Commercial Off-The-Shelf) components. Since these systems have a long lifetime, the automation system providers offer different kinds of industrial services for their customers in order to keep their automation systems running smoothly.

In this article, we present two cases related to industrial services, both of which are sub-processes of the main maintenance process. The first is the Upgrade-service and the second is the Audit-service. These cases represent process presentations that have been created in cooperation with the case company in order to document and systematize its service processes. These process descriptions have been utilized in order to identify the interfaces between the Service and R&D organizations.

4.1 Case 1: Upgrade–service

This section presents the Upgrade-service process (Fig. 1). Upgrade-service is a service provided to a customer to keep their automation systems up and running. A detailed description and demonstration of the Upgrade-service process has been presented in [15]. Phases are divided into activities that represent collections of tasks carried out by the workers (e.g. the Service Manager). One worker has the responsibility (author) for the activity, and the other workers act as contributors. Activities create and use artefacts that are retrieved from or stored in tools (information systems).

The Upgrade-service process is divided into six activities. The first four form the Upgrade Planning process. The last two represent the subsequent steps, i.e. the implementation of the upgrade and the subsequent follow-up. This case focuses on the Upgrade Planning phase of the Upgrade-service process. The process is presented as a sequence of activities to keep the presentation simple, even though in real life parallelism and loops/iterations are also possible. For instance, new customer needs may emerge during price negotiations, and these will be investigated in a new upgrade planning iteration.

"Identify upgrade needs" activity:
The process starts with the identification of upgrade needs. The input for an upgrade need may come from various sources, for instance directly from the customer, from a service engineer working on-site at the customer's premises, from a component end-of-life notification, etc. The Service Manager is responsible for collecting and documenting upgrade needs originating from internal or external sources.
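The collection step described above amounts to maintaining one documented log of needs gathered from several internal and external sources. The sketch below is purely illustrative: the record fields, source labels, and customer names are assumptions, not the case company's actual data model.

```python
from dataclasses import dataclass, field

# Illustrative data model for the "Identify upgrade needs" activity:
# the Service Manager records needs from various sources into a
# single log that later feeds lifecycle planning. Names are invented.

@dataclass
class UpgradeNeed:
    customer: str
    description: str
    source: str  # e.g. "customer", "site engineer", "end-of-life notice"

@dataclass
class UpgradeNeedLog:
    needs: list = field(default_factory=list)

    def record(self, customer, description, source):
        self.needs.append(UpgradeNeed(customer, description, source))

    def for_customer(self, customer):
        # Needs are later analyzed per customer when composing the
        # customer-specific lifecycle plan.
        return [n for n in self.needs if n.customer == customer]

log = UpgradeNeedLog()
log.record("Mill A", "OS version reaching end of support", "end-of-life notice")
log.record("Mill A", "New reporting feature requested", "customer")
log.record("Plant B", "Controller firmware update", "site engineer")
```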



Fig. 1 Description of the Upgrade Planning Process.

"Identify installed system" activity:
The Service Manager is responsible for carrying out the "Identify installed system" activity. In this activity, the customer's automation system configuration information (i.e. the customer-specific installed report) is retrieved from the InstalledBase tool. The information is collected automatically from the automation system (via a network connection with the customer's automation system) and manually through a site visit, if needed. The updated information is stored in the InstalledBase tool.

"Analyze the system/compose LC (lifecycle) plan" activity:
In the "Analyze the system/compose LC (lifecycle) plan" activity, the Service Manager is responsible for analyzing the immediate and future upgrade needs of the customer's automation system. The InstalledBase tool contains lifecycle plan functionality, which means that the tool contains lifecycle rules related to the automation systems. The lifecycle rules are composed by a Product Manager who works at the service interface in collaboration with the R&D organization. The Service Manager generates a lifecycle report from the InstalledBase tool and starts to modify it based on negotiations with the customer.

"Negotiations" activity:
In the "Negotiations" activity, the Service Manager modifies the lifecycle plan based upon the maintenance budgets and downtime schedules of the customer. The customer extranet is the common medium for the vendor and the customer to exchange lifecycle plans and other material. The final lifecycle plan presents the life cycle of each part of the system, illustrating for a single customer what needs to be upgraded and when, and at what point in time a larger migration might be needed. The plan supports the customer in preparing for the updates, for instance by predicting costs, schedules for downtimes, rationale for management, etc. Based on the negotiations and the offer, the upgrade implementation starts according to the contract. Additionally, the Service Manager is responsible for periodically re-evaluating the upgrade needs.

4.2 Case 2: Audit–service

This section presents the Audit-service process (Fig. 2). Audit-service is used to determine the status of the automation system or equipment. Systematic practices/processes and tools for collecting the information allow a repeatable, high-quality service that forms the basis for subsequent services. An audit might launch, for instance, upgrade, optimization, or training services. Again,



as in the Upgrade-service case process description, phases are divided into activities that represent collections of tasks carried out by the workers (e.g. the Service Manager). One worker has the responsibility (author) for the activity, and the other workers act as contributors. Activities create and use artefacts that are retrieved from or stored in tools (information systems). The Audit-service process is divided into five activities.

"Plan audit" activity:
This activity is used to identify, agree on, and document the scope and needs of the audit. This enables a systematic audit. The planning starts when there is a demand for the service or, e.g., a service agreement states that the audit will be done periodically. The service staff create an audit plan with the customer that contains information about: the scope/needs of the audit; the customer contact/team; the customer's arrangements to ensure a successful audit (availability of key persons during the audit, data analysis and reporting/presentation, visits, remote connections, safety/security); schedule, resources, and reporting/presentation practices; etc. Furthermore, the service staff document the audit plan and make it visible to the customer.

"Office Research" activity:
The purpose of Office Research is to carry out the audit activities that can be done remotely. In this activity, the service staff collect remote diagnostics according to the audit checklist. They further collect information about the customer-specific product installation. The output of the activity is data that is ready for data analysis.
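The split between Office Research and the optional Field Research that follows can be pictured as working through a checklist and separating what can be gathered remotely from what requires a site visit. The checklist items and collector functions below are hypothetical, a sketch of the idea rather than the case company's tooling.

```python
# Illustrative sketch of the "Office Research" activity: remote
# diagnostics are collected item by item according to an audit
# checklist; items with no remote collector are deferred to the
# optional Field Research site visit. All entries are invented.

def collect_remote_diagnostics(checklist, collectors):
    """Run the remote collector for each checklist item that has one;
    return (collected data, items left for Field Research)."""
    collected, remaining = {}, []
    for item in checklist:
        if item in collectors:
            collected[item] = collectors[item]()
        else:
            remaining.append(item)
    return collected, remaining

checklist = ["os_versions", "alarm_counts", "instrument_condition"]
collectors = {
    "os_versions": lambda: {"operator_station_1": "v7.2"},
    "alarm_counts": lambda: {"last_30_days": 412},
}
data, needs_site_visit = collect_remote_diagnostics(checklist, collectors)
# "instrument_condition" cannot be checked remotely, so it remains
# for the optional Field Research activity.
```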

Fig. 2 Description of the Audit Process.



Field research –activity (optional):
The purpose of Field Research is to acquire supplementary information/data during site visits. This activity is optional if the Office research –activity is not sufficient. In this activity, Service staff carries out field research tasks according to product-specific checklists (check instruments, get information about maintenance, training needs, remarks concerning configuration, visual check, check the function of the product, corrosion, etc.). Furthermore, staff collects additional installation information from the customer premises (customer-specific product installation), if needed.

Data analysis/report –activity:
The purpose of Data analysis/Report is to analyze the collected data and prepare a report that can be communicated to the customer. The Data analysis –task analyses audit data and observations. Service staff utilize the audit checklist and consult R&D in the analysis, if needed. Depending upon the audit, the analysis may contain e.g.: maintenance, part or product obsolescence, replacements, inventory, needs for training, etc. During the analysis the customer should set aside time and contact persons to answer questions that may arise concerning the audit data. Service staff and the manager define recommendations based upon the audit. The Service Manager identifies sales leads related to the audit results. In addition, staff will update installation information into InstalledBase if discrepancies have been observed. The audit report will contain an introduction, definition of scope, results, along with conclusions and recommendations. The report will be reviewed internally and stored into the Customer Extranet, and the customer will be informed (well in advance, in order to allow time for the customer to check the report).

Presentation/communicate/negotiations –activity:
This activity presents the results to the key stakeholders and agrees on future actions/roadmaps. The Service Manager agrees with the customer on the time and participants of the result presentation event. The results will be presented and discussed. The Service Manager negotiates about the recommendations and defines actions/roadmaps based on the recommendations (a first step towards price and content negotiations).

5. Case analysis

Integrated service/product development has been studied a lot. However, there is less information available on how, in practice, the needs of the Service organization could be taken into account during product development. What kind of service/R&D collaboration could improve the quality and lead time of the industrial services? The target is not to describe a service development process but to try to understand and collect industrial best practices that increase service/R&D collaboration and transparency so that customers can be served better and faster. Naturally, these practices are highly service dependent, since each service needs different things from the R&D organization. During the interviews, it became obvious that already in the product platform Business Planning phase there has to be an analysis activity on how new proposed features of the system will be supported by services and what kind of effects there are for different services (e.g. compatibility). Therefore, already in the system business planning phase one should consider technical support, product/technology lifecycle and version compatibility issues from the service viewpoint before the implementation starts.

Based on the cases above, the following service/R&D collaboration practices were identified. Basically, both cases highlight communication and information transparency between the organizational units.

5.1 Case 1: collaboration related to Upgrade –service

In case 1 there was a nominated person who works in the service/R&D interface, i.e. a Product Manager who works in the Service interface (Fig. 3). This person defines and updates a life-cycle rules document that contains information e.g.:

- how long each technology will be supported; in other words, e.g. how long the particular version of each operating system (OS) will be supported (product & service/security packs), along with considerations of whether there is any possibility of extended support
- hardware-software compatibility (e.g. OS version vs. individual workstations)
- compatibility information showing how different sub-systems are compatible with each other (compatible, compatibility restrictions, not compatible)
- other rules or checklists containing what needs to be considered when conducting upgrades (conversions of file formats, etc.)
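For illustration only, a life-cycle rules document of the kind listed above could be captured as a small machine-readable structure. Every product name, version, date and field name below is hypothetical, not from the case company:

```python
# Illustration only: one way the described life-cycle rules document could be
# captured in machine-readable form.  All names, versions and dates are
# hypothetical examples.

LIFECYCLE_RULES = {
    # how long each technology (e.g. an OS version) will be supported
    "technology_support": {
        "os-9.x":  {"supported_until": "2018-12", "extended_support_possible": True},
        "os-10.x": {"supported_until": "2021-06", "extended_support_possible": False},
    },
    # hardware-software compatibility (OS version vs. workstation model)
    "hw_sw_compatibility": {
        ("os-10.x", "workstation-A"): "compatible",
        ("os-10.x", "workstation-B"): "compatibility restrictions",
    },
    # how different sub-systems are compatible with each other
    "subsystem_compatibility": {
        ("controller-v2", "ui-v3"): "compatible",
        ("controller-v1", "ui-v3"): "not compatible",
    },
    # other checklists to consider when conducting upgrades
    "upgrade_checklist": ["convert project file formats", "verify backups"],
}

def subsystems_compatible(a, b):
    """Look up the compatibility class of two sub-systems (order-insensitive)."""
    rules = LIFECYCLE_RULES["subsystem_compatibility"]
    return rules.get((a, b)) or rules.get((b, a)) or "unknown"
```

A rule set in roughly this shape is the kind of input that a tool such as the InstalledBase system mentioned in the case could consume when partly automating the generation of lifecycle plans.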


Fig. 3 Collect lifecycle rules -process.

The rules are used by the Service function in order to understand the lifecycle effects on the system. For instance, in the Upgrade Planning -process, in the upgrade analysis -activity, this information is used to compose life cycle plans for customers. The Product Manager that works in the Service interface coordinates the composition of the lifecycle rules. These rules originate from internal and external sources. External information is collected from third-party system providers (COTS providers). This information comes, for instance, from operating system providers (a roadmap that shows how long operating system versions will be supported). Internal lifecycle information originates from R&D (product creation and technology research functions). Internal lifecycle information defines the in-house developed automation system components and their support policy. Furthermore, lifecycle information about system dependencies is also important (compatibility information). Dependency information shows the dependencies between sub-systems and components so as to detect how changes in one sub-system may escalate to other sub-systems in a customer's configuration. Finally, the rules are also affected by the company's overall lifecycle policy (i.e. the policy on how long (and how) the company decides to support the systems). Some of these rules are implemented into the InstalledBase tool that partly automates the generation of the lifecycle plan. However, since every customer tends to be unique, some rules need to be applied manually depending on the upgrade needs.

Based on this case we could compose a task list for the Product Manager who works in the service interface. The Product Manager's task is to increase and facilitate communication between the R&D and Service organizations (collaboration between organizational units):

1. Coordinates the collection, maintenance and documentation of lifecycle rules in cooperation with R&D to support upgrade planning.
2. Communicates to R&D how they should prepare for lifecycle issues (how R&D should take into account service needs).
3. Defines the lifecycle policy with company management.
4. Coordinates that Service Managers are creating lifecycle plans for their customers. The objective is that there are as many lifecycle plans as possible (every customer is a business possibility).
5. Participates in the lifecycle support decision making together with R&D (e.g. replacement/spare part decisions, compatibility between platform releases). For instance:
   - decisions concerning how long the company provides support for different technologies/components, and decisions on what technologies will be used (smaller changes/more significant changes); the Service organization will make the decision (in cooperation with R&D)
   - decisions about compatibility: Service provides needs/requirements for compatibility (based on effects on the service business), and R&D tells what is possible => needs and possibilities are combined so that optimal compatibility is possible for the upgrade business (the Service organization makes the decision)
   - determines the volume/quantity of components there are in the field (checked from InstalledBase) => effects on the content of the next platform release and what support is needed from the service viewpoint.


5.2 Case 2: Collaboration related to Audit –service

In case 2, the Service staff utilizes audit checklists that have been prepared collaboratively by Service and R&D, and consults R&D in the audit analysis, if required. The training team that works in the Service organization is responsible for the coordination of the collection and maintenance of the audit checklists (Fig. 4). The checklists are composed and maintained in cooperation with R&D. Checklists are product-specific, since different issues need to be checked depending on the product type. Furthermore, checklists require constant updates as the product platforms evolve, e.g. one needs to check different issues in products of different ages.

Fig. 4 Compose audit checklist -process.

6. Discussion and conclusions

The importance of industrial services has increased, and there need to be systematic practices/processes to support service and product development. This has also been indicated in other studies, e.g. in [1, 2]. However, there is less information available concerning how, in practice, the needs of the Service organization could be taken into account during product development. What kind of service/R&D collaboration could improve the quality and lead time of the industrial services? In this article, the objective is not to describe a service development process but rather to try to understand and collect industrial best practices that increase the collaboration and transparency between service and R&D organizations so that customers can be served better and faster.

This article discusses the collaboration and transparency of Service and R&D organizations using two cases that give practical examples of the collaboration, i.e. what the collaboration and transparency between Service and R&D organizations mean in a real-life industrial environment. Furthermore, the article reports what kind of solutions the case company uses to realize the collaboration.

The article shows that in the case company service needs were taken into account already in the business planning phase of the product development process. Furthermore, there were roles and teams that worked between the service and R&D organizations to facilitate the interaction between the organizations. The approach has some similarities to the solution that is presented in [10]. Similarly, their case study indicated that there was a need to have units that worked in between the organizations and enabled the interaction.

Based on this research, it is possible to better understand the interfaces and needs between Service and R&D organizations. With this information it is possible to begin to improve the collaboration practices and solutions in the case company. This research provides other companies, and research institutes that work with industrial companies, practical real-life cases of how Service and R&D organizations collaborate. This research is based on a bottom-up approach studying two cases, and therefore the results are limited, since the collaboration is service dependent. This study does not try to explain why the case company has ended up with these practices and solutions, nor claim that these practices are directly applicable to other companies. However, we described the case context in fairly detailed form in section 3 and the Service processes in section 4. Therefore, this article provides industrial companies a good ground to compare their operational environment with the one presented in this article and to apply the collaboration practices when appropriate and applicable. For us, this study creates a basis for further research to study the collaboration needs of other industrial services, for instance preventive maintenance services, optimization services and security assessment services.

Acknowledgments

This research has been done in the ITEA2 project Promes [13] and the Artemis project Varies [12]. This research is funded by Tekes, the Artemis joint undertaking, Valmet Automation and VTT. The authors would like to thank all contributors for their assistance and cooperation.

References
[1] A. Tukker, U. Tischner. "Product-services as a research field: past, present and future. Reflections from a decade of research", Journal of Cleaner Production, 14, 2006, Elsevier, pp. 1552-1556.
[2] J.C. Aurich, C. Fuchs, M.F. DeVries. "An Approach to Life Cycle Oriented Technical Service Design", CIRP Annals - Manufacturing Technology, Volume 53, Issue 1, 2005, pp. 151-154.
[3] H.W. Borchers, H. Karandikar. "A Data Warehouse approach for Estimating and Characterizing the Installed Base of Industrial Products". International Conference on Service Systems and Service Management, IEEE, Vol. 1, 2006, pp. 53-59.
[4] R. Oliva, R. Kallenberg. "Managing the transition from products to services". International Journal of Service Industry Management, Vol. 14, No. 2, 2003, pp. 160-172.


[5] ICT for Manufacturing, The ActionPlanT Roadmap for Manufacturing 2.0.
[6] K. Sundaram. "Industrial Services - A New Frontier for Business Model Innovation and Profitability", Frost and Sullivan, https://www.frost.com/sublib/display-market-insight.do?id=287324039 (accessed 24th June 2015).
[7] M.A. Cusumano. "The Changing Software Business: Moving from Products to Services". Published by the IEEE Computer Society, January, 0018-9162/08, 2008, pp. 20-27.
[8] J. Hanski, S. Kunttu, M. Räikkönen, M. Reunanen. Development of knowledge-intensive product-service systems. Outcomes from the MaintenanceKIBS project. VTT, Espoo. VTT Technology: 21, 2012.
[9] M. Bano, D. Zowghi. "User involvement in software development and system success: a systematic literature review". Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering, 2013, pp. 125-130.
[10] N. Lakemond, T. Magnusson. "Creating value through integrated product-service solutions: Integrating service and product development", Proceedings of the 21st IMP-conference, Rotterdam, Netherlands, 2005.
[11] L. Poulsen. "Life-cycle and long-term migration planning". InTech magazine (a publication of the International Society of Automation), January/February 2014, pp. 12-17.
[12] Varies project web site (Variability In Safety-Critical Embedded Systems): http://www.varies.eu/ (accessed 24th June 2015).
[13] Promes project web site (Process Models for Engineering of Embedded Systems): https://itea3.org/project/promes.html (accessed 24th June 2015).
[14] S. Charalampidou, A. Ampatzoglou, P. Avgeriou. "A process framework for embedded systems engineering". Euromicro Conference series on Software Engineering and Advanced Applications (SEAA'14), IEEE Computer Society, 27-29 August 2014, Verona, Italy.
[15] J. Kääriäinen, S. Teppola, M. Vierimaa, A. Välimäki. "The Upgrade Planning Process in a Global Operational Environment", On the Move to Meaningful Internet Systems: OTM 2014 Workshops, Springer Berlin Heidelberg, Lecture Notes in Computer Science (LNCS), Volume 8842, 2014, pp. 389-398.

Dr. Antti Välimäki works as a senior project manager in Valmet Automation as a subcontractor. He has worked in many positions from designer to development manager in R&D and Service in Valmet/Metso Automation. He received a Ph.D. degree in 2011 in the field of Computer Science. He has over 20 years of experience with quality management and automation systems in industrial and research projects.

Dr. Jukka Kääriäinen works as a senior scientist at VTT Technical Research Centre of Finland in the Digital systems and services research area. He received a Ph.D. degree in 2011 in the field of Computer Science. He has over 15 years of experience with software configuration management and lifecycle management in industrial and research projects. He has worked as a work package manager and project manager in various European and national research projects.

Mrs. Susanna Teppola (M.Sc.) has worked as a Research Scientist at VTT Technical Research Centre of Finland since 2000. Susanna has over fifteen years' experience in ICT; her current research interests lie in the areas of continuous software engineering, software product/service management and variability. In these areas Susanna has conducted and participated in many industrial and industry-driven research projects and project preparations at both the national and international level.


Load Balancing in Wireless Mesh Network: a Survey

Maryam Asgari1, Mohammad Shahverdy2, Mahmood Fathy3, Zeinab Movahedi4

1Department of Computer Engineering, Islamic Azad University, Prof. Hessabi Branch, Tafresh, Iran [email protected] 2Department of Computer Engineering, Islamic Azad University, Prof. Hessabi Branch, Tafresh, and PhD student, University of Science and Technology, Tehran, Iran [email protected] 3,4Computer Engineering Faculty, University of Science and Technology, Tehran, Iran [email protected], [email protected]

Abstract
Wireless Mesh Network (WMN) is a state-of-the-art networking standard for the next generation of wireless networks. These networks are built on a backbone of wireless routers which forward each other's packets in a multi-hop manner. All users in the network can access the internet via gateway nodes. Because of the high traffic load towards a gateway node, it can become congested. A load balancing mechanism is required to balance the traffic among the gateways and prevent the overloading of any gateway. In this paper, we investigate different load balancing techniques in wireless mesh networks to avoid congestion in gateways, and we survey the effective parameters that are used in these techniques.

Keywords: clustering, Gateway, Load Balancing, Wireless Mesh Network (WMN).

1. Introduction

Wireless mesh networking is a new paradigm for next generation wireless networks. Wireless mesh networks (WMNs) consist of mesh clients and mesh routers, where the mesh routers form a wireless infrastructure/backbone and interwork with the wired networks to provide multi-hop wireless Internet connectivity to the mesh clients. Wireless mesh networking has emerged as a self-organizing and auto-configurable wireless networking approach to supply adaptive and flexible wireless Internet connectivity to mobile users.

This idea can be used with different wireless access technologies such as IEEE 802.11, 802.15 and 802.16-based wireless local area network (WLAN), wireless personal area network (WPAN), and wireless metropolitan area network (WMAN) technologies. Potential WMN applications include home networks, enterprise networks, community networks, and intelligent transport system networks such as vehicular ad-hoc networks.

Wireless local area networks (WLANs) are used to give mobile clients access to the fixed network with broadband network connectivity within the network coverage [1]. The clients in a WLAN use wireless access points that are interconnected by a wired backbone network to connect to the external networks. Thus, the wireless network has only a single hop on the path, and the clients need to be within a single hop to make connectivity with a wireless access point. Therefore, setting up such networks requires access points and a suitable backbone. As a result, deployment of large-scale WLANs is too costly and time consuming. However, WMNs can provide wireless network coverage of large areas without depending on a wired backbone or dedicated access points [1, 2]. WMNs are the next generation of wireless networks that provide the best services without any infrastructure. WMNs can diminish the limitations and improve the performance of modern wireless networks such as ad hoc networks, wireless metropolitan area networks (WMANs), and vehicular ad hoc networks [2, 3, 4 and 5].

WMNs are multi-hop wireless networks which provide internet everywhere to a large number of users. WMNs are dynamically self-configured, and all the nodes in the network automatically establish and maintain mesh connectivity among themselves in an ad hoc style. These networks are typically implemented at the network layer through the use of ad hoc routing protocols when the routing path is changed. This character brings many advantages to WMNs such as low cost, easy network maintenance, and more reliable service coverage.

A wireless mesh network has different members such as access points, desktops with wireless network interface cards (NICs), laptops, Pocket PCs, cell phones, etc. These members can be connected to each other via multiple hops. In the full mesh topology this feature brings many advantages to WMNs such as low cost, easy network maintenance and more reliable service coverage. In the mesh topology, one or multiple mesh routers can be connected to the Internet. These routers can serve as GWs and provide Internet connectivity for the entire mesh network. One of the most important challenges in these networks happens on a GW when the number of nodes connected to the internet via the GW suddenly increases. It means that GWs will be a bottleneck of the network and the


performance of the network strongly decreases [4, 5 and 6].

2. Related Work

The problem of bottlenecks in wireless mesh networks is an ongoing research problem, although much of the available literature [7, 8, 9, 10] addresses the problem without introducing a method for removing the bottleneck and/or a well-defined way to prevent congestion. In [11], the authors proposed the Mesh Cache system for exploiting the locality in client request patterns in a wireless mesh network. The Mesh Cache system alleviates the congestion bottleneck that commonly exists at the GW node in WMNs while providing better client throughput by enabling content downloads from closer high-throughput mesh routers. There are some papers related to optimization problems on dynamic and static load balancing across meshes [11]. Optimal load balancing across meshes is known to be a hard problem. Akyildiz et al. [12] exhaustively survey the research issues associated with wireless mesh networks and discuss the requirement to explore multipath routing for load balancing in these networks. However, maximum throughput scheduling and load balancing in wireless mesh networks is an unexplored problem. In this paper we survey different load balancing schemes in wireless mesh networks and briefly introduce some parameters which they use in their approaches.

3. Load Balancing Techniques

Increasing load in a wireless mesh network causes congestion, and it leads to different problems like packet drop, high end-to-end delay, throughput decline, etc. Various techniques that consider load balancing have been suggested and are discussed below.

3.1 Hop-Count Based Congestion-Aware Routing [13]

In this routing protocol, each mesh router rapidly finds multiple paths to the Internet gateways based upon a hop count metric, by design of the routing protocol. Each mesh router is equipped with a bandwidth estimation technique to allow it to forecast congestion risk; the router then selects a link with high available bandwidth for forwarding packets. The multipath routing protocol consists of two phases: a route discovery phase and a path selection phase.

In the route discovery phase, whenever a mesh router tries to find a route to an internet gateway, it initiates a route discovery process by sending a route request (RREQ) to all its neighbors. The generator of the RREQ marks the packet with its sequence number to avoid transmitting duplicate RREQs. A mesh router hearing the route request uses the information in the RREQ to establish a route back to the RREQ generator.

Fig. 1 Broadcasting RREQs [13]

During the path selection phase a source should decide which path is the best one among the multiple paths figured out in the first phase. The path selection can be prioritized in the following order:

(a) If there exist multiple paths to the source's primary gateway, then take the path with minimum hop count; if there is still a tie, we can randomly opt for a path.
(b) If there is no path to the source's primary gateway but several paths to secondary gateways, then take the path with minimum hop count; if there is still a tie, opt for a path randomly.

As is clear, congestion control is based on a bandwidth estimation technique; therefore the available bandwidth on a link should be identified. Here the consumed bandwidth information can be piggybacked onto the "Hello" message which is used to maintain local connectivity among nodes. Each host in the network determines its devoted bandwidth by monitoring the packets it sends onto the network. The mesh router can detect the congestion risk happening on each of its links by the bandwidth estimation technique. A link is at risk of congestion whenever the available bandwidth of that link is less than a threshold value of bandwidth. If a link cannot handle more traffic, it will not accept more requests over that link. The primary benefit of this protocol is that it simplifies the routing algorithm, but it needs precise knowledge about the bandwidth of each link.

3.2 Distributed Load Balancing Protocol [14]

In this protocol the gateways coordinate to reroute flows from congested gateways to other underutilized gateways. This technique also considers interference, which can be appropriate for practical scenarios, achieving good results


and improving on shortest path routing. Here the mesh network is divided into domains. A domain di can be defined as a set of routers which receive internet traffic and a gateway which serves them. For each domain a specific capacity is assigned and is compared against the load in the domain. The domain is considered overloaded if the load exceeds the sustainable capacity. To avoid congestion in a domain we can reroute the traffic. This technique does not impose any routing overhead in the network.

Fig. 2 Mesh network divided into domains for load balancing [14]

3.3 Gateway-Aware Routing [15]

In [15] a gateway-aware mesh routing solution is proposed that selects gateways for each mesh router based on the multihop route in the mesh as well as the capacity of the gateway. A composite routing metric is designed that picks high-throughput routes in the presence of multiple gateways. The metric is able to identify the congested part of each path, and select a suitable gateway. The gateway capacity metric can be defined as the time needed to transmit a packet of size S on the uplink and is expressed by

gwETT = ETXgw × S / Bgw    (1)

where ETXgw is the expected transmission count for the uplink and Bgw is the capacity of the gateway. For forwarding packets a GARM (Gateway Aware Routing Metric) is defined as follows:

GARM = β × Mi + (1 − β) × (mETT + gwETT)    (2)

This Gateway-Aware Routing Metric has two parts. The first part of the metric is for the bottleneck capacity and the second part accounts for the delay of the path. The β is used for balancing between these two factors. The gateway with the minimum GARM value can be chosen as the default gateway for balancing the load. This scheme overcomes the disadvantage of the accurate bandwidth estimation required in [6] and also improves network throughput.

3.4 DGWLBA: Distributed Gateway Load Balancing Algorithm [16]

In [16] gateways execute DGWLBA to attain load balancing. DGWLBA starts by assigning all routers to their nearest gateway, which is called the NGW solution. The next steps consist in trying to reroute flows from an overloaded domain d1 to an uncongested domain d2 such that the overload of both domains is reduced.

Fig. 3 WMNs divided into 3 domains each having capacity 25 [16]

If a domain is overloaded, its sinks are checked in descending order of distance to their serving gateway. This is done to give preference to border sinks. The farther a sink is from its serving gateway, the less it will harm other flows of its domain if it is rerouted, and its path to other domains will be shorter, thus improving performance. For the same reason, when a sink is chosen, candidate domains are checked in ascending order of distance to the sink. Next, to perform the switching of domains, the overload after the switch must be less than the overload before the switch (lines 9-11). Lastly, the cost of switching is checked; nGWs is the gateway nearest to sink s. Only if the cost is less than the switching threshold ∆s will the switch be performed (line 12). This rule takes into account the existence of contention, because it prevents the establishment of long paths, which suffer from intra-flow interference and increase inter-flow interference in the network, and gives preference to border sinks. Hence this approach successfully balances load in overloaded domains considering congestion and interference.

ALGORITHM
for each gateway GWi do
    di = { };
    for each sink s do
        if (distance(s, GWi) = minimum)
            Add sink s to di;
For domain d1 in D do
    if load(d1) > Cd1 then
        For sink s in d1 do


            For domain d2 in D do
                If d1 = d2 then
                    Continue
                Ovldbefore = ovld(d1) + ovld(d2)
                Ovldafter = ovld(d1 - {s}) + ovld(d2 U {s})
                If ovldafter < ovldbefore then
                    If dist(s, GW2) / dist(s, nGWs) < ∆s then
                        d1 = d1 - {s}
                        d2 = d2 U {s}
                        break;
            If load(d1) ≤ Cd1 then
                Break;

3.5 Load Balancing in WMNs by Clustering [17]

In [17] the authors proposed load balancing schemes for WMNs by clustering. In the first step all nodes are clustered to control their workload. If the workload on a GW increases up to the maximum capacity of the GW, then the cluster is broken. With respect to the gateway capacities, gateway overload can be predicted. Because selecting a new GW and establishing a route table is time consuming, a third scheme is proposed in which GW selection and creating the route table is done before breaking the cluster. They also considered some parameters for selecting the new GW in the new cluster, which are combined in the following formula:

G_Value = (Power × Power_processing × Constancy) / (Velocity × Distance)    (3)

where Power is the power of a node, Power_processing is the processing power of each node, Constancy is the time during which a node actively exists in the cluster, Velocity is the speed of each node, and Distance is the distance of the node to the center of the cluster. Using the above formula, they calculate G_Value for each node in a cluster; the node that has the larger G_Value is more suitable for being a GW.

Fig. 4 Breaking a cluster [17]

Although the paper considers most of the design aspects of the proposed infrastructure, it leaves some open issues and questions: for instance, surveying load balancing of multi-channel GWs in clustered wireless mesh networks, and finding the maximum throughput of nodes in cluster-based wireless mesh networks. Another open issue is using fuzzy logic for breaking the clusters.

4. Conclusion

Load balancing is one of the most important problems in wireless mesh networks that needs to be addressed. The nodes in a wireless mesh network tend to communicate with gateways to access the internet, thus gateways have the potential to be a bottleneck point. Load balancing is essential to utilize all the available paths to the destination and prevent overloading the gateway nodes. In this paper we surveyed different load balancing schemes with various routing metrics that can be employed to tackle load overhead in the network. Table 1 summarizes the load balancing techniques which we surveyed in this paper.

Table 1: Summary of different techniques

Technique                                   | Metric                   | Advantages                            | Issues not Addressed
--------------------------------------------|--------------------------|---------------------------------------|------------------------------------------------------------
Hop-Count Based Congestion-Aware Routing    | Hop Count                | No routing overhead                   | Computational overhead; accurate bandwidth information required
Distributed Load Balancing Protocol         | Hop Count                | No routing overhead                   | Computational overhead
Gateway-Aware Routing                       | GARM                     | No routing overhead, high throughput  | Computational overhead
Distributed Gateway Load Balancing (DGWLBA) | Routing and Queue Length | Low end-to-end delay                  | Computational overhead
Load Balancing in WMNs by Clustering        | Queue Length             | Low end-to-end delay                  | Cluster initial formation parameter
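As a concrete illustration of the metric-based techniques summarized above, the GARM computation of Section 3.3 (Eq. (1) and Eq. (2)) can be sketched as follows. The numeric inputs (ETX, packet size, bandwidths, Mi, mETT, β) are hypothetical, chosen only to show how a default gateway would be selected:

```python
# Illustrative computation of Eq. (1) and Eq. (2) from Section 3.3.
# All numeric inputs are hypothetical illustration values.

def gw_ett(etx_gw, packet_size_bits, bandwidth_bps):
    """Eq. (1): gwETT = ETXgw * S / Bgw, the time to push a packet of size S uplink."""
    return etx_gw * packet_size_bits / bandwidth_bps

def garm(m_i, m_ett, gwett, beta=0.5):
    """Eq. (2): GARM = beta*Mi + (1-beta)*(mETT + gwETT).
    The first term reflects the bottleneck part of the path, the second its delay."""
    return beta * m_i + (1 - beta) * (m_ett + gwett)

# Score two candidate gateways and pick the one with the minimum GARM
# value as the default gateway, as the scheme prescribes.
candidates = {
    "gw1": garm(m_i=0.4, m_ett=0.02, gwett=gw_ett(1.2, 12000, 1e6)),
    "gw2": garm(m_i=0.9, m_ett=0.01, gwett=gw_ett(1.0, 12000, 2e6)),
}
best = min(candidates, key=candidates.get)
```

Here gw2 has the faster uplink, but its larger bottleneck term Mi dominates, so the metric still prefers gw1; this is exactly the trade-off that β balances.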


References
[1] Bicket, J., Aguayo, D., Biswas, S., Morris, R., (2005). Architecture and evaluation of an unplanned 802.11b mesh network, in: Proceedings of the 11th ACM Annual International Conference on Mobile Computing and Networking (MobiCom), ACM Press, Cologne, Germany, pp. 31-42.
[2] Aoun, B., Boutaba, R., Iraqi, Y., Kenward, G. (2006). Gateway Placement Optimization in Wireless Mesh Networks with QoS Constraints. IEEE Journal on Selected Areas in Communications, vol. 24.
[3] Hasan, A.K., Zaidan, A.A., Majeed, A., Zaidan, B.B., Salleh, R., Zakaria, O., Zuheir, A. (2009). Enhancement Throughput of Unplanned Wireless Mesh Networks Deployment Using Partitioning Hierarchical Cluster (PHC), World Academy of Science, Engineering and Technology 54.
[4] Akyildiz, I.F., Wang, X., Wang, W., (2005). Wireless mesh networks: a survey, Elsevier, Computer Networks 47, 445-487.
[5] Jain, K., Padhye, J., Padmanabhan, V.N., Qiu, L., (2003). Impact of interference on multihop wireless network performance, in Proceedings of ACM MobiCom, 66-80.
[6] Akyildiz, I., Wang, X., (2005). A survey on wireless mesh networks, IEEE Communications Magazine, vol. 43, no. 9, pp. s23-s30.
[7] Manoj, B.S., Ramesh, R., (2006). Wireless Mesh Networking, Chapter 8, Load Balancing in Wireless Mesh Networks, page 263.
[8] Saumitra Das, M., Himabindu Pucha, Charlie Hu, Y., (2006). Mitigating the Gateway Bottleneck via Transparent Cooperative Caching in Wireless Mesh Networks, NSF grants CNS-0338856 and CNS-0626703.
[9] Jangeun, J., Mihail, L., (2003). The Nominal Capacity of Wireless Mesh Networks, IEEE Wireless Communications, vol. 10, no. 5, pp. 8-14.
[10] Abu, R., Vishwanath, R., Dipak, G., John, B., Wei, L., Sudhir, D., Biswanath, M., (2008). Enhancing Multi-hop Wireless Mesh Networks with a Ring Overlay, SECON workshop.
[11] Horton, G. (1993). A multi-level diffusion method for dynamic load balancing, 19, pp. 209-229.
[12] Akyildiz, I., Wang, X., Wang, W., (2005). Wireless Mesh Networks: A Survey, Computer Networks Journal 47, (Elsevier), pp. 445-487.
[13] Hung Quoc, V., Choong Seon, H., (2008). Hop-Count Based Congestion-Aware Multi-path Routing in Wireless Mesh Network, International Conference on Information Networking, pp. 1-5.
[14] Gálvez, J.J., Ruiz, P.M., Skarmeta, A.F.G., (2008). A Distributed Algorithm for Gateway Load-Balancing in Wireless Mesh Networks, Wireless Days, WD '08, 1st IFIP, pp. 1-5.
[15] Prashanth, A.K., David, L., Elizabeth, M., (2010). Gateway-aware Routing for Wireless Mesh Networks, IEEE International Workshop on Enabling Technologies and Standards for Wireless Mesh Networking (MeshTech), San Francisco.
[16] Gupta, B.K., Patnaik, S., Yang, Y., (2013). Gateway Load Balancing in Wireless Mesh Networks, International Conference on Information System Security and Cognitive Science, Singapore.
[17] Shahverdy, M., Behnami, M., Fathy, M., (2011). A New Paradigm for Load Balancing in WMNs, International Journal of Computer Networks (IJCN), Volume (3), Issue (4).

64

Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

Mobile Banking Supervising System - Issues, Challenges & Suggestions to Improve Mobile Banking Services

Dr. K. Kavitha, Assistant Professor, Department of Computer Science, Mother Teresa Women's University, Kodaikanal, [email protected]

Abstract

Banking is one of the largest financial institutions and constantly strives to provide better customer service. To improve service quality, banking services have been expanded to mobile technology, and mobile banking now plays a vital role in the banking sector. This technology gives customers access to all of their account information and saves time: customers can avail all financial services such as credit, debit, money transfer, bill payment, etc. on their mobiles using this application, avoiding time spent in a bank branch. Most banks now provide financial services through mobile phones, but the majority of people still do not prefer this service over ATM or online banking because of the risk factor. The main objective of this paper is to discuss the benefits of mobile banking, its issues, and suggestions to improve mobile banking services.

Table 1 - Technology Usage Survey

Issues         ATM   Online Banking   Mobile Banking
Preferable     25    20               5
Risk Factor    30    40               75

Keywords: ATM, Online Service, Mobile Banking, Risk Rating, MPIN, Log Off

1. Introduction

The Mobile Banking System allows customers to avail all financial services through phones or tablets. Traditional mobile banking services were offered through SMS, which is called SMS banking: whenever the customer made a transaction, either debit or credit, an SMS was sent to the customer accordingly. But this service covers only the two transaction types, credit and debit, and any other benefit comes at the cost of SMS charges. New technology has rapidly changed the traditional way of doing banking services, and banking services have expanded to take advantage of it. Using iPhones, customers can download and use mobile applications for further financial services. This service saves customers from visiting the branch premises and provides more services. The usage of these technological financial services was tested with 50 customers, and most of them indicated that the riskiest service is mobile banking rather than ATM or online services, as shown in Table 1.

Figure 1 - Technology Usage Survey Chart

The above table and figure show that the major risk falls under the third category, mobile banking. ATM and online services carry minimum risk compared with mobile banking services, so the survey indicates that the most preferred service is ATM, with online services as the next option. This paper studies the benefits, limitations and suggestions to improve mobile banking services.

2. Related Work

Renju Chandran [1] suggested some ideas and presented three steps to run and improve mobile banking services effectively. The author presented the benefits, limitations and problems faced by the customer during mobile banking transactions and suggested a method for improving the service. Aditya Kumar Tiwari et al. [2] discussed mobile banking advantages, drawbacks, security issues and challenges in mobile banking services and proposed

65

Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

some ideas to solve mobile banking security issues. V. Devadevan [3] discussed mobile compatibility, the mindset about mobile banking acceptance, and security issues. The author showed from the study that the evolution of eminent technologies in communication systems and mobile devices is a major factor and challenge in the frequent change of mobile banking solutions, and suggested creating awareness among existing customers and providing special benefits for mobile bankers, which will increase uptake of the service. MD. Tukhrejul Inam et al. [4] described the present condition of mobile banking services in Bangladesh and showed the prospects and limitations of mobile banking in their country, suggesting that Bangladeshi banks adopt mobile banking services to make their customers' lives easier. "Mobile phones target the world's nonreading poor" [5] discussed modern cell phone usage and functionality.

3. Objectives of Mobile Banking System

The following steps are discussed in the next sections:
1. Benefits & limitations of mobile banking
2. Identify the major issue in mobile banking
3. Suggestion proposed to improve mobile banking services

3.1 Benefits & Limitations of Mobile Banking

Benefits of Mobile Banking
i. Reduce Timing: Instead of going to the bank premises and waiting in a queue to check account transactions, customers can check all details through their mobile phones.
ii. Mini Statement: In offline mode, customers can see their transaction details through a mini statement by using the MPIN.
iii. Security: During transactions like amount transfers, an SMS verification code is provided to check for authorised persons.
iv. Availability: Customers can avail all the services through mobile phones at any time.
v. Ease of Use: User friendly; customers can access all the financial services with little knowledge of the mobile application.

Limitations of Mobile Banking
i. Compulsory Internet Connection & Tower Problem: An Internet connection is necessary to avail these services. Customers residing in rural areas may not be able to avail the services because of tower problems or line breakage.
ii. AntiVirus Software Updation: Many customers are not aware of antivirus software, so spyware can affect their mobiles.
iii. Forget to Log Off: If a customer's mobile phone is stolen, an unauthorised person can reveal all their transaction details.
iv. Mobile Compatibility: Only the latest mobile phones are suited for availing these services.
v. Spend Nominal Charge: For regular usage, customers have to spend nominal charges on transactions.

3.2 Identify the Major Issue in Mobile Banking

Customers mostly prefer ATM and online services; mobile banking is not preferred by many because of the above limitations, and customers have to be aware of mobile banking services before usage. The awareness of and risk in mobile banking was tested with 50 customers, and compared with the other risk factors most pointed out the "forget to log off" issue among these limitations, as follows.

Table 2 - Risk Ratings in Mobile Banking (ratings from 5 = highest risk to 1 = lowest)

Issues in Mobile Banking                          5    4    3    2    1
Compulsory Internet Connection & Tower Problem    25   5    5    10   5
Anti Virus Software Updation                      10   20   20
Forget to Log Off & Misuse Mobile Phones          35   5    10
Mobile Compatibility                              10   40
Spend Nominal Charge                              10   40
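The survey counts in Table 2 can be tallied programmatically. The toy script below transcribes only the first ratings column, under the assumption that it holds the rating-5 ("highest risk") responses of the 50 surveyed customers:

```python
# Rating-5 ("highest risk") responses per issue, transcribed from Table 2
# (assumed column placement; 50 customers surveyed in total).
rating5_counts = {
    "Compulsory Internet Connection & Tower Problem": 25,
    "Anti Virus Software Updation": 10,
    "Forget to Log Off & Misuse Mobile Phones": 35,
    "Mobile Compatibility": 10,
    "Spend Nominal Charge": 10,
}
TOTAL_CUSTOMERS = 50

# Issue most frequently rated 5, and its share of respondents.
top_issue = max(rating5_counts, key=rating5_counts.get)
share = 100 * rating5_counts[top_issue] // TOTAL_CUSTOMERS

print(top_issue)    # Forget to Log Off & Misuse Mobile Phones
print(f"{share}%")  # 70%
```

The 70% share recovered here is the same figure the paper quotes for the "forget to log off" issue.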


Figure 2 - Risk Ratings in Mobile Banking Chart

The collected data also indicates that 70% of customers mentioned "forget to log off & misuse of mobile phones due to theft" as the major risk factor in the above list. The author suggests an idea to improve mobile banking services in the next section.

3.3 Suggestion Proposed to Improve the Mobile Banking Services

In mobile banking applications, whenever we need to avail financial services we have to enter our user name and password to access our account transactions, and after completing the task customers have to log off. But sometimes, with regular usage, customers may forget or postpone logging off, and the mobile application stays signed in to the corresponding customer's account. If the customer's mobile phone is then stolen, hackers can reveal all their transaction details very easily, which becomes a very big issue. The banking sector has to avoid this type of problem by using new emerging technologies. At the same time, customers also have to be aware of these services: how to use the apps, what security measures are taken by the banking sector, and how to avoid major risks from unauthorized persons.

4. Proposed Mobile Banking Supervising System [MBSS]

This paper suggests implementing a Mobile Banking Supervising System [MBSS] alongside mobile banking applications to protect and keep track of all sensitive information. To track all transactions, MBSS keeps a stopwatch to monitor the services regularly: log-in timing, transaction particulars, and logging off or skipping the application. Everything is monitored by MBSS. If the customer skips the mobile app or forgets to log off and the session keeps staying open, the automatic log-off function is invoked by MBSS. Clock time limits are not fixed; customers can change the limits at any time.

Figure 3 - MBSS Model Design (the MBSS monitors the log-in, transaction and log-out processes of mobile banking and invokes the log-off service when a transaction returns, is skipped, or the time expires)

5. Conclusion

Mobile banking is a convenient financial service for customers. Customers can avail all account transactions such as bill payment, credit, debit, fund transfer, etc. It offers many benefits with ease of use, but it still has some limitations. This paper discussed the major issues faced by customers and the banking sector through mobile banking services and suggested an idea, the Mobile Banking Supervising System, for protecting account information from unauthorised persons.

References
[1] Renju Chandran, "Pros and Cons of Mobile Banking", International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014. ISSN 2250-3153.
[2] Aditya Kumar Tiwari, Ratish Agarwal, Sachin Goyal, "Imperative & Challenges of Mobile Banking in India", International Journal of Computer Science & Engineering Technology, Volume 5, Issue 3, March 2014. ISSN 2229-3345.
[3] V. Devadevan, "Mobile Banking in India - Issues & Challenges", International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 6, June 2013. ISSN 2250-2459.
[4] MD. Tukhrejul Inam, MD. Baharul Islam, "Possibilities and Challenges of Mobile Banking: A Case Study in Bangladesh", International Journal of Advanced Computational Engineering and Networking, Volume 1, Issue 3, May 2013. ISSN 2321-2106.
[5] L.S. Dialing, "Mobile phones target the world's non-reading poor", Scientific American, Volume 296, Issue 5, 2007.
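The automatic log-off behaviour described for the proposed MBSS in Section 4 can be sketched as a small inactivity monitor. This is a hypothetical illustration only: the class and method names are invented, and the configurable timeout stands in for the user-adjustable "clock time limit" the paper mentions.

```python
import time

class MBSSSession:
    """Toy session supervisor that auto-logs-off after an inactivity limit."""

    def __init__(self, limit_seconds=120):
        self.limit = limit_seconds      # user-adjustable clock time limit
        self.logged_in = False
        self.last_activity = None

    def log_in(self):
        self.logged_in = True
        self.last_activity = time.monotonic()

    def record_transaction(self):
        # Every monitored action refreshes the inactivity clock.
        self.last_activity = time.monotonic()

    def supervise(self, now=None):
        """Invoke the automatic log-off if the user forgot to log out."""
        if not self.logged_in:
            return False
        now = time.monotonic() if now is None else now
        if now - self.last_activity > self.limit:
            self.log_off()
            return True                 # auto log-off was invoked
        return False

    def log_off(self):
        self.logged_in = False

s = MBSSSession(limit_seconds=60)
s.log_in()
# Simulate the user walking away: pretend 61 seconds have passed.
assert s.supervise(now=s.last_activity + 61) is True
assert s.logged_in is False
```

In a real deployment the `supervise` check would run periodically in the background rather than being called by hand.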


A Survey on Security Issues in Big Data and NoSQL

Ebrahim Sahafizadeh1, Mohammad Ali Nematbakhsh2

1 Computer Engineering Department, University of Isfahan, Isfahan, 81746-73441, Iran [email protected]

2 Computer Engineering Department, University of Isfahan, Isfahan, 81746-73441, Iran [email protected]

Abstract This paper presents a survey on security and privacy issues in big 2.1 Big Data data and NoSQL. Due to the high volume, velocity and variety of big data, security and privacy issues are different in such streaming data Big data is a term refers to the collection of large data sets infrastructures with diverse data format. Therefore, traditional which are described by what is often referred as multi 'V'. In security models have difficulties in dealing with such large scale [8] 7 characteristics are used to describe big data: data. In this paper we present some security issues in big data and Volume, variety, volume, value, veracity, volatility and highlight the security and privacy challenges in big data complexity, however in [9], it doesn't point to volatility and infrastructures and NoSQL databases. complexity. Here we describe each property. Keywords: Big Data, NoSQL, Security, Access Control Volume: Volume is referred to the size of data. The size of data in big data is very large and is usually in terabytes and 1. Introduction petabytes scale. Velocity: Velocity referred to the speed of data producing and The term big data refers to high volume, velocity and variety processing. In big data the rate of data producing and information which requires new forms of processing. Due to processing is very high. these properties which are referred sometimes as 3 'V's, it Variety: Variety refers to the different types of data in big becomes difficult to process big data using traditional database data. Big data includes structured, unstructured and semi- management tools [1]. A new challenge is to develop novel structured data and the data can be in different forms. techniques and systems to extensively exploit the large Veracity: Veracity refers to the trust of data. volume of data. Many information management architectures Value: Value refers to the worth drives from big data. have been developed towards this goal [2]. 
Volatility: "Volatility refers to how long the data is going to be valid and how long it should be stored" [8]. As developing new technologies and increasing the use of big Complexity: "A complex dynamic relationship often exists in data in several scopes, security and privacy has been big data. The change of one data might result in the change of considered as a challenge in big data. There are many security more than one set of data triggering a rippling effect" [8]. and privacy issues about big data [1, 2, 3, 4, 5 and 6]. In [7] Some researchers defined the important characteristics of big top ten security and privacy challenges in big data is data are volume, velocity and variety. In general, the highlighted. Some of these challenges are: secure characteristics of big data are expressed as three Vs. , secure data storages, granular access control and data provenance. 2.2 NoSQL

In this paper we focus on researches in access control in big The term NoSQL stands for "Not only SQL" and it is used for data and security issues on NoSQL databases. In section 2 we modern scalable databases. Scaling is the ability of the system to increase throughput when the demands increase in terms of have an overview on big data and NoSQL technologies, in data processing. To support big data processing, the platforms section 3 we discuss security challenges in big data and incorporate scaling in two forms of scalability: horizontal describe some access control model in big data and in section scaling and vertical scaling [10]. 4 we discuss security challenges in NoSQL databases. Horizontal Scaling: in horizontal scaling the workload distributes across many servers. In this type of scalability 2. Big Data and NoSQL Overview multiple systems are added together in order to increase the throughput. In this section we have an overview on Big Data and NoSQL.
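Horizontal scaling in practice usually means partitioning (sharding) data across servers by key. The sketch below is a generic, hypothetical illustration of hash-based key distribution, not tied to any particular NoSQL product; the node names are invented:

```python
import hashlib

SERVERS = ["node-0", "node-1", "node-2"]  # hypothetical cluster

def shard_for(key: str, servers=SERVERS) -> str:
    """Map a record key deterministically to one server in the cluster."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Every key lands on exactly one node, and always the same one,
# so reads and writes for different keys spread across the cluster.
placement = {k: shard_for(k) for k in ["user:1", "user:2", "order:99"]}
assert all(node in SERVERS for node in placement.values())
assert shard_for("user:1") == shard_for("user:1")  # stable mapping
```

Production systems typically refine this with consistent hashing so that adding a server does not remap most existing keys.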


Vertical Scaling: in vertical scaling more processors, more memory and faster hardware are installed within a single server.
The main advantages of NoSQL are presented in [11] as the following: "1) reading and writing data quickly; 2) supporting mass storage; 3) easy to expand; 4) low cost". In [11] the data models that the studied NoSQL systems support are classified as key-value, column-oriented and document. There are many products claiming to be part of the NoSQL database family, such as MongoDB, CouchDB, Riak, Redis, Voldemort, Cassandra, Hypertable and HBase.
Apache Hadoop is an open source implementation of Google Bigtable [12] for storing and processing large datasets using clusters of commodity hardware. Hadoop uses HDFS, a distributed file system, to store data across clusters. In section 3 we give an overview of Hadoop and discuss an access control architecture presented for Hadoop.

3. Security Challenges and Access Control Models

There are many security issues around big data. In [7] the top ten security and privacy challenges in big data are presented. Secure computation in distributed frameworks is a challenge which concerns security in map-reduce functions. Secure data storage and transaction logs concern new mechanisms to prevent unauthorized access to data stores and to maintain availability. Granular access control is another challenge in big data: the problem here is preventing access to data by users who should not have access, and in this case traditional access control models have difficulty dealing with big data. Some mechanisms proposed for handling access control in big data are given in [2, 3, 4, 13 and 14]. Among the security issues in big data, data protection and access control are recognized as the most important security issues in [4].

Shermin [14] presents an access control model for NoSQL databases by extending the traditional role-based access control model. In [15] security issues in two of the most popular NoSQL databases, Cassandra and MongoDB, are discussed, and their security features and problems are outlined. The main problems for both Cassandra and MongoDB mentioned in [15] are the lack of encryption support for data files, weak authentication between clients and servers, simple authentication, and vulnerability to SQL injection and DoS attacks. It is also mentioned that neither of them supports RBAC or fine-grained authorization. In [5] the authors look at NIST risk management standards and define the threat sources, threat events and vulnerabilities. The vulnerabilities defined in [5] in terms of big data are insecure computation, end-point input validation/filtering, granular access control, insecure data storage and communication, and privacy-preserving data mining and analytics.

In some cases in big data it is necessary to have an access control model based on the semantic content of the data. To enforce access control in such content-centric big data sharing, a Content-Based Access Control (CBAC) model is presented in [2] using data content. In this case the semantic content of data plays the major role in access control decision making: "CBAC makes access control decisions based on the content similarity between user credentials and data content dynamically" [2].

Attribute relationship methodology is another method to enforce security in big data, proposed in [3] and [4]. Protecting the valuable information is the main goal of this methodology; therefore [4] focuses on attribute relevance in big data as a key element for extracting the information, assuming that an attribute with higher relevance is more important than other attributes. [3] uses a graph to model attributes and their relationships: attributes are expressed as nodes, relationships are shown by the edges between nodes, and the method selects the protected attributes from this graph. The method proposed in [4] is as follows: "First, all the attributes of the data is extracted and then generalize the properties. Next, compare the correlation between attributes and evaluate the relationship. Finally protect selected attributes that need security measures based on correlation evaluation" [4], and the method proposed in [3] is as follows: "All attributes are represented as circularly arranged nodes. Add the edge between the nodes that have relationships. Select the protect nodes based on the number of edge. Determine the security method for protect nodes" [3].

A suitable data access control method for big data in the cloud is attribute-based encryption [1]. A new scheme for enabling efficient access control based on attribute encryption is proposed in [1] as a technique to ensure the security of big data in the cloud. Attribute encryption is a method that allows data owners to encrypt data under an access policy such that only users who have permission to access the data can decrypt it. The problem with attribute-based encryption discussed in [1] is policy updating: when the data owner wants to change the policy, the data must be transferred back from the cloud, re-encrypted under the new policy, and uploaded again, which causes high communication overhead. The authors in [1] focus on solving this problem and propose a secure policy updating method.

Hadoop is an open source framework for storing and processing big data. It uses the Hadoop Distributed File System (HDFS) to store data on multiple nodes. Hadoop does not authenticate users and there is no data encryption or privacy in Hadoop; HDFS has no strong security model, and users can directly access data stored in data nodes without any fine-grained authorization [13, 16]. The authors of [16] present a survey on the security of Hadoop and analyze its security problems and risks. Some security mechanism challenges mentioned in [16] are the large scale of the system, partitioning and distributing files through the cluster, and executing tasks from different users on a single node. In [13] the authors describe some security risks in Hadoop and propose a novel access control scheme for storing data. This scheme includes Creating and Distributing Access Token, Gain Access Token and Access Blocks. The same scheme is also used with Secure Sharing Storage in the cloud. It can help the data owners control and audit access to their data, but the owners need to update the access token when the metadata of the file blocks changes.
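The CBAC idea quoted above bases access decisions on content similarity between user credentials and data. The cited papers do not give a concrete formula; as a toy stand-in, the sketch below uses Jaccard similarity over word sets with an assumed threshold (function names and the example data are invented):

```python
def jaccard(a: set, b: set) -> float:
    """Similarity between two word sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cbac_allow(user_credentials: str, document: str, threshold=0.2) -> bool:
    # Grant access only when the credential content is similar enough to
    # the document content (a toy version of the CBAC idea in [2]).
    return jaccard(set(user_credentials.lower().split()),
                   set(document.lower().split())) >= threshold

doc = "oncology clinical trial records for lung cancer patients"
assert cbac_allow("oncology researcher lung cancer", doc) is True
assert cbac_allow("marketing analyst", doc) is False
```

A real CBAC system would use richer semantic similarity (e.g. over document embeddings) rather than raw word overlap, but the decision structure is the same.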

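The attribute-relationship method quoted from [3] (attributes as circularly arranged nodes, edges for relationships, protection chosen by edge count) can be illustrated with a toy graph. The attributes and relationships below are invented for the example:

```python
from collections import defaultdict

# Hypothetical attribute relationships (edges between attribute nodes).
edges = [
    ("name", "address"), ("name", "phone"),
    ("name", "national_id"), ("national_id", "salary"),
    ("address", "phone"),
]

# Count the edges incident to each attribute node (its degree).
degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Following [3]: select the nodes with the most edges as protected attributes.
protected = sorted(degree, key=degree.get, reverse=True)[:2]
assert protected[0] == "name"  # degree 3: the most connected attribute
```

The intuition is that a highly connected attribute can be used to infer many others, so it is the one most worth protecting.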
69

4. Security Issues in NoSQL Databases

NoSQL stands for "Not Only SQL". NoSQL databases are not meant to replace traditional databases, but they are suitable for adopting big data when traditional databases are not appropriate [17]. NoSQL databases are classified as key-value databases, column-oriented databases, document-based databases and graph databases.

4.1 MongoDB

MongoDB is a document-based database. It manages collections of documents, supports complex datatypes and has high-speed access to huge data [11]. Flexibility, power, speed and ease of use are the four properties mentioned in [18] for MongoDB. All data in MongoDB is stored as plain text and there is no encryption mechanism to encrypt data files [19], which means that any malicious user with access to the file system can extract the information from the files. It uses SSL with X.509 certificates for secure communication between users and the MongoDB cluster and for intra-cluster authentication [17], but it does not support authentication and authorization when running in sharded mode [15]. Passwords are hashed with the MD5 algorithm, which is not very secure. Since Mongo uses JavaScript as an internal scripting language, the authors in [15] show that MongoDB is susceptible to scripting injection attack.

4.2 CouchDB

CouchDB is a flexible, fault-tolerant document-based NoSQL database [11]. It is an open source Apache project and it runs on the Hadoop Distributed File System (HDFS) [19]. CouchDB does not support data encryption [19], but it supports authentication based on both passwords and cookies [17]. Passwords are hashed using the PBKDF2 algorithm and are sent over the network using the SSL protocol [17]. CouchDB is susceptible to script injection and denial-of-service attacks [19].

4.3 Cassandra

Cassandra is an open source distributed storage system for managing big data. It is a key-value NoSQL database and is used at Facebook. The properties mentioned in [11] for Cassandra are the flexibility of the schema, support for range queries and high scalability. All passwords in Cassandra are hashed with the MD5 hash function, and passwords are very weak. If a malicious user can bypass client authorization, the user can extract the data, because there is no authorization mechanism in inter-node message exchange [17]. Cassandra is susceptible to denial-of-service attack because it runs one thread per client [19], and it does not support inline auditing [15]. Cassandra uses a query language called Cassandra Query Language (CQL), which is similar to SQL, and the authors of [15] show that an injection attack like SQL injection is possible on Cassandra through CQL. Cassandra also has problems managing inactive connections [19].

4.4 HBase

HBase is an open source column-oriented database modeled after Google Bigtable and implemented in Java. HBase can manage structured and semi-structured data, and it uses a distributed configuration and write-ahead logging. HBase relies on SSH for inter-node communication. It supports user authentication by the use of SASL (Simple Authentication and Security Layer) with Kerberos. It also supports authorization by ACLs (Access Control Lists) [17].

4.5 HyperTable

Hypertable is an open source high-performance column-oriented database that can be deployed on HDFS. It is modeled after Google's Bigtable and uses a table to store data as a big table [20]. Hypertable does not support data encryption or authentication [19]. It does not tolerate the failure of a range server: if a range server crashes, it is not able to recover the lost data [20]. Even though Hypertable uses the Hypertable Query Language (HQL), which is similar to SQL, it has no vulnerabilities to injection [19]. Additionally, no denial of service is reported for Hypertable [19].

4.6 Voldemort

Voldemort [23] is a key-value NoSQL database used at LinkedIn. This type of database matches keys with values, and the data is stored as key-value pairs. Voldemort supports data encryption if it uses BerkeleyDB as the storage engine. There is no authentication or authorization mechanism in Voldemort, and it does not support auditing either [21].

4.7 Redis

Redis is an open source key-value database. Data encryption is not supported by Redis: all data is stored as plain text and the communication between Redis clients and the server is not encrypted [19]. Redis does not implement access control; it provides only a tiny layer of authentication. Injection is impossible in Redis, since the Redis protocol has no concept of string escaping [22].

4.8 DynamoDB

DynamoDB is a fast and flexible NoSQL database used at Amazon. It supports both key-value and document data models [24]. Data encryption is not supported in DynamoDB, but the communication between client and server uses the HTTPS protocol. Authentication and authorization are supported by DynamoDB, and requests need to be signed using HMAC-SHA256 [21].
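The CQL/SQL injection weakness described for Cassandra and MongoDB comes down to building query text by string concatenation. The following is a generic illustration in plain Python, not an actual driver API; real drivers should use parameterized or prepared statements:

```python
def unsafe_query(username: str) -> str:
    # Vulnerable: attacker-controlled input is pasted into the query text,
    # so quotes in the input can change the query's logic.
    return f"SELECT * FROM users WHERE name = '{username}';"

def safer_query(username: str) -> str:
    # Minimal mitigation for this demo: reject quote/terminator characters.
    # (Parameterized statements are the proper fix in real drivers.)
    if "'" in username or ";" in username:
        raise ValueError("suspicious input rejected")
    return f"SELECT * FROM users WHERE name = '{username}';"

payload = "x' OR '1'='1"
assert "OR '1'='1" in unsafe_query(payload)  # query logic is altered
try:
    safer_query(payload)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

The same pattern explains why Redis is reported immune: its protocol carries arguments out-of-band, with no string escaping to subvert.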

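Several of the weaknesses above reduce to password storage: MongoDB and Cassandra are criticized for unsalted MD5, while CouchDB's PBKDF2 is stronger because it is salted and deliberately slow. A generic sketch with Python's standard hashlib (illustrative only, not these databases' actual code):

```python
import hashlib
import os

password = b"s3cret"

# Weak: unsalted, fast MD5 digest (the scheme criticized in [15] and [17]).
weak = hashlib.md5(password).hexdigest()

# Stronger: salted PBKDF2-HMAC with many iterations (CouchDB-style, per [17]).
salt = os.urandom(16)
strong = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

assert len(weak) == 32    # hex of a 128-bit digest; identical for equal passwords
assert len(strong) == 32  # 256-bit derived key
# Same password, different salt -> different derived key, defeating rainbow tables.
assert strong != hashlib.pbkdf2_hmac("sha256", password, os.urandom(16), 100_000)
```

The salt makes precomputed-table attacks useless, and the iteration count makes each brute-force guess expensive.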

4.9 Neo4J

Neo4j [25] is an open source graph database. Neo4j does not support data encryption, authorization or auditing. The communication between client and server is based on the SSL protocol [21].

5. Conclusion

With the increasing use of NoSQL in organizations, security has become a growing concern. In this paper we presented a survey on security and privacy issues in big data and NoSQL. We gave an overview of big data and NoSQL databases and discussed the security challenges in this area. Due to the high volume, velocity and variety of big data, traditional security models have difficulties dealing with such large scale data. Some researchers have presented new access control models for big data, which were introduced in this paper.

In the last section we described security issues in NoSQL databases. As mentioned, most NoSQL databases lack data encryption; to have a more secure database, sensitive database fields need to be encrypted. Some of the databases are vulnerable to injection, and sufficient input validation is needed to overcome this vulnerability. Some of them have no authentication mechanism and some have a weak one, so a strong authentication mechanism is needed to overcome this weakness. CouchDB uses the SSL protocol, HBase uses SASL, and Hypertable, Redis and Voldemort have no authentication, while the other databases have weak authentication. MongoDB and CouchDB are susceptible to injection, and Cassandra and CouchDB are susceptible to denial-of-service attacks. Table 1 briefly shows this comparison.

Table 1: The Comparison between NoSQL Databases

DB/Criteria   Data Model             Authentication   Authorization   Data Encryption   Auditing      Communication protocol   Potential for attack
MongoDB       Document               Not Support      Not Support     Not Support       -             SSL                      Script injection
CouchDB       Document               Support          -               Not Support       -             SSL                      Script injection and DOS
Cassandra     Key/Value              Support          Not Support     Not Support       Not Support   SSL                      Script injection (in CQL) and DOS
HBase         Column Oriented        Support          Support         Not Support       -             SSH                      Not reported for DOS and injection
HyperTable    Column Oriented        Not Support      -               Not Support       -             -                        -
Voldemort     Key/Value              Not Support      Not Support     Support           Not Support   -                        -
Redis         Key/Value              Tiny Layer       Not Support     Not Support       Not Support   Not Encrypted            -
DynamoDB      Key/Value & Document   Support          -               Not Support       -             HTTPS                    -
Neo4J         Graph                  -                Not Support     Not Support       Not Support   SSL                      -

Security (ICITCS), 2013 International Conference on, pp 1-4, References 2013 [1] K.Yang, Secure and Verifiable Policy Update Outsourcing for Big Data Access Control in the Cloud, Parallel and Distributed [5] M.Paryasto, A.Alamsyah, B.Rahardjo, Kuspriyanto, Big-data Systems, IEEE Transactions on , Issue 99, 2014 security management issues, Information and Communication Technology (ICoICT), 2nd International Conference on, pp 59- [2] W.Zeng, Y.Yang, B.Lou, Access control for big data using data 63, 2014 content, Big Data, IEEE International Conference on, pp. 45-47, 2013 [6] J.H.Abawajy,A. Kelarev, M.Chowdhury, Large Iterative Multitier Ensemble Classifiers for Security of Big Data, [3] S.Kim, J.Eom, T.Chung, Big Data Security Hardening Emerging Topics in Computing, IEEE Transactions on, Volume Methodology Using Attributes Relationship, Information 2, Issue 3, pp 352-363, 2014 Science and Applications (ICISA), 2013 International Conference on, pp 1-2, 2013 [7] Cloude Security Allience, Top Ten Big Data Security and Privacy Challenges, www.cloudsecurityalliance.org, 2012 [4] S.Kim, J.Eom, T.Chung, Attribute Relationship Evaluation Methodology for Big Data Security, IT Convergence and [8] K. Zvarevashe, M. Mutandavari, T. Gotora, A Survey of the Security Use Cases in Big Data, International Journal of


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

[9] M. D. Assuncao, R. N. Calheiros, S. Bianchi, M. A. S. Netto, R. Buyya, Big Data computing and clouds: Trends and future directions, Journal of Parallel and Distributed Computing, 2014
[10] D. Singh, C. K. Reddy, A survey on platforms for big data analytics, Journal of Big Data, 2014
[11] J. Han, E. Haihong, G. Le, J. Du, Survey on NoSQL Database, Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, pp. 363-366, 2011
[12] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, Bigtable: A Distributed Storage System for Structured Data, Google, 2006
[13] C. Rong, Z. Quan, A. Chakravorty, On Access Control Schemes for Hadoop Data Storage, International Conference on Cloud Computing and Big Data, pp. 641-645, 2013
[14] M. Shermin, An Access Control Model for NoSQL Databases, The University of Western Ontario, M.Sc. thesis, 2013
[15] L. Okman, N. Gal-Oz, Y. Gonen, E. Gudes, J. Abramov, Security Issues in NoSQL Databases, Trust, Security and Privacy in Computing and Communications (TrustCom), IEEE 10th International Conference on, pp. 541-547, 2011
[16] M. RezaeiJam, L. Mohammad Khanli, M. K. Akbari, M. Sargolzaei Javan, A Survey on Security of Hadoop, Computer and Knowledge Engineering (ICCKE), 2014 4th International Conference on, pp. 716-721, 2014
[17] A. Zahid, R. Masood, M. A. Shibli, Security of Sharded NoSQL Databases: A Comparative Analysis, Conference on Information Assurance and Cyber Security (CIACS), pp. 1-8, 2014
[18] A. Boicea, F. Radulescu, L. I. Agapin, MongoDB vs Oracle - database comparison, Emerging Intelligent Data and Web Technologies (EIDWT), 2012 Third International Conference on, pp. 330-335, 2012
[19] P. Noiumkar, T. Chomsiri, A Comparison the Level of Security on Top 5 Open Source NoSQL Databases, The 9th International Conference on Information Technology and Applications (ICITA2014), 2014
[20] A. Khetrapal, V. Ganesh, HBase and Hypertable for large scale distributed storage systems: A Performance evaluation for Open Source BigTable Implementations, Dept. of Computer Science, Purdue University, http://cloud.pubs.dbs.uni-leipzig.de/node/46, 2008
[21] K. Grolinger, W. A. Higashino, A. Tiwari, M. A. M. Capretz, Data management in cloud environments: NoSQL and NewSQL data stores, Journal of Cloud Computing: Advances, Systems and Applications, 2013
[22] http://redis.io/topics/security
[23] http://www.project-voldemort.com
[24] http://aws.amazon.com/dynamodb
[25] http://neo4j.com

Ebrahim Sahafizadeh: B.S. Computer Engineering (Software), Kharazmi University of Tehran, 2001; M.S. Computer Engineering (Software), Iran University of Science & Technology, Tehran, 2004; Ph.D. student at Isfahan University. Faculty member and Lecturer, Department of Information Technology, Payame Noor University, Boushehr.

MohammadAli Nematbakhsh: B.S. Electrical Engineering, Louisiana Tech University, USA, 1981; M.S. Electrical and Computer Engineering, University of Arizona, USA, 1983; Ph.D. Electrical and Computer Engineering, University of Arizona, USA, 1987. Micro Advanced Computer, Phoenix, AZ, 1982-1984; Toshiba Co., USA and Japan, 1988-1993; Computer Engineering Department, University of Isfahan, 1993-now.


Classifying Protein-Protein Interaction Type based on Association Pattern with Adjusted Support

Huang-Cheng Kuo and Ming-Yi Tai

Department of Computer Science and Information Engineering National Chiayi University Chia-Yi City 600, Taiwan [email protected]

Abstract
Proteins carry out their functions by means of interaction. There are two major types of protein-protein interaction (PPI): obligate interaction and transient interaction. In this paper, residues with geographical information on the binding sites are used to discover association patterns for classifying protein interaction type. We use the support of a frequent pattern as its inference power. However, because there are far fewer transient examples than obligate examples, the imbalance needs to be adjusted for. Three methods of applying association patterns to classify PPI type are designed. In the experiments, the three methods give almost the same results, and we reduce the loss of correct rate caused by the type imbalance.

Keywords: Protein-Protein Interaction, Association Pattern Based Classification, Type Imbalance

1. Introduction

Protein-protein interaction refers to an event generated by a change in physical contact between two or more proteins. A protein-protein interaction occurs when a number of proteins combine into an obligate protein complex or a transient protein complex. An obligate complex continues to maintain its quaternary structure, and its function continues to take effect; a transient complex does not maintain its structure and separates at the end of its function. Protein-protein interaction occurs mainly on the binding surface of the two proteins. The residues on the binding surface play an important role in deciding the type of protein-protein interaction: the residue distribution affects the contacting orientation and thus determines the binding energy, which is important to the interaction type.

In this paper, an association pattern method is proposed for classifying protein-protein interaction type. Instead of considering all the residues on the binding surface of a protein complex as one transaction, we generate several small transactions from a protein complex, where each small transaction contains residues that are geographically close to each other. The binding surface of a protein complex is usually curved, so some residues of one protein lie on a concave-shaped binding site while the residues of the other protein lie on a convex-shaped binding site. A transaction is a tuple <R, L>, where R is a set of residues of one protein and L is a set of residues of the other protein; the residues of a transaction are geographically close to each other. Patterns from obligate protein complexes and from transient protein complexes are mined separately [1].

In this paper, we assume proteins are in complex form. With the association patterns, proteins can be indexed under the patterns, so that biologists can quickly screen proteins that interact with a certain type of interaction.

2. Related Works

The ultimate goal of this work is to let a user input a transient-binding protein and quickly screen out candidate directions for biological experiments from a data library. As for how to predict protein interaction type, researchers have proposed machine learning classification methods to design such systems.

Mintseris et al. classify protein complexes using only the counts of the various atom pairs participating in the interaction between the two proteins, called the Atomic Contact Vector. The accuracy is good, but there are two drawbacks: (1) the feature vector has 171 dimensions, which raises the curse-of-dimensionality problem, and (2) it focuses only on atom contacts and does not consider the shape of the contact surface. The shape of the contact surface of the protein affects the contacting area and the types of atom contact


[2,3].

Besides interaction-type recognition, protein complexes are also a popular field of study in pharmacy. In pharmaceutical drug design research, the main goal is analyzing protein-protein interactions [4,5]. Drug design research tries to find a gap (notch) on a protein; such gaps are the main docking positions for proteins or protein-bound compounds. A protein binding site carries information such as shape, depth and electrical distribution. The binding site is the main location where disease-related activity occurs, and it is the place where a compound and the protein chemically bond with each other. Therefore, when researchers design a drug, they look for existing molecules or synthesize new compounds. When a compound is placed in the protein gap, it must be tried at many different angles by constant rotation, looking for the pose that fills the gap as much as possible and produces a good binding force between the molecules; in this way a compound with the highest degree of matching to the protein's notch can be found. This is called docking.

Protein-protein interaction networks can help us to understand the function of a protein [6,7]. The role of a protein can be determined by basic experiments, but due to the huge amount of protein data we cannot examine the proteins one by one experimentally, so predicting interactions between proteins has become a very important issue [8]. Biologists have proposed using combinations of surface properties to increase the accuracy. The joint surface between two interacting proteins is called a protein domain, and domains are the functional units of the binding surface. Usually a binding surface contains more than one domain, and the surface properties are divided into the following categories: hydrophobicity, electrical property, residue propensity, shape, curvature, and conserved residues [9, 10, 11]. We use the information of residues.

Park et al. also use classification association rules to predict the interaction type. They divide interactions into four categories: Enzyme-inhibitors, Non Enzyme-inhibitors, Hetero-obligomers, and Homo-obligomers [12], with 147 complexes in total. The association rules combine 14 features of the binding surface, including domains and numerical characteristics such as average hydrophobicity, residue propensity, number of amino acids, and number of atoms. The correct rate is about 90 percent. With the same information, other non-association-rule methods have been applied as well: SVM reaches a 99% correct rate, and the k-nearest-neighbor method also reaches about 93% accuracy [13].

Lukman et al. divided interactions into three categories: crystal packing, transient and obligate. With 4,661 transient and 7,985 obligate protein complexes, bipartite graph patterns are mined from the complexes of each category to find the dipeptide patterns (patches) on single binding surfaces; locating these patterns on a joint surface gives good accuracy in deciding whether the proteins interact. Conversely, from a collection of known interacting surfaces, we want to find proteins with similar surfaces: once we know which kinds of surfaces interact with each other, it becomes possible to infer that a relationship exists between two proteins.

3. Data Preparation

The 3D coordinate positions of the proteins are derived from the RCSB Protein Data Bank. For the identification of protein complexes, there are several sources:
1. 209 protein complexes collected by Mintseris and Weng [3].
2. Protein complexes obtained from the PDB web site [14]. The type of these complexes is determined by using the NOXclass website [15]. NOXclass uses SVM to classify protein-protein interactions into three types: biological obligate, biological transient and crystal packing. We keep only the protein complexes which are classified as biological obligate and biological transient. The accuracy rate of NOXclass is claimed to be about 92%, so we use the data classified by NOXclass as genuine data for the experiment. We collected a total of 243 protein complexes in this way.

3.1 The Binding Surface Residues

A protein complex is composed of two or more proteins, where in PDB [14] each protein is called a chain. All the chains in a protein complex share one coordinate system, and for each residue of each chain the coordinates of its more important atoms are labeled sequentially. Because a complex uses a common coordinate system, the relative positions of the chains can be found. The residue positions determine whether two chains bind together, but the residues on the bonding surface are not indicated [16]; it is therefore necessary to determine them by further inquiries to other repositories or by algorithms. There have been many studies on predicting which residues are on the binding surface from protein sequence data [17]. In our research, the atom-to-atom distance between two


residues is used to decide whether the two residues are on the binding surfaces: they are binding-surface residues if there exists an atom-to-atom distance between them which is less than a threshold. The distance threshold is 5 Å in this paper [18,19,20].

[Fig 1. Associative Classification Mining. Flowchart: input PDB data; search residue-set pairs; partition into transactions; delete transactions with support below 1%; get association rules.]

We input a dataset of PDB files [14], then find the residues on the binding site for each complex and get a pair of residue sets, one set on the convex side and the other one on the concave side. Each pair of residue sets is partitioned into transactions for association rule mining. Finally, if a transaction's support value is less than 1%, the transaction is deleted.

[Fig 2. Applying the rules to classify. Flowchart: for each association rule, compute the confidence; delete rules with confidence below 0.6; if the same rule appears with different types, delete the lower-confidence rule.]

For each mined association rule we take its confidence value; if the confidence is less than 0.6, the rule is deleted. Then we check whether the same rule appears with two different types; if so, the rule with the lower confidence is deleted.
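The 5 Å contact criterion can be sketched in a few lines of code. This is only an illustrative sketch, not the authors' implementation; the residue representation (a name plus a list of atom coordinates) and the toy coordinates are assumed for the example.

```python
from itertools import product

THRESHOLD = 5.0  # atom-to-atom distance threshold in Angstroms, as in the paper

def squared_distance(a, b):
    """Squared Euclidean distance between two 3D atom coordinates."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def is_contact(res_a, res_b, threshold=THRESHOLD):
    """Two residues are binding-surface residues if ANY atom-to-atom
    distance between them is less than the threshold."""
    t2 = threshold ** 2
    return any(squared_distance(p, q) < t2
               for p, q in product(res_a["atoms"], res_b["atoms"]))

def binding_surface(chain_a, chain_b, threshold=THRESHOLD):
    """Return the binding-surface residue sets (R, L) of two chains."""
    r, l = set(), set()
    for ra in chain_a:
        for rb in chain_b:
            if is_contact(ra, rb, threshold):
                r.add(ra["name"])
                l.add(rb["name"])
    return r, l

# Toy chains with hypothetical coordinates (Angstroms).
chain_a = [{"name": "ARG1", "atoms": [(0.0, 0.0, 0.0)]},
           {"name": "SER2", "atoms": [(20.0, 0.0, 0.0)]}]
chain_b = [{"name": "PHE1", "atoms": [(3.0, 0.0, 0.0)]}]
print(binding_surface(chain_a, chain_b))  # only ARG1 and PHE1 are within 5 A
```

A real implementation would read the atom coordinates from the PDB files and would typically use a spatial index rather than the all-pairs loop shown here.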

3.2 Obtaining Data

Frequent pattern mining results can be obtained such as: for protein complexes of the same kind, electrically charged residues commonly appear on the concave joint surface together with hydrophilic (or hydrophobic) residues on the convex joint surface. Association rule mining results can be obtained such as: polar residues


on the concave joint surface together with hydrophilic (or hydrophobic) residues on the convex joint surface. Complexes showing such patterns are mostly of the interacting kind. The physical and chemical properties of the amino acids on the binding surface can be used, as well as numerical features such as accessible surface area (ASA); the appropriate numeric features are discretized into intervals and used as items for association rule mining. At this stage, we simply take the basic residue types that appear frequently as the input data for association rule mining.

To form transactions, the residues on the two uneven binding surfaces of a protein complex are projected onto a transverse plane, and concentric rings are drawn around the center of the joint surface, with the radius increasing in steps of 10 Å [21]. Each ring is cut into a number of zones (sectors) of roughly equal area, and the residues in each zone form one transaction. This method puts nearby residues into the same transaction, but its disadvantage is that residues on a zone boundary are rigidly assigned: two similar residues lying on the boundary of two zones may be divided into different transactions.

Another way is to use the tertiary-structure coordinates directly and take each binding-surface residue as a benchmark. For example, on the concave side of the binding site, take one residue r and put the residues close to r together; then the nearby residues on the convex side are added and assigned to the same transaction. The advantage is that residues close to each other are put into one transaction; the drawback is that a residue may appear repeatedly in several transactions.

We then find that the amount of transient data is significantly less than that of obligate data, which biases rule generation and support calculation: the support of biological transient rules is underrated. We therefore adjust the values for biological transient, so that biological obligate and biological transient are treated fairly.

3.3 Associative Classification Rule

Different kinds of protein complexes show different frequency distributions of amino acids on their binding surfaces, which is the reason we believe that association rules can be used as the basis for classification. In addition, the binding interface is formed by a complementary combination of two uneven surfaces. Take the amino acid arg as an example: if, wherever phe and ser appear together on the concave surface of a complex, arg appears on the opposite convex surface 90% of the time, then by association rule mining we can get the rule {phe, ser} → {arg}, with a support of 1.9% and a confidence of 90%. After mining, such rules connecting the amino acid patterns of the two sides are obtained for obligate and for transient (non-binding) protein complexes separately, and they assist in prediction: if one surface of a query interface contains phe and ser and the opposite surface contains arg, the likelihood that the interface is of the corresponding type increases, while matched rules of the other type reduce that likelihood. In addition, the positions of the amino acids affect the applicability of a rule: if phe and ser are very far apart on the surface, the applicability of the rule should be discounted even when arg appears on the other surface; on the contrary, if they are very close, the influence of the rule should be raised. The overall judgment considers the matched rule sets of both types on the joint surface [22,23]. Obtaining the association rules is divided into two steps:
1. Delete unimportant or conflicting association rules.
2. Select the association rules for classifying unknown objects.
The general form of an association rule is X => C, where X is an itemset and C is a category. Our association rules have the form <X, Y> => C, where X and Y are itemsets representing the binding residues on the convex and the concave surface, and C is the interaction category [24].

4. Association Rules Deletion

The training data set (hereinafter referred to as DB) consists of transactions, each attached to a category; in the following, the two categories are denoted P and N. A class association rule (hereinafter referred to as CAR) has the format X => Y, where X is an itemset and Y is a type.

The algorithm in [25] depends on sorting the class association rules by confidence; rules with the same confidence are sorted according to their supports. The class association rules are then selected in the sorted order. In the selection process, the cases in the database that satisfy the condition of the current class association rule (called r) are deleted; the database after the deletion is called DB'. Suppose r is of category P:


the cases of category N that satisfy the condition of r are counted as errors for r, and the errors in DB' are determined by majority judgment: if the majority of the cases in DB' are of class P, the class-N cases in DB' are counted as errors of DB'. So each time a class association rule is picked, it has an associated number of errors, and the selection of class association rules continues until the rule set with the lowest error is obtained. The algorithm is as follows:

1  R = sort(R);
2  for each rule r in R in sequence do
3      temp = empty;
4      for each case d in D do
5          if d satisfies the conditions of r then
6              store d.id in temp and mark r if it correctly classifies d;
       end
7      if r is marked then
8          insert r at the end of C;
9          delete all the cases with the ids in temp from D;
10         select a default class for the current C;
11         compute the total number of errors of C;
12     end
13 end
14 find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;
15 add the default class associated with p to the end of C, and return C (our classifier);

5. Applying Rules for Classification

When classifying an object of unknown category, there are several methods of selecting rules:
1. Confidence sum
2. Higher confidence
3. The number of qualified rules

The methods are as follows:
1. Confidence sum: suppose the object matches rule sets Ri and Rj, where Ri is the set of obligate-type rules and Rj is the set of non-obligate-type rules. Let X be the sum of the confidences of the rules in Ri and Y be the sum of the confidences of the rules in Rj. If X > Y, we surmise the object type is obligate; otherwise we surmise it is transient.
2. Higher confidence: suppose the object matches a rule set R containing both obligate and non-obligate rules. Sort R in descending order of confidence and determine which type is the majority among the top few rules in the set. If the majority type is obligate, we surmise the object type is obligate; otherwise we surmise it is transient.
3. The number of qualified rules: suppose the object matches rule sets Ri and Rj, where Ri is the set of obligate-type rules and Rj is the set of non-obligate-type rules. If the number of rules in Ri is larger than in Rj, we surmise the object is of obligate type; otherwise we surmise it is of transient type.

6. Support Adjustment

When predicting PPI type, we find that no matter which method of calculation is used, the predictions are almost always of obligate type: nearly all rules are obligate rules, and there are almost no transient rules. We judge that the gap between the amounts of obligate and transient data results in a low number of transient rules, which are then more likely to be filtered out, so we adjust numerically for the imbalance, using the following formula:

C(x) = P(x ∩ obligate) * R / (P(x ∩ obligate) * R + P(x ∩ non-obligate))   (1)

where R is the ratio of transient to obligate examples, x is a rule, P(x ∩ obligate) denotes the probability that a PPI is obligate and contains rule x, and P(x ∩ non-obligate) denotes the probability that a PPI is transient and contains rule x.

Table 1. The number of non-obligate rules.

Factor                     1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9
Before support adjustment    2    1    1    1    1    0    0    1    1    2
After support adjustment    11   14   14   18   34   64   95  173  291  449

This table shows the number of transient rules with the factor ranging from 1.0 to 1.9. Before the adjustment the transient rules are rare, and in some cases there are none at all; after the change, as Table 1 shows, the number of transient rules has increased.
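Eq. (1) and the confidence-sum selection method can be read as the following sketch. The counts below are hypothetical and the function names are ours, not the authors'; counts are used in place of probabilities, which is equivalent because the normalization cancels.

```python
def adjusted_confidence(n_obligate_with_x, n_transient_with_x, r):
    """Eq. (1): confidence of rule x toward the obligate class, with the
    obligate term re-weighted by r = (#transient examples / #obligate examples)."""
    weighted = n_obligate_with_x * r
    return weighted / (weighted + n_transient_with_x)

# Hypothetical data set: 200 obligate and 100 transient PPIs (r = 100/200 = 0.5);
# rule x occurs in 20 obligate and 10 transient PPIs, i.e. equally often
# within each class, so after re-weighting the confidence is exactly 0.5.
print(adjusted_confidence(20, 10, 0.5))  # 0.5

def confidence_sum_predict(obligate_confidences, transient_confidences):
    """Selection method 1 (confidence sum): compare the summed confidences of
    the matched obligate rules against the matched non-obligate rules."""
    if sum(obligate_confidences) > sum(transient_confidences):
        return "obligate"
    return "transient"

print(confidence_sum_predict([0.9, 0.7], [0.8]))  # "obligate": 1.6 > 0.8
```

Without the re-weighting factor r, the same counts would give a confidence of 20 / 30 ≈ 0.67 toward obligate, illustrating how the raw measure favors the majority class.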


7. Experiment and Discussion

In Figure 3, we find that the correct rate is low before the data are counterbalanced, because the number of non-obligate rules is much smaller than the number of obligate rules. From Figure 4, the difference between the results of the number-of-qualified-rules method and the confidence-sum method is not large, because the confidence sum tends to grow with the number of matched rules, so the two methods give similar results. The higher-confidence method has a lower accuracy rate at the beginning, because when the factor is relatively small the number of transient rules is small, and judging by a few high-confidence rules easily leads to wrong decisions. As the factor grows, the number of transient rules changes considerably, and taking the high confidences into account before making a judgment increases the accuracy.

[Fig 3. The correct rate of non-obligate data prediction]

[Fig 4. The result of the proposed method]

8. Conclusions

The amount of protein data is enormous, coupled with the uncertainty of environmental variation factors, and it takes a lot of time and money to determine protein-protein interaction in a wet lab. So many experts and scholars have turned to using known information to predict protein interactions, in order to reduce the number of required protein tests. We use a class association rule method for classifying protein-protein interaction type, and we compared several methods for screening the associated rules. Due to type imbalance, where there are many more obligate protein complexes than transient protein complexes, the interestingness measures of the mined rules are distorted; we have designed a method to adjust for this effect. The proposed method can further be used to screen proteins that might have a certain type of protein-protein interaction with a query protein. For biologists, it may take much less time to explore, and it saves experimental effort; even for pharmaceutical research and development it brings many benefits, since determining protein-protein interactions experimentally requires a lot of time and money. If a system can quickly provide a list of candidate subjects, it will be of great help.

References [1] SE Ozbabacan, HB Engin, A Gursoy, O Keskin, “Transient Protein-Protein Interactions,” Protein Engineering, Design & Selection, Vol. 24, No. 9, pp. 635- 48, 2011. [2] Ravi Gupta, Ankush Mittal and Kuldip Singh, “A Time- Series-Based Feature Extraction Approach for Prediction


of Protein Structural Class," EURASIP Journal on Bioinformatics and Systems Biology, pp. 1-7, 2008.
[3] Julian Mintseris and Zhiping Weng, "Atomic Contact Vectors in Protein-Protein Recognition," PROTEINS: Structure, Function, and Genetics, Vol. 53, pp. 629-639, 2003.
[4] S. Grosdidier, J. Fernández-Recio, "Identification of Hot-spot Residues in Protein-protein Interactions by Computational Docking," BMC Bioinformatics, 9:447, 2008.
[5] JR Perkins, I Diboun, BH Dessailly, JG Lees, C Orengo, "Transient Protein-Protein Interactions: Structural, Functional, and Network Properties," Structure, Vol. 18, No. 10, pp. 1233-43, 2010.
[6] Florian Goebels and Dmitrij Frishman, "Prediction of Protein Interaction Types based on Sequence and Network Features," BMC Systems Biology, 7(Suppl 6):S5, 2013.
[7] Huang-Cheng Kuo and Ping-Lin Ong, "Classifying Protein Interaction Type with Associative Patterns," IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 143-147, 2013.
[8] Nurcan Tuncbag, Gozde Kar, Ozlem Keskin, Attila Gursoy and Ruth Nussinov, "A Survey of Available Tools and Web Servers for Analysis of Protein-Protein Interactions and Interfaces," Briefings in Bioinformatics, Vol. 10, No. 3, pp. 217-232, 2009.
[9] R. P. Bahadur and M. Zacharias, "The Interface of Protein-protein Complexes: Analysis of Contacts and Prediction of Interactions," Cellular and Molecular Life Sciences, Vol. 65, pp. 7-8, 2008.
[10] Huang-Cheng Kuo, Ping-Lin Ong, Jung-Chang Lin, Jen-Peng Huang, "Prediction of Protein-Protein Recognition Using Support Vector Machine Based on Feature Vectors," IEEE International Conference on Bioinformatics and Biomedicine, pp. 200-206, 2008.
[11] Huang-Cheng Kuo, Ping-Lin Ong, Jia-Jie Li, Jen-Peng Huang, "Predicting Protein-Protein Recognition Using Feature Vector," International Conference on Intelligent Systems Design and Applications, pp. 45-50, 2008.
[12] M. Altaf-Ul-Amin, H. Tsuji, K. Kurokawa, H. Ashahi, Y. Shinbo, and S. Kanaya, "A Density-periphery Based Graph Clustering Software Developed for Detection of Protein Complexes in Interaction Networks," International Conference on Information and Communication Technology, pp. 37-42, 2007.
[13] Sung Hee Park, José A Reyes, David R Gilbert, Ji Woong Kim and Sangsoo Kim, "Prediction of Protein-protein Interaction Types Using Association Rule based Classification," BMC Bioinformatics, Vol. 10, January 2009.
[14] Protein Data Bank [http://www.rcsb.org/pdb/home/home.do]
[15] NOXclass [http://noxclass.bioinf.mpi-inf.mpg.de/]
[16] Biomolecular Object Network Databank [http://bond.unleashedinformatics.com/]
[17] Chengbang Huang, Faruck Morcos, Simon P. Kanaan, Stefan Wuchty, Danny Z. Chen, and Jesus A. Izaguirre, "Predicting Protein-protein Interactions from Protein Domains Using a Set Cover Approach," IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 4, No. 1, pp. 78-87, 2007.
[18] Frans Coenen and Paul Leng, "The Effect of Threshold Values on Association Rule based Classification Accuracy," Data & Knowledge Engineering, Vol. 60, No. 2, pp. 345-360, 2007.
[19] John Richard Davies, "Statistical Methods for Matching Protein-ligand Binding Sites," Ph.D. Dissertation, School of Mathematics, University of Leeds, 2009.
[20] S. Lukman, K. Sim, J. Li, Y.-P. P. Chen, "Interacting Amino Acid Preferences of 3D Pattern Pairs at the Binding Sites of Transient and Obligate Protein Complexes," Asia-Pacific Bioinformatics Conference, pp. 14-17, 2008.
[21] Huang-Cheng Kuo, Jung-Chang Lin, Ping-Lin Ong, Jen-Peng Huang, "Discovering Amino Acid Patterns on Binding Sites in Protein Complexes," Bioinformation, Vol. 6, No. 1, pp. 10-14, 2011.
[22] Aaron P. Gabow, Sonia M. Leach, William A. Baumgartner, Lawrence E. Hunter and Debra S. Goldberg, "Improving Protein Function Prediction Methods with Integrated Literature Data," BMC Bioinformatics, 9:198, 2008.
[23] Mojdeh Jalali-Heravi, Osmar R. Zaïane, "A Study on Interestingness Measures for Associative Classifiers," ACM Symposium on Applied Computing, pp. 1039-1046, 2010.
[24] Xiaoxin Yin and Jiawei Han, "CPAR: Classification based on Predictive Association Rules," SIAM International Conference on Data Mining, pp. 331-335, 2003.
[25] B. Liu, W. Hsu, Y. Ma, "Integrating Classification and Association Rule Mining," KDD Conference, pp. 80-86, 1998.


Digitalization Boosting Novel Digital Services for Consumers

Kaisa Vehmas1, Mari Ervasti2, Maarit Tihinen3 and Aino Mensonen4

1 VTT Technical Research Centre of Finland Ltd Espoo, PO BOX 1000, 02044, Finland [email protected]

2 VTT Technical Research Centre of Finland Ltd Oulu, PO BOX 1100, 90571, Finland [email protected]

3 VTT Technical Research Centre of Finland Ltd Oulu, PO BOX 1100, 90571, Finland [email protected]

4 VTT Technical Research Centre of Finland Ltd Espoo, PO BOX 1000, 02044, Finland [email protected]

Abstract
Digitalization has changed the world. The digital revolution has promoted the Internet, and more recently mobile network infrastructure, as the technological backbone of our society. Digital technologies have become more integrated across all sectors of our economy and society, and create novel possibilities for economic growth. Today, customers are more and more interested in value-added services, compared to the basic products of the past. The use of novel digital services, including mobile services, has increased both at work and during free time. However, it is important to understand the needs and expectations of the end users and to develop future services with them. This paper focuses on pointing out the importance of user involvement and co-design in digital service development and providing insights on the transformation caused by the digital revolution. Experiences and effects of user involvement and co-design are introduced in detail via two case studies from the traditional retail domain.
Keywords: digital services, digitalization, user involvement, co-design, retail.

1. Introduction

The digital revolution is everywhere and it is continually changing and evolving. Information technology (IT) innovations, such as the Internet, social media, mobile phones and apps, cloud computing, big data, e-commerce, and the consumerization of IT, have already had a transformational effect on products, services, and business processes around the world [1]. In fact, information and communications technology (ICT) is no longer a specific sector, but the foundation of all modern innovative economic systems [2]. Digitalization is one of the successful themes for economic growth: data is often considered as a catalyst for overall economic growth, innovation and digitalization across all economic sectors. For example, in Europe the Big Data sector is growing by 40% per year, seven times faster than the IT market [3].

Digitalization is affecting people's everyday lives, and changing the world. The pervasive nature of technology in consumers' lives also causes a rapid change in the business landscape [4]. The value of the ICT sector's manufacturing and services will increase faster than the world economy on average [5]. Thus, companies have to move their business into digital forms. Business models must change to support and improve new business opportunities, which are created together with the services. In order to build up an excellent digital service that meets the customers' needs, participatory design of the service is inevitable [6].

To be successful, innovative solutions must take into account opportunities provided by new technology, but they cannot lose sight of the users. In practice, companies have understood how important it is to understand the needs and expectations of the end users of the product or service. Users are experts on user experience and thus are a significant source of innovation [7]. Involving different stakeholders in the value chain, from the very start of the development process, increases customer acceptance, gives the developers new development ideas and gives the users the feeling that their voices have been heard. The interaction with the customer is the key issue. That is, keeping customers satisfied in a way that they feel that the service provider listens to them and appreciates their


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

opinions and activities is of major importance. This will make it possible for companies to obtain and keep loyal customers. [8]

This paper focuses on pointing out the importance of user involvement and co-design in digital service development and on providing insights into the transformation caused by the digital revolution. Experiences and effects of user involvement and co-design are introduced in detail via two case studies from the traditional retail domain. The research was done as a part of the large Digital Services (DS) program (http://www.digital-services.fi) facilitated by DIGILE (http://www.digile.fi), one of Finland's Strategic Centers for Science, Technology and Innovation. DIGILE points out that ICT-based digital services are the most important way to provide added value to customers. Thus, DIGILE is focused on promoting the development of digital service know-how for business needs.

The case studies described in this paper, Case A and Case B, are introduced in detail to illustrate user involvement and co-design while developing new digital services for the traditional retail sector. In Case A, novel omnichannel services for the customers were integrated into the retail processes to better serve and meet the needs of the store's rural customers closer to their homes. Customers living in rural areas were known not to have access to the larger selections of the retailer's online stores. The second, Case B, aimed to understand consumer attitudes towards novel digital service points in hypermarkets. Customers were able to test the first version of a novel user interface to be used in digital service points. The case studies emphasized the importance of user involvement and co-design while developing new digital services.

This paper is structured in the following way. In the second chapter, background information about the DIGILE Digital Services program and digitalization in general is given. Also, relevant literature concerning the retail sector and the context of the case studies is introduced. The third chapter presents the research approaches used with the two case studies. In the fourth chapter, the case study findings are introduced and discussed. Finally, in the fifth chapter, the main findings are summarized and concluded.

2. Background

In this chapter, the DS program is introduced with some examples of developed digital services in order to build a complete picture of where the case studies were carried out. After that, an overview of digitalization and its effect on digital services is presented. Finally, digitalization in the retail sector, the environment of our case studies, is introduced in detail.

2.1 Digital Services Boosting the Finnish Economy

In the beginning of the DS program, DIGILE, Finnish industry and the Academy of Finland (http://www.aka.fi/en) together identified four themes which would lead the Finnish economy towards having the best means to reach an advantageous position in the global market for mobile services. The themes were 1) small and medium enterprise (SME) services, 2) financial services, 3) educational services, and 4) wellness services. The main aim of the DS program was to create and begin to implement various digital services, service platforms and technologies which promote new or enhanced services, as well as to ensure maintenance of new services in selected areas. The structure of the DS program is presented in Figure 1.

In the DS program, work was conducted in a true partnership model, meaning that the program provided a pool of complementary platforms where partners shared and trialed their innovations and enabler assets. The mission was accomplished by creating new innovative services in these selected sectors and by recognizing the need for enablers in their context. Ecosystem thinking played a central role during the whole program.

Fig. 1 The structure of the Digital Services program.

The key objectives for the SME services were optimizing service creation tools for SMEs and sharing know-how in service building to enable a new class of service development. The SME theme targeted the creation of a pool of companies implementing key services by or for the SME sector. SMEs supported the whole services ecosystem by utilizing and trialing service platforms offered by the program. A pool of additional platform features was created, and SMEs acted to create new service products for their business. Rapid prototyping and iterative research methods were utilized.

In the case of financial services, the program concentrated on services and service enablers which bring added value to the companies, customers, as well as to consumers, and linked money transactions and financial services smoothly to the ecosystem market. The goal was to introduce mechanisms and platforms for service developers, enabling financial transactions in the services, and to develop safe and flexible trust enablers, as well as cost-


efficient banking and payment tools, for both existing and new innovative mobile services.

The goal of educational services was to increase the utilization and availability of e-learning materials and services in education. It concentrated not only on support for mobile and pervasive learning, but on high-quality services that could be easily integrated into the everyday processes of the ordinary school day. User perspectives played an important role; this was seen as especially important in cases aiming at the global market or developing services for challenged learners.

In the case of wellness services, the aim was to create a wellness ecosystem with common platform access and the capability to develop enablers and tools for integrating different categories of value-adding services and technologies. It was also targeted towards developing components and capabilities for integrating technologies for automatic wellness data collection. Data analysis will be facilitated by developing and enabling the integration of tools for professional wellness data analysis and content delivery.

During 2012-2015, 85 organizations (53 SMEs, 19 large companies and 13 research organizations) in total participated in the DS program. The program exceeded its goals by achieving 27 highlighted new services and 18 features. In addition, three new companies were established. One of the successful examples of results achievement is Personal Radio, which offers consumers new services and personal content from different sources based on recommendation engine technology. Ecosystem thinking was an enabling asset: several companies have been involved in creating the service, e.g., companies taking care of content production and delivery, speech synthesis, audio search, content analysis, the payment system, concept design, user interface, business models, user experience and the mobile radio player. In addition, in the wellness services domain several wellbeing services were developed. For example, novel mobile services to prevent illnesses, such as memory disorders or work-related musculoskeletal disorders, were developed. A novel alternative to traditional marital therapy and coaching methods is now also available on mobile. In this paper, two pilot cases are introduced as examples of digitalization and developing digital services in the retail sector.

2.2 Digitalization Effects on Services and Business

The digital revolution has promoted the Internet, and more recently mobile network infrastructures, as the technological backbone of our society. The Internet and digital technologies have become more integrated across all sectors of our economy and society [9]. Digitalization changes the world and affects people's everyday lives. The huge impact of digitalization and the Internet will be felt in different industrial and associated sectors, for example, 3D printing, smart city services (e.g. lighting control, temperature optimization), predictive maintenance solutions, intelligent logistics, smart factories, etc.

Digitalization also means that businesses make use of electronic information exchange and interactions. For example, digitalization in factories allows end-to-end transparency over the entire manufacturing process, so that individual customer requirements can be implemented profitably and the produced solutions managed throughout their life cycle. Existing businesses can be modernized using new technologies, which potentially also generate entirely new types of businesses. Evans and Annunziata [10] highlight the promise of the Industrial Internet by stating that it is the 3rd innovation wave – after the Industrial Revolution (1st wave) and the Internet Revolution (2nd wave). The growing importance of context-awareness, targeting enriched experience, intuitive communication services and an increasingly mobile society, requires intelligent services that are smart, but invisible to users. Hernesniemi [11] argued that the value of the ICT sector's manufacturing and services will increase faster than the world economy on average. For example, e-commerce is growing rapidly in the EU at an average annual growth rate of 22%, surpassing EUR 200 billion in 2014 and reaching a share of 7% of total retail sales [12].

In the digital economy, products and services are linked more closely to each other. The slow economic growth during recent years has boosted the development of product-related services even more – these services have brought increasing revenue for the manufacturing companies in place of traditional product sales. The global market for product and service consumption is steadily growing. Today consumers are key drivers of technology and change, as new digital tools, e.g., comparison websites, social media, customization of goods and services and mobile shopping, have empowered them [13]. Customers are more and more interested in value-added services compared to the basic products themselves.

Now, companies around the world are not only willing to use digital technologies to obtain transformation—they must [14]. Companies are working towards achieving digital transformation, but most still lack experience with emerging digital technologies and remain skeptical. The key issue is to respond effectively and quickly to newly available technologies in order to gain better customer experiences and engagement, streamlined operations and new lines of business. Accordingly, digital services are a strong global trend in the world: long-term


development is moving the weight of economic value creation from agriculture, to goods, and then to services. The service sector is the fastest growing segment of global economies. Figure 2 illustrates the trend in the USA.

ICT has a remarkable impact on the development of services; ICT enables completely new services, increases the efficiency of service production, enhances the availability of services, and increases the profitability of service business. Kettunen et al. [15] have identified six megatrends in ICT and ICT-enabled services: 1) data-intensiveness, 2) decentralized system architectures, 3) fusion of real and virtual, 4) disappearing (or hidden) human interface, 5) web-based organization of work and life, and 6) increasing need to manage social robustness. For example, the advantage of data-intensiveness is that service providers can automatically collect and analyze a large amount of customer or process data, and also combine it with other data that is available free of charge. This helps service providers to develop their services, e.g., by customizing the services and creating new ones.

Fig. 2 Long term industry trends [16].

The use of mobile services has increased both at work and during free time, and social communities are increasingly formed. People are able and willing to generate content, and the line between business and private domains is increasingly blurred [17]. The idea of everybody having their own personal computer is being reborn and has evolved into everyone having their own personal cloud to store and share their data and to use their own applications [18]. This is driving a power shift away from personal devices toward personal services.

Currently, digital signage is widely used in different environments to deliver information about a wide array of topics with varying content formats. Digital signs are generally used in public spaces, transportation systems, sport stadiums, shopping centers, health care centers, etc. There is also a growing number of indoor digital displays in shopping centers and retail stores. The underlying goal of these displays is to provide better service for customers and to promote sales. From a retail perspective, these displays can be seen as one of the many channels that aim to capture people's attention and affect their customer behavior.

2.3 Digitalization of the Retail Sector

The retail sector is considered one of the most rapidly technology-adoptive sectors (e.g., [19]). Over the years, retailers have learned how to design their stores to better meet shoppers' needs and to drive sales. In addition, the technical infrastructure that supports most retail stores has grown enormously [20]. The retail industry has evolved from traditional physical stores, through the emergence of electronic commerce, into a combination of physical and digital channels. Seeing the future of retailing is quite complex and challenging; busy customers expect companies to use innovative approaches to facilitate their shopping process efficiently and economically, along with providing value-added shopping experiences. People no longer go shopping only when they need something: the experience of shopping is becoming more important [21].

There are a number of challenges and opportunities retailers face on their long-term radar, such as changes in consumer behavior and consumer digitalization. These drivers affecting the retail sector should be a key consideration for retailers of all shapes and sizes [22].

It is likely that the power of the consumer will continue to grow [23], and from the demand side, consumers will be empowered to direct the way in which the revolution will unfold [24]. The focus of buying behavior is changing from products to services [25]. Thus, established retailers will need to start considering how they can more effectively integrate their online and off-line channels to provide customers with the very highest levels of service.

It is now widely recognized that the Internet's power, scope and interactivity provide retailers with the potential to transform their customers' shopping experiences, and in so doing, strengthen their own competitive positions [26]. Frost & Sullivan [27, 28] predicts that by 2025, nearly 20% of retail will happen through online channels, with global online retail sales reaching $4.3 trillion. Thus, retailers are facing digitalization of the touch-point and consumer needs [29]. By 2025, 80 billion devices will connect the world, with each person carrying five connected devices [30]. Mobile and online information technology make consumers more and more flexible in terms of where and how they wish to access retailer information and where and how to purchase products. Consumer behavior is changing as a growing number of smarter, digitally-connected, price-conscious consumers


exploit multiple shopping channels, thus making the multichannel retail approach an established shopping behavior [31]. Described as channel agnostic, modern consumers do not care whether they buy online, via mobile or in-store as long as they get the product they want, when they want it at the right price. A new behavior of test-and-buy-elsewhere is becoming more common [32] and retailers must adapt to the buying behavior of these "channel-hoppers" [33]. Aubrey and Judge [34] talk about 'digital natives' who are highly literate in all things digital, and their adoption of technology is easy and distinctive.

However, simply "adding digital" is not the answer for retailers – yet that is an approach too often taken [35]. For traditional retailers to survive, they must pursue a strategy of an integrated sales experience that blends online and in-store experiences seamlessly, leading to the merger of a web store and a physical store [36]. According to Frost & Sullivan [37], the retail model will evolve from a single/multiple channel model to an integrated hybrid cross-channel model, identified as bricks and clicks. Thus, shoppers of the future float seamlessly across mobile, online and real-world platforms [38].

Adoption of both online and physical channels, to sell simultaneously through multiple marketing channels, is referred to as multichannel retailing [39]. Today, in an ever digitizing world the line between channels is fading as the different channels are no longer separate and alternative means for delivering shopping services, but customers increasingly use them as complements to each other, or even simultaneously. Hence, the term multichannel is not enough to describe this phenomenon, and instead the new concept of omnichannel is adopted [40]. Omnichannel is defined as "an integrated sales experience that melds the advantages of physical stores with the information-rich experience of online shopping". The customers connect and use the offered channels as best fits their shopping process, creating their unique combinations of using different complementary and alternative channels. In an omnichannel solution the customer has a possibility to seamlessly move between channels which are designed to support this "channel-hopping".

Payne and Frow [41] examined how multichannel integration affects customer relationship management and stated that it is essential to integrate channels to create positive customer experiences. They pointed out how a seamless and consistent customer experience creates trust and leads to stronger customer relationships as long as the experience occurs both within channels and between them. Technology-savvy consumers expect pre-sales information, during-sales services and after-sales support through a channel customized to their convenience [42]. All these needs and requirements must come together as a unified, holistic solution, and retailers should be able to exploit the channel-specific capabilities in a meaningful way [43].

3. Developing New Digital Services

In this chapter the research approaches for our case studies are introduced in detail. The case studies emphasize user involvement and co-design while developing new digital services.

3.1 Case Study A

In this research context the retailer wanted to integrate novel services and adapt retail processes to better serve and meet the needs of the store's rural customers closer to their homes. Customers living in rural areas were known not to have access to the retailer's online store's larger selection. In the development of the novel service concept, the utilization of Internet possibilities and the importance of sales persons guiding and socializing alongside the customers at the physical store were also emphasized [44].

Case study A was conducted in the context of developing and piloting a novel omnichannel service concept for a Finnish retail chain (described in more detail in [45]). A starting point for the new service was a need to provide a wider selection of goods for the customers of a small, distant rural store. The store is owned by a large national co-operative retail chain.

The service concept was based on the idea of providing customers with the selection available in large stores by integrating an e-commerce solution within the service of a rural store. This was practically done by integrating the service provider's digital web store to the service processes of the small brick-and-mortar store. Burke [46] suggests that retailers who want to web-enable their store should optimize the interface to the in-store environment instead of just providing web access. Thus, one of the driving design principles of our case study was to achieve a seamless retail experience by a fusion of web and physical retail channels. The novelty of the service concept was in how it was integrated to the service processes of a physical store, i.e., how the different channels were used together to create a retail experience that was as seamless as possible.

A co-design process was used in the service design. The build-and-evaluate design cycle involved a small group of researchers and the employees of the retail company. The researchers were active actors in the design process, participating in the service concept design and facilitating co-design activities. Technical experts of the retail


organization were involved in the specification of how the developed solution would best integrate with the existing infrastructures of the organization, and how the new solutions related to the strategic development agenda of other related omnichannel solutions. Retail experts were involved in designing the customer journey, the tasks of the staff, the service solution's visual and content design, and the internal and external communication required by the service.

The pilot study was conducted in a small rural store that was part of the service provider's retail chain, located in the city of Kolari (www.kolari.fi) in northern Finland, with a population of 3,836. The customers visiting the physical store could access a selection of goods otherwise not available, through a web store interface. The study was launched with a goal of eventually scaling up the digital retail service concept to other small rural stores of the service provider. The retail service included a touch screen customer terminal located inside the physical store (see Figure 3).

Fig. 3 The digital retail service inside the store.

The customers could use the terminal for browsing, comparing and ordering goods from the retail provider's web store selections. The two web stores accessible through the customer terminal already existed and were available to any customer with an Internet connection. In addition, the retailer piloted the marketing and selling of their own campaign products through a new web store interface available on the customer terminal. The customers could decide whether they wanted their product order delivered to a store (the delivery was then free of charge) or directly to their home. After placing the order, the customer paid for the order at a cash register at the store alongside their other purchases. The customer terminal was also accompanied by a large information screen (located on the wall above the terminal) that advertised the new retail service concept and directed the customers in its use.

3.1.1 Research Approach of Case Study A

The focus of the research was on more closely investigating and analyzing the customer and personnel service experience and deriving design implications from the gained information for developing and improving the service concept further. The user experience data collection methods and the number of stakeholders for each method are listed in Table 1.

The research study was focused on the two main user groups: store customers and personnel. Altogether 35 store customers were interviewed, and of these, 10 also experimented with the service hands-on by going through the controlled usability testing. The ages of the study participants among the customers varied from 21 to 73 years. Altogether six members of the store personnel participated in the interviews.

Table 1. Summary of the data collection methods and number of participants for each method.

Data collection method                | Number of participants
Interviews with store customers       | 35 store customers
Usability testing                     | 10 store customers
Paper questionnaires                  | 10 returned questionnaires
Group interviews with store personnel | 6 members of store personnel
Phone calls                           | 1 store superior
Automatic behaviour tracking          | ~484 service users

A set of complementary research methods was used to monitor and analyze the retail experience. Interviews were utilized as the primary research method, accompanied by in-situ observation at the store and a questionnaire delivered to customers. These qualitative research methods were complemented with quantitative data obtained through a customer depth-sensor tracking system installed inside the store. Interviews were utilized to research customer and personnel attitudes and expectations towards the novel service concept, motivations for service adoption and usage, their service experiences, and ideas for service improvement. Two types of structured interviews were done with the customers: a) a general interview directed at all store customers, and b) an interview focusing on the usability aspects of the service (done in the context of the usability testing). Usability testing, accompanied by observations, was conducted to gain insights into the ways customers used the service.

Paper questionnaires were distributed to the customers who had ordered goods through the service, with a focus on gathering data on their experiences with the service ordering process. Also, a people tracking system based on


depth sensor was used to automatically observe the customers. The special focus of the people-tracking was to better understand the in-store customer behavior, and to collect more detailed data on the number of customers using the service through the customer terminal, and on the duration and timing of the service use.

3.2 Case Study B

In Case B, the retailer's goal was to improve customer service in the consumer goods trade by implementing novel digital service points in the stores. Generally, using these displays customers were able to browse the selection of consumer goods and search for detailed information about the products. On the displays, customers were also able to see advertisements and campaign products available at the store. Customer displays help consumers in making purchase decisions by providing guides and a selection assistant. In addition to that, customers can get help in finding certain products in the bigger store by utilizing a map service. It has also been planned that customers could use the displays to find a wider selection of consumer goods from the online shop and place the online order in the store.

This case study aimed to understand consumer attitudes towards digital service points in Prisma hypermarkets. The research was divided into three tasks:

1. Digital service points as a service concept
2. Type and location of the digital service point in the store
3. Online shopping in the store.

In the study, customers were able to test the first version of the novel user interface to be used in digital service points in stores and compare different screens. The goal was to gather information about customer experience, their expectations and needs, and also ideas of how to develop the user interface further. The test setup of the study is presented in Figure 4. The novel user interface was tested with the big touch screen (on the right). The other touch screens were used for testing with the web store content just to get an idea about the usability of the screen.

3.2.1 Research Approach of Case Study B

The work was carried out in a laboratory environment, not in a real hypermarket. Consumers were invited to participate in a personal interview where their attitudes towards customer displays were clarified. The interviews were divided into two phases. First, the background information and purchase behavior were discussed and the novel digital service concept was presented to the customers. In the second phase they were able to test the proof-of-concept version of the user interface and compare different types of devices (two bigger touch screens and one tablet device). Consumers were able to freely comment on their experience and they were also interviewed after testing the novel service prototype.

Fig. 4 Test setup in the study.

4. Results and Findings from the Case Studies

In this chapter the main results and findings of the case studies are presented in detail to introduce user involvement in the development process.

4.1 Results of Case Study A

The findings from Case Study A are analyzed from the viewpoint of two end-user groups, namely the rural store customers and personnel.

4.1.1 Store Customers

Altogether, 35 customers of the store were asked about their attitudes, expectations, and experiences related to the novel retail service concept.

Interviews and paper questionnaires. When asked whether or not the customers were likely to use the novel retail service on a scale of 1-5 (where 1 = not likely to use, 5 = likely to use), the average was 2.6, with 16 interviewees responding not likely to use the service and 19 interviewees responding likely to use the service.

Regarding those 16 customers stating they were not likely to use the novel retail service, the age distribution was large, as this customer group consisted of persons aged between 21 and 73 years, the average age being 44 years. The gender distribution was very even: 10 men vs. 9 women (some respondents were couples who answered the researchers' questions together as one household). Except for one person, all the interviewees said they quite regularly visited the nearest (over 150 kilometers away) bigger cities for shopping purposes. Of the 16 respondents, 13 had either no or only little experience with online shopping. This


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No. 16, July 2015, ISSN: 2322-5157, www.ACSIJ.org

customer group gave the following reasons for not being so eager to adopt the novel retail service (direct quotes translated from Finnish):

"I do not need this kind of a service."
"Everyone has an Internet connection at home. It is easier to order [products] from home."
"Might be good for someone else…"

On the other hand, 19 respondents stated that they were likely to use the retail service in the future. Also in this customer group the gender distribution was very even, as the respondents consisted of 10 men vs. 11 women. The age distribution was similarly diverse, from 29 to 72 years, the average age being 51 years. In addition, everyone regularly made shopping journeys to the closest bigger cities. In this customer group, 11 respondents had some experience with online shopping, with six respondents stating they often ordered products online. These customers justified their interest towards the novel retail service in the following ways (direct quotes translated from Finnish):

"Everything [new services] that comes needs to be utilized so that the services also stay here [in Kolari]."
"We do not have much [product] selection here."
"Really good… No need to visit [bigger cities] if we do not have other businesses/chores there."
"Sounds quite nice… If there would be some product offers."
"If there [in the digital retail service] would be some specific product that I would need, then I could use this."

To conclude, age or gender did not seem to have an effect on the store customers' willingness to use the retail service. Neither did the shopping journeys to bigger cities influence the willingness for service adoption, as most of the customers made these shopping journeys regularly. However, previous experience with online shopping appeared to have a direct effect on the customers' willingness to use the retail service. If the customer had no, or only little, previous experience with ordering products from web stores, the person in question often also responded not likely to adopt the retail service. However, if the customer was experienced with online shopping, they had a more positive attitude and greater willingness to use the novel retail service.

Paper questionnaires were distributed to the customers who had ordered products through the retail service (either at home or through the store's customer terminal), with the goal of researching customers' experiences with the ordering process. These customers identified the most positive aspects of the service as the following: 1) wider product selection, 2) unhurried [order process], 3) easy to compare the products and their prices, 4) fast delivery, and 5) free delivery.

Usability testing. A total of ten store customers participated in the usability testing. The customers were directed to go through a set of predetermined tasks with the retail service interface, and they were asked to "think aloud" and give any comments, feedback and thoughts that came to their mind during the interaction with the service. Their task performance was observed by the researchers and notes were taken during the customer's experimentation with the service. The tasks included 1) browsing the product selections available through the web stores, 2) looking for more detailed product information, and 3) ordering a product from two different web stores.

The biggest difficulty the customers encountered was related to the touch-based interaction with the service terminal. The terminal's touch screen appeared not to be sensitive enough, resulting in six out of ten customers experiencing difficulties in interacting with the touch screen. In addition, it was not immediately clear to the customers that the terminal indeed was a touch screen, as six customers hesitated at first and asked aloud whether the terminal had a touch screen: "Do I need to touch this? / Should I touch this?"

However, interestingly, four customers out of ten changed their initial answer regarding their willingness to use the service (asked before actually experimenting with the service UI) in a more positive direction after having hands-on experience with the service. Thus, after usability testing, the average rose slightly from the initial 2.6 to 2.7 (on a scale of 1-5). None of the customers participating in the usability testing changed their response in a negative direction. Other valuable usability findings included observations on the font size of the service UI, insufficient service feedback to the customer, and an unclear customer journey path.

Automatic tracking of store customers' behaviors. A depth sensor-based system was used for detecting and tracking objects (in this case people) in the scene, i.e., inside the physical store. Depth sensors are unobtrusive, and as they do not provide actual photographic information, any potential privacy issues can be more easily handled. The sensor was positioned so that it could observe the customer traffic at the store's entrance hall where the service terminal was positioned. The sensor implementation is described in more detail in [47].

The purpose of the depth sensor tracking was to better understand in-store customer behavior, and to gather more detailed data on 1) the number of customers using the service terminal, and 2) the duration of the service use. The data was recorded during a total of 64 days. Most of those days contain tracking information from all the hours the store was open. Some


hours are missing due to the instability of the people-tracking software. From the recorded data, all those store customers that came into the near range of the service set-up were analyzed. The real-world position of the customers using the service terminal was mapped to the people-tracker coordinates, and all the customers that had come into a 30 cm radius of the user position and stayed still for more than three seconds were accepted. The radius from the user position was kept relatively small in order to minimize the distortion of data resulting from confusing the users of the slot machine with service terminal users.

The results show that most of the users used the service for a relatively short time. On average, 0.54 store customers per hour used the service terminal. It is reasonable to assume that, most likely, a proper usage of the service system would take more than 120 seconds. The shorter the usage period, the less serious or determined the user session has been. The average usage period was 58.4 seconds. Thus, the service usage appeared quite short-term, indicating that in most cases the service usage was not so "goal-directed", but more like sessions where store customers briefly familiarized themselves with the novel service. During the hours the store was open, from 7am to 9pm, there were on average 7.56 service users per day. Within the week, Saturday and Sunday attracted the most service users, and the busiest times were at 1-2pm and 6-7pm.

4.1.2 Store Personnel

The goal of the group interviews was to investigate store personnel attitudes and expectations towards the novel service concept, and ideas for service improvement and further development. In addition, the store superior was contacted every other week with a phone call for the purpose of enquiring about the in-store service experiences, both from the viewpoint of the store customers and the personnel.

Group interviews and phone calls. Two group interviews with six members of the store personnel were carried out at the same time as the personnel were introduced to and familiarized with the service concept, alongside their new service-related work tasks. In general, the attitudes of the store personnel towards the novel service appeared enthusiastic and positive. Naturally, the novel service also invoked some doubts, mostly related to its effect on personnel workload, the clearness and learnability of the order processes, and the formation of the new routines related to the service adoption that would streamline their new work duties and thus ease their workload.

In addition, the following comments illustrate the general thoughts of the store personnel and their expectations regarding the service:

"This is [indeed a useful service], since we have these long distances [to bigger cities]. Now a customer can buy the washing machine from us."
"Adding more services should always be a positive thing."
"More services also always mean more customers."
"When we get our own routines and set-up for this, I'm certain this will succeed!"
"…Should have distribution of [personnel's] work with this."

During the first two months of the case study, inquiry calls were made every two weeks to the store superior in order to keep records and obtain information regarding the progress of the service adoption at the store, in addition to possible problems encountered from the viewpoint of both the customers and the personnel. In general, the novel retail service appeared to have been quickly and well integrated into the personnel's work processes.

4.2 Results of Case Study B

The target group of Case Study B comprised a working-age population. Altogether 17 people were interviewed (8 women and 9 men). The age distribution varied between 27 and 63 years. Most of the interviewees (88%) lived in the Helsinki metropolitan area. Over half of the interviewees (62%) commonly used the retailer's stores to buy daily consumer goods. The most remarkable factor affecting the selection of the store was location. Also, the selection of goods, quality of the products, price level, bonus system and other services besides the store location were important criteria for consumers' choice of which store to go to.

Consumers are mainly confident with the selection of goods in the retailer's stores. According to the customers, it is usually easy to find the different products in smaller and familiar stores. In unfamiliar bigger hypermarkets it is sometimes a real challenge. If some product is not available in the store, the customer usually goes to some other store to buy it. Most of the interviewees (71%) also shop online, an average of 1-2 times a month, and they usually buy clothes and electronics. Online shopping is liked mainly because of cheaper prices, a wider selection of consumer goods, and because it is easy and fast.

Generally, customers (88%) liked the idea of novel digital service points in the Prisma stores. They felt that the customer displays sped up getting useful information and made shopping in the stores more effective. According to the interviewees, the most important services were the map service and product information. Especially in bigger hypermarkets, and if the customers are not familiar with


the store, it is sometimes challenging to find certain products. The map service could include additional information about the location of the department and the shelf where the product can be found. In the hypermarkets there are usually limited possibilities to offer detailed information about the products. With this novel digital service customers are willing to get more information about the products and to compare them.

The proof of concept version of the novel user interface received positive feedback; customers thought it was clear, simple and easy to use. They also felt that it was something new and different compared to traditional web sites. It was pointed out that there is too much content, e.g., in Prisma's web store, to be flipped through in the hypermarket. It is important to keep the content and layout of the novel user interface simple.

People are more willing to do online shopping at home. Still, online shopping in the store was not totally rejected, and interviewees found several circumstances in which they could utilize it. For example, it could be easy to do online shopping at the same time as other shopping in Prisma stores. If some certain product is no longer available in the store, customers could buy it online in the store, especially sale products.

According to the customers, there should be several digital service points in Prisma stores, as customers are not willing to queue up for their turn. The service points should be located next to the entrance and also in the departments, next to the consumer goods. The service point should be a peaceful place where customers have enough privacy to concentrate on finding information and shopping. Still, there should be something interesting on the screen, something that attracts the customers. The youngest interviewees commented that they would like to go to test the new device and find out what it is, whereas the eldest interviewees said that they would like to know beforehand what they could get from the new service. The screen has to be big enough and of good quality. Interviewees thought that the touch screen was modern. Using a tablet as a screen in a digital service point was seen as too small.

In addition, the retailer got some new ideas for developing customer service in the stores. For example, some of the interviewees suggested a mobile application for customer service and the map service, which could also be used as a news channel. Customers could also create personalized shopping lists with it. Authentication to the digital service point could be carried out with fidelity cards in order to receive personalized news and advertisements and to speed up the service.

5. Conclusions

Today the Internet and digital technologies are becoming more and more integrated across all sectors of our economy and society. Digitalization is everywhere; it is changing the world and our everyday lives. Digital services provide new or enhanced services to customers and end users. In the DIGILE Digital Services program, 85 Finnish partners innovated and developed novel digital services in 2012-2015 by recognizing the need for enablers in their context. Work was conducted in a true partnership model, in close co-operation between research organizations and companies. During the whole program, ecosystem thinking had a big role in innovating and developing the solutions. The program exceeded its goals by achieving 27 highlighted new services and 18 features. In addition, three new companies were established as a result of ecosystem thinking, and companies shared and innovated together new or enhanced digital services.

In many cases in the DS program the role of consumers and stakeholders was remarkable in the development process. Narratives, illustrations and prototypes enhanced the co-development of new solutions from ideas through trials and evaluations to working prototypes. There is a wide range of tools, methods and knowledge available for demonstrating ideas and opportunities enabled by emerging technologies, and for facilitating co-innovation processes. In this program, novel game-like tools were developed to easily involve different groups of people in the development process. The tools support efficient and agile development and evaluation for identifying viable concepts in collaboration between experts and users.

In this paper, two retail case studies were presented in detail. Case Study A was conducted in the context of developing and piloting a novel omnichannel service concept in a distant rural store. Case Study B concentrated on novel digital service points in hypermarkets. The need for these kinds of novel digital services has different starting points; in Kolari the selection of goods in the local store is limited and the distance to bigger cities and stores is long. In the Helsinki area, the selection of stores and goods is huge and people are also more experienced in shopping online. Still, in both cases, people expect higher-quality customer service, e.g., in terms of a value-added shopping experience, easier shopping and a wider selection of goods. In both case studies customers stated they were likely to use the novel digital retail services in the future. The behavior of the consumer has changed due to digitalization, and this change must be taken into consideration when developing services for the customers.


Integrating novel digitally-enabled retail services with a physical store requires innovative thinking from retailers. Customers are interested in having these types of novel digital services in the stores; they feel them to be modern and forward-looking options. Most of the customers see that the digital service points make shopping more effective, and they expect that they will get useful information faster compared to the current situation in the hypermarkets.

These retail-related case studies were implemented in order to better understand the challenges and opportunities in this domain. Based on these studies, the most important issues for retailers to take into account when implementing digital services in the stores are:

• Keep it simple. Keeping the layout and user interface clear and easy makes it possible to serve all the user groups digitally.

• Central location. Digital service points should be situated in noticeable positions in the stores. Customers do not want to search for the service points or queue up for using the service. Enlarging and clarifying the instructional and information texts is part of this issue. Also, elements in the graphical user interface must be considered carefully to arouse customer interest when they are passing by.

• Adding more privacy. Despite the central location, privacy issues have to be taken into consideration when implementing the digital service point in the store. The service points should offer an undisturbed place for searching for information and shopping.

• High quality screens. Customers are nowadays experienced in using different kinds of screens. A touch screen was felt to be modern. The screens have to be of high quality to ensure the smoothness of interaction between the customer and the service terminal user interface.

• Going mobile. Customers were also asking for mobile services to be included in this novel digital service. This could bring the retailers an unlimited number of possibilities to offer their services also outside the stores.

Digital service points are one option for offering digital services to retail customers. Still, others, e.g., mobile services, have unlimited possibilities to create added value for the customers. In this type of development work, when something new is developed for the consumer, it is essential to involve real customers from the beginning of the planning and development process. Customers are the best experts in user experience. In this study consumers were involved from a very early stage to co-innovate and co-develop the retail services. It has been noticed that active user involvement in the early stage of a development process increases the quality of the novel service, and user involvement leads to better user acceptance and commitment.

As introduced in this paper, user involvement and co-design have a central and very important role when developing novel digital services for customers. In fact, feedback and opinions of end users can significantly improve or change the final results. The DS program facilitated the development of novel digital services by providing an ecosystem where companies could share and pilot their innovations. This kind of ecosystem thinking proved very useful and productive.

Acknowledgments

This work was supported by Tekes – the Finnish Funding Agency for Innovation (http://www.tekes.fi/en) and DIGILE. We also want to thank all the tens of participants we have worked with in this program.

References
[1] Bojanova, I. (2014). "The Digital Revolution: What's on the Horizon?" IT Pro, January/February 2014, 12 p. IEEE.
[2] European Commission (2015a). "A Digital Single Market Strategy for Europe." 20 p. Available: http://ec.europa.eu/priorities/digital-single-market/docs/dsm-communication_en.pdf [29.6.2015].
[3] European Commission (2015b). "A Digital Single Market Strategy for Europe – Analysis and Evidence." Staff Working Document. 109 p. Available: http://ec.europa.eu/priorities/digital-single-market/docs/dsm-swd_en.pdf [29.6.2015].
[4] Fitzgerald, M., Kruschwitz, N., Bonnet, D. & Welch, M. (2013). "Embracing Digital Technology. A New Strategic Imperative." MIT Sloan Management Review. Available: http://sloanreview.mit.edu/projects/embracing-digital-technology/ [4.6.2015].
[5] Hernesniemi, H. (editor), (2010). "Digitaalinen Suomi 2020. Älykäs tie menestykseen." Teknologiateollisuus ry. 167 p.
[6] Mensonen, A., Grenman, K., Seisto, A. & Vehmas, K. (2015). "Novel services for publishing sector through co-creation with users." Journal of Print and Media Technology Research, 3(2014)4, pp. 277-285.
[7] Thomke, S. & von Hippel, E. (2002). "Customers as innovators: a new way to create value." Harvard Business Review, 80(4), pp. 74-81.
[8] Mensonen, A., Laine, J., & Seisto, A. (2012). "Brand experience as a tool for brand communication in multiple channels." Advances in Printing and Media Technology. IARIGAI Conference, Ljubljana, 9-12 September 2012.
[9] European Commission (2015a). "A Digital Single Market Strategy for Europe." 20 p. Available: http://ec.europa.eu/priorities/digital-single-market/docs/dsm-communication_en.pdf [29.6.2015].


[10] Evans, P. C., & Annunziata, M. (2012). "Industrial Internet: Pushing the Boundaries of Minds and Machines." GE Reports. Available: http://www.ge.com/docs/chapters/Industrial_Internet.pdf [18.5.2015].
[11] Hernesniemi, H. (editor), (2010). "Digitaalinen Suomi 2020. Älykäs tie menestykseen." Teknologiateollisuus ry. 167 p.
[12] European Commission (2015b). "A Digital Single Market Strategy for Europe – Analysis and Evidence." Staff Working Document. 109 p. Available: http://ec.europa.eu/priorities/digital-single-market/docs/dsm-swd_en.pdf [29.6.2015].
[13] European Commission (2015b). "A Digital Single Market Strategy for Europe – Analysis and Evidence." Staff Working Document. 109 p. Available: http://ec.europa.eu/priorities/digital-single-market/docs/dsm-swd_en.pdf [29.6.2015].
[14] Fitzgerald, M., Kruschwitz, N., Bonnet, D. & Welch, M. (2013). "Embracing Digital Technology. A New Strategic Imperative." MIT Sloan Management Review. Available: http://sloanreview.mit.edu/projects/embracing-digital-technology/ [4.6.2015].
[15] Kettunen, J. (editor), Vähä, P., Kaarela, I., Halonen, M., Salkari, I., Heikkinen, M. & Kokkala, M. (2012). "Services for Europe. Strategic research agenda and implementation action plan for services." Espoo, VTT. 83 p. VTT Visions; 1.
[16] Spohrer, J.C. (2011). Presentation: SSME+D (for Design) Evolving: "Update on Service Science Progress & Directions." RIT Service Innovation Event, Rochester, NY, USA, April 14th, 2011.
[17] Kettunen, J. (editor), Vähä, P., Kaarela, I., Halonen, M., Salkari, I., Heikkinen, M. & Kokkala, M. (2012). "Services for Europe. Strategic research agenda and implementation action plan for services." Espoo, VTT. 83 p. VTT Visions; 1.
[18] Bojanova, I. (2014). "The Digital Revolution: What's on the Horizon?" IT Pro, January/February 2014, 12 p. IEEE.
[19] Ahmed, N. (2012). "Retail Industry Adopting Change." Degree Thesis, International Business, Arcada - Nylands svenska yrkeshögskola.
[20] GS1 MobileCom. (2010). "Mobile in Retail – Getting your retail environment ready for mobile." Brussels, Belgium: A GS1 MobileCom White Paper.
[21] Gehring, S., Löchtefeld, M., Magerkurth, C., Nurmi, P. & Michahelles, F. (2011). "Workshop on Mobile Interaction in Retail Environments (MIRE)." In MobileHCI 2011, Aug 30-Sept 2 (pp. 729-731). New York, NY, USA: ACM Press.
[22] Reinartz, W., Dellaert, B., Krafft, M., Kumar, V. & Varadajaran, R. (2011). "Retailing Innovations in a Globalizing Retail Market Environment." Journal of Retailing. 87(1), pp. S53-S66. DOI: http://dx.doi.org/10.1016/j.jretai.2011.04.009.
[23] Aubrey, C. & Judge, D. (2012). "Re-imagine retail: Why store innovation is key to brand growth in the 'new normal', digitally-connected and transparent world." Journal of Brand Strategy, April-June 2012, 1(1), pp. 31-39. DOI: http://henrystewart.metapress.com/link.asp?id=b05460245m4040q7.
[24] Doherty, N.F. & Ellis-Chadwick, F. (2010). "Internet Retailing; the past, the present and the future." International Journal of Retail & Distribution Management, Emerald. 38(11/12), pp. 943-965. DOI: 10.1108/09590551011086000.
[25] Marjanen, H. (2010). "Kauppa seuraa kuluttajan katsetta." (Eds. Taru Suhonen). Mercurius: Turun kauppakorkeakoulun sidosryhmälehti (04/2010).
[26] Doherty, N.F. & Ellis-Chadwick, F. (2010). "Internet Retailing; the past, the present and the future." International Journal of Retail & Distribution Management, Emerald. 38(11/12), pp. 943-965. DOI: 10.1108/09590551011086000.
[27] Frost & Sullivan (2012). "Bricks and Clicks: The Next Generation of Retailing: Impact of Connectivity and Convergence on the Retail Sector." Eds. Singh, S., Amarnath, A. & Vidyasekar, A.
[28] Frost & Sullivan (2013). "Delivering to Future Cities – Mega Trends Driving Urban Logistics." Frost & Sullivan: Market Insight.
[29] Reinartz, W., Dellaert, B., Krafft, M., Kumar, V. & Varadajaran, R. (2011). "Retailing Innovations in a Globalizing Retail Market Environment." Journal of Retailing. 87(1), pp. S53-S66. DOI: http://dx.doi.org/10.1016/j.jretai.2011.04.009.
[30] Frost & Sullivan (2012). "Bricks and Clicks: The Next Generation of Retailing: Impact of Connectivity and Convergence on the Retail Sector." Eds. Singh, S., Amarnath, A. & Vidyasekar, A.
[31] Aubrey, C. & Judge, D. (2012). "Re-imagine retail: Why store innovation is key to brand growth in the 'new normal', digitally-connected and transparent world." Journal of Brand Strategy, April-June 2012, 1(1), pp. 31-39. DOI: http://henrystewart.metapress.com/link.asp?id=b05460245m4040q7.
[32] Anderson, H., Zinser, R., Prettyman, R. & Egge, L. (2013). "In-Store Digital Retail: The Quest for Omnichannel." Insights 2013, Research and Insights at SapientNitro. Available: http://www.slideshare.net/hildinganderson/sapientnitro-insights-2013-annual-trend-report [1.7.2015].
[33] Ahlert, D., Blut, M. & Evanschitzky, H. (2010). "Current Status and Future Evolution of Retail Formats." In Krafft, M. & Mantrala, M.K. (Eds.), Retailing in the 21st Century: Current and Future Trends (pp. 289-308). Heidelberg, Germany: Springer-Verlag.
[34] Aubrey, C. & Judge, D. (2012). "Re-imagine retail: Why store innovation is key to brand growth in the 'new normal', digitally-connected and transparent world." Journal of Brand Strategy, April-June 2012, 1(1), pp. 31-39. DOI: http://henrystewart.metapress.com/link.asp?id=b05460245m4040q7.
[35] Anderson, H., Zinser, R., Prettyman, R. & Egge, L. (2013). "In-Store Digital Retail: The Quest for Omnichannel." Insights 2013, Research and Insights at SapientNitro. Available: http://www.slideshare.net/hildinganderson/sapientnitro-insights-2013-annual-trend-report [1.7.2015].


[36] Maestro (2012). "Kaupan alan trendikartoitus 2013: Hyvästit itsepalvelulle – älykauppa tuo asiakaspalvelun takaisin." Available: http://www.epressi.com/tiedotteet/mainonta/kaupan-alan-trendikartoitus-2013-hyvastit-itsepalvelulle-alykauppa-tuo-asiakaspalvelun-takaisin.html?p328=2 [20.2.2014].
[37] Frost & Sullivan (2012). "Bricks and Clicks: The Next Generation of Retailing: Impact of Connectivity and Convergence on the Retail Sector." Eds. Singh, S., Amarnath, A. & Vidyasekar, A.
[38] PSFK. (2012). "The Future of Retail." New York, NY, USA: PSFK Labs.
[39] Turban, E., King, D., Lee, J., Liang, T-P. & Turban, D.C. (2010). "Electronic commerce: A managerial perspective." Upper Saddle River, NJ, USA: Prentice Hall Press.
[40] Rigby, D. (2011). "The Future of Shopping." New York, NY, USA: Harvard Business Review. December 2011.
[41] Payne, A. & Frow, P. (2004). "The role of multichannel integration in customer relationship management." Industrial Marketing Management. 33(6), pp. 527-538. DOI: http://dx.doi.org/10.1016/j.indmarman.2004.02.002.
[42] Oh, L-B., Teo, H-H. & Sambamurthy, V. (2012). "The effects of retail channel integration through the use of information technologies on firm performance." Journal of Operations Management. 30, pp. 368-381. DOI: http://dx.doi.org/10.1016/j.jom.2012.03.001.
[43] Goersch, D. (2002). "Multi-channel integration and its implications for retail web sites." In the 10th European Conference on Information Systems (ECIS 2002), June 6–8, pp. 748–758.
[44] Nyrhinen, J., Wilska, T-A. & Leppälä, M. (2011). "Tulevaisuuden kuluttaja: Erika 2020 -hankkeen aineistonkuvaus ja tutkimusraportti." Jyväskylä: Jyväskylän yliopisto, Finland. (N:o 370/2011 Working paper).
[45] Ervasti, M., Isomursu, M. & Mäkelä, S-M. (2014). "Enriching Everyday Experience with a Digital Service: Case Study in Rural Retail Store." 27th Bled eConference, June 1-5, Bled, Slovenia, pp. 1-16.
[46] Burke, R.R. (2002). "Technology and the customer interface: what consumers want in the physical and virtual store." Academy of Marketing Science, 30(4), pp. 411-432. DOI: 10.1177/009207002236914.
[47] Mäkelä, S-M., Sarjanoja, E-M., Keränen, T., Järvinen, S., Pentikäinen, V. & Korkalo, O. (2013). "Treasure Hunt with Intelligent Luminaires." In the International Conference on Making Sense of Converging Media (AcademicMindTrek '13), October 01-04 (pp. 269-272). New York, NY, USA: ACM Press.

M.Sc. Kaisa Vehmas received her M.Sc. in Graphic Arts Technology from Helsinki University of Technology in 2003. She is currently working as a Senior Scientist in the Digital services in context team at VTT Technical Research Centre of Finland Ltd. Since 2002 she has worked at VTT and at KCL (2007-2009). Her background is in printing and media research, focusing nowadays on user centric studies dealing with participatory design, user experience and customer understanding, especially in the area of digital service development.

Dr. Mari Ervasti received her M.Sc. in Information Networks from the University of Oulu in 2007 and her PhD in Human-Centered Technology from Tampere University of Technology in 2012. She is currently working as a Research Scientist in the Digital services in context team at VTT Technical Research Centre of Finland Ltd. She has worked at VTT since 2007. Over the years, she has authored over 30 publications. In 2014 she received an Outstanding Paper Award at the Bled eConference. Her research interests include user experience, user-centered design and human computer interaction.

Dr. Maarit Tihinen is a Senior Scientist in the Digital services in context team at VTT Technical Research Centre of Finland. She graduated from the department of mathematics of the University of Oulu in 1991. She worked as a teacher (mainly mathematics and computer sciences) at the University of Applied Sciences before coming to VTT in 2000. She completed her Secondary Subject Thesis in 2001 and received her PhD in 2014 in information processing science from the University of Oulu, Finland. Her research interests include measurement and metrics, quality management, global software development practices and digital service development practices.

Lic.Sc. Aino Mensonen obtained her Postgraduate Degree in Media Technology in 1999 from Helsinki University of Technology. She is currently working as a Senior Scientist and Project Manager in the Digital services in context team at VTT Technical Research Centre of Finland Ltd. She started her career at the broadcasting company MTV Oy by monitoring TV viewing, and worked as a Research Engineer at KCL before coming to VTT in 2009. At the moment she is involved in several projects, including Trusted Cloud services, Collaborative methods and city planning, User experience, and Service concepts and development.

Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

GIS-based Optimal Route Selection for Oil and Gas Pipelines in Uganda

Dan Abudu1 and Meredith Williams2

1 Faculty of Engineering and Science, University of Greenwich, Chatham, ME4 4TB, United Kingdom [email protected]

2 Centre for Landscape Ecology and GIS, University of Greenwich, Chatham, ME4 4TB, United Kingdom [email protected]

Abstract
The Ugandan government recently committed to the development of a local refinery, benefiting from recently discovered oil and gas reserves and increasing local demand for energy supply. The project includes a refinery in Hoima district and a 205 kilometre pipeline to a distribution terminal at Buloba, near Kampala city. This study outlines a GIS-based methodology for determining an optimal pipeline route that incorporates Multi Criteria Evaluation and Least Cost Path Analysis. The methodology allowed for an objective evaluation of different cost surfaces for weighting the constraints that determine the optimal route location. Four criteria (Environmental, Construction, Security and Hybrid) were evaluated, used to determine the optimal route, and compared with the proposed costing and length specification targets issued by the Ugandan government. All optimal route alternatives were within 12 kilometres of the target specification. The construction criteria optimal route (205.26 km) formed a baseline route for comparison with the other optimal routes.
Keywords: GIS, MCE, LCPA, Oil & Gas, pipeline routing.

1. Introduction

The Lake Albertine region in Western Uganda holds large reserves of oil and gas that were discovered in 2006. Tests have been continually carried out to establish their commercial viability, and by August 2014 reserves of 6.5 billion barrels had been established [1, 2 & 3]. The Ugandan government plans to satisfy the country's oil demands through products processed at a local refinery to be built in Kabaale, Hoima district, and transported to a distribution terminal in Buloba, 14 kilometres from Kampala capital city [4]. Several options have been proposed for transporting the processed products from the refinery to the distribution terminal; this study explored one option: constructing a pipeline from Hoima to Kampala [5].

Determination of the optimal route for pipeline placement with the most cost effectiveness and least impact upon the natural environment and safety has been noted by Yeo and Yee [6] as a controversial spatial problem in pipeline routing. Impacts on animal migration routes, safety of nearby settlements, security of installations and financial cost implications are all important variables considered in optimal pipeline routing. Jankowski [7] noted that pipeline routing has conventionally been carried out using coarse-scale paper maps, hand delineation methods and manual overlaying of elevation layers. Although conventional, this emphasises the important role spatial data play in determining where the pipeline is installed, and it has pioneered advancement in spatial-based pipeline planning, routing and maintenance.

The approaches used in this paper are presented as an improvement and refinement of previous studies, such as those conducted by Anifowose et al. [8] in the Niger Delta, Nigeria, Bagli et al. [9] in Rimini, Italy, and Baynard [10] in the Venezuelan oil belts. This study was the first of its kind in the study area and incorporated both theory and practice from similar settings, with model scenarios tested to support the decision making process. The study recognised that evaluation of the best route is a complex multi criteria problem with conflicting objectives that need balancing. A pairwise comparison matrix and Multi Criteria Evaluation (MCE) were used to weight and evaluate the different factors necessary for deriving optimal routes, and Least Cost Path Analysis (LCPA) was then used to derive alternative paths that are not necessarily of shortest distance but are the most cost effective.

2. Study Area

Uganda is a landlocked country located in East Africa (Fig. 1). The refinery and distribution terminal locations define the start and end points respectively for the proposed pipeline route. The refinery is located near the shores of Lake Albert at Kabaale village, Buseruka sub-county in Hoima district, on a piece of land covering an area of 29 square kilometres. This location lies close to the country's largest oil fields in the Kaiso-Tonya area, which is 40 kilometres
by road from Hoima town. Kaiso-Tonya is also 260 kilometres by road from Kampala, Uganda's capital. The approximate coordinates of the refinery are 1⁰30'0.00"N, 31⁰4'48.00"E. The distribution terminal is located at Buloba town centre, approximately 14 kilometres by road west of Kampala city. The coordinates of Buloba are 0⁰19'30.00"N, 32⁰27'0.00"E. The geomorphology is characterised by a small sector of flat areas in the north-eastern region and rapidly changing terrain elsewhere, with elevations ranging from 574 to 4,877 metres above sea level. The most recent population census was carried out in 2014 and reported a total national population of 34.9 million, covering 7.3 million households with 34.4 million inhabitants [11]. This represented a population increment of 10.7 million people from the 2002 census. Subsistence agriculture is predominantly practiced throughout the country as a major source of livelihood, as well as fishing and animal grazing. Temperature ranges between 20-30 ºC, with annual rainfall between 1,000 and 1,800 mm.

Fig. 1: Location Map of Uganda, East Africa

3. Methodology

The methodology utilised a GIS to prepare, weight, and evaluate the environmental, construction and security factors used in the optimal pipeline routing. Estimates of local construction costs for specific activities, such as the actual costs of ground layout of pipes, building support structures in areas requiring above-ground installations, and maintenance costs, were beyond the scope of the available data. However, cost estimates averaged from published values for similar projects in the USA and China [12, 13 & 14] were used to estimate the total construction costs of the optimal route. Multi Criteria Evaluation of pairwise comparisons was used to calculate the relative importance of each of the three major criteria cost surfaces and a hybrid cost surface comprising all criteria factors. Different cost surfaces for each of the criteria were generated and evaluated to identify the combination of factors for an optimal pipeline route, and the route alternatives were determined using Least Cost Path Analysis.

3.1 Data

Achieving the study objectives required the use of both spatial and non-spatial data (Table 1). Data were obtained from government departments in Uganda and supplemented with other publicly available data. The choice of input factors was determined by the availability of data, their spatial dimensions and computational capacity. The study noted that there are many factors that can influence the routing of an oil and gas pipeline; however, only factors for which data were available were examined. Spatial consistency was attained by projecting all data to the Universal Transverse Mercator (UTM) projection, Zone 36N, for localised projection accuracy, and a spatial resolution of 30 m was maintained during data processing.

Table 1: Data used for designing the cost surface layers
Data                                          Format          Scale      Date
Wellbores & Borehole data                     Table & Points  1:60,000   2008
Rainfall & Evapotranspiration                 Table & Raster  30 metre   1990-2009
Soil map                                      Raster          30 metre   1970
Topography                                    Raster          30 metre   2009
Geology                                       Raster          30 metre   2011
Land cover                                    Raster          30 metre   2010
Soil                                          Raster          30 metre   2008
Population                                    Raster & Table  30 metre   2014
Wetlands                                      Raster          30 metre   2010
Streams (Minor & Major)                       Raster          30 metre   2007
Urban centres                                 Vector          1:60,000   2013
Protected sites                               Vector          1:60,000   2011
Boundary, source & destination                Vector          1:60,000   2014
Linear features (Roads, Rail, Utility lines)  Vector          1:60,000   2009
Construction costs                            Table           1:60,000   2009

3.2 Routing Criteria

Pipeline route planning and selection is usually a complex task involving simultaneous consideration of more than one criterion. Criteria may take the form of a factor or a constraint. A factor enhances or detracts from the suitability of a specific alternative for the activity under
consideration. For instance, routing a pipeline within close distance of roads is considered more suitable than routing it far away from roads; in this case, distance from the road constitutes a factor criterion. Constraints, on the other hand, serve to limit the alternatives under consideration; for instance, protected sites and large water bodies are not preferred in any way for pipelines to be routed through them.

Routing a pipeline is therefore more complex than simply laying pipes from the source refinery to the final destination. Natural and manmade barriers along possible routes have to be considered, as well as the influence these barriers have on the pipeline after installation. Accurate determination of the impact of these factors and constraints on pipeline routes is usually a time-consuming task requiring a skilled and dedicated approach [15]. This study employed a criteria-based approach in order to consider the different barriers and factors required to perform optimal pipeline route selection. Datasets were selected and processed into friction surfaces and grouped into three separate strands of criteria for analysis. Fig. 2 shows the implementation methodology and the grouping of the criteria (environmental, construction and security).

Environmental criteria

The environmental criteria were aimed at assessing the risks and impacts upon the environmental features found in potential corridors of the pipeline route. Two objectives were addressed: minimising the risks of ground water contamination (GWP), and maintaining the least degrading effect on the environment, such as effects on land cover, land uses, habitats and sensitive areas (DEE). A GIS-based DRASTIC Model (Fig. 3) was used to assess areas of ground water vulnerability, while a weighted overlay model was used to determine areas with the least degrading environmental effects.

Fig. 3: DRASTIC Model
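The DRASTIC index is a weighted linear overlay of seven rated hydrogeological layers. As a minimal sketch (not the study's actual implementation), using the standard DRASTIC weights listed in Table 2, the per-cell vulnerability index can be computed as:

```python
import numpy as np

# Standard DRASTIC weights (Table 2): Depth to water table, net Recharge,
# Aquifer media, Soil media, Topography, Impact of vadose zone,
# hydraulic Conductivity.
DRASTIC_WEIGHTS = {"D": 5, "R": 4, "A": 3, "S": 2, "T": 1, "I": 5, "C": 3}

def drastic_index(rated_layers):
    """Weighted overlay: sum of weight * rating over the seven layers.

    rated_layers maps each factor code to a 2-D array of ratings,
    all on the same 30 m grid.  A higher index means higher ground
    water vulnerability (the GWP objective).
    """
    grids = [w * np.asarray(rated_layers[k], dtype=float)
             for k, w in DRASTIC_WEIGHTS.items()]
    return np.sum(grids, axis=0)

# Tiny 2x2 illustration with hypothetical ratings for each factor.
layers = {k: np.full((2, 2), r) for k, r in
          zip("DRASTIC", [9, 6, 5, 4, 10, 6, 4])}
index = drastic_index(layers)
# Constant grid: 5*9 + 4*6 + 3*5 + 2*4 + 1*10 + 5*6 + 3*4 = 144
```

Cells with a higher index would then be assigned higher friction in the ground-water-protection cost surface.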

Construction criteria

Construction criteria considered factors and constraints that account for the costs of laying oil and gas pipelines along the route. Two objectives were addressed: maximising the use of existing rights of way around linear features such as roads and utility lines (ROW), and maintaining routing within areas of low terrain costs (HTC). Although the criteria aimed to minimise costs as much as possible, the maintenance of high levels of pipeline integrity was not compromised.
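As an illustration of how a terrain factor becomes a friction surface for the HTC objective, the sketch below reclassifies a slope raster; the class breaks and friction values are hypothetical, chosen only to mirror the logic that gentle slopes are cheap to cross while steep slopes needing levelling or supports are expensive:

```python
import numpy as np

def slope_to_friction(slope_deg):
    """Reclassify a slope raster (degrees) into friction values.

    Gentle slopes get low friction (cheap to cross); steep slopes that
    would need levelling or support posts get high friction.  The break
    points and friction scores below are illustrative assumptions.
    """
    slope = np.asarray(slope_deg, dtype=float)
    friction = np.empty_like(slope)
    friction[slope < 5] = 1                       # flat: ideal for pipe laying
    friction[(slope >= 5) & (slope < 15)] = 3
    friction[(slope >= 15) & (slope < 30)] = 6
    friction[slope >= 30] = 9                     # steep: supports needed
    return friction

reclass = slope_to_friction([[2, 10], [20, 45]])  # [[1, 3], [6, 9]]
```

The same pattern (continuous input, ordinal friction output) applies to the other factor layers before they are combined by weighted overlay.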

Security criteria

Fig. 2: Flow diagram of the implementation methodology

Oil and gas pipeline infrastructures have been vandalised and destroyed in unstable political and socio-economic environments [16]. Political changes in Uganda have often been violent, involving military takeovers that led to the destruction of infrastructure and resources. Therefore, the security of the proposed pipeline has always been a concern. Also, the proposed pipeline is projected to be laid
above ground, traversing different land cover types, administrative boundaries and cultural groupings comprising the study area. It is therefore imperative that security is given high importance in consideration of the pipeline route. Two objectives were addressed by the security criteria: first, facilitation of quick access to the pipeline facility (QCK), and secondly, protection of existing and planned infrastructures around the pipeline route (PRT). This is in line with the observation that pipeline infrastructure poses a high security risk to the environment and communities, and is of international concern [17]. Pipeline infrastructures suffer from illegal activities involving siphoning, destruction and sabotage, disrupting the supply of oil products. Similar studies, such as of the Baku-Tbilisi-Ceyhan (BTC) pipeline [18] and the Niger Delta pipeline [19], reported significant effects of pipeline infrastructure vandalism and the need for proper security planning to counter such activities during pipeline route planning. It is also important that oil and gas pipelines are regularly monitored and maintained against wear-and-tear effects on the pipe materials, pressure, and blockages inside the pipeline. Routing in locations with ease of access for maintenance, emergency response and protection against vandalism was therefore addressed.

3.3 Weighting Criteria

The weighting criteria were based on weights derived from literature review and expert opinions. Questionnaires were used to collate responses from experts, and standard weights (Table 2) sourced from literature were incorporated to weigh and derive the optimal routes.

Table 2: DRASTIC Model description and assigned standard weights
S/n  Factor                  Description                                                          Weight
1    Depth to water table    Depth from ground surface to water table.                            5
2    Net Recharge            The amount of water per unit area of land that penetrates the
                             ground surface and reaches the water table.                          4
3    Aquifer media           The potential area for water logging; the contaminant attenuation
                             of the aquifer inversely relates to the amount and sorting of the
                             fine grains.                                                         3
4    Soil media              The uppermost weathered area of the ground.                          2
5    Topography              The slope of the land surface.                                       1
6    Impact of vadose zone   The ground portion between the aquifer and soil cover in which
                             pores or joints are unsaturated.                                     5
7    Hydraulic conductivity  The ability of the aquifer to transmit water, thereby determining
                             the rate of flow of contaminant material within the ground water
                             system.                                                              3
Source: [21]

Values were assigned to each criterion based on their degree of importance within the containing criteria. For example, gentle slopes provide solid foundations for laying pipelines, so they received a higher weight (lower friction value) in the construction criteria, whereas steep slopes require levelling and/or support posts to raise the pipeline above ground and hence received a lower weight (higher friction value). Based on linguistic measures developed by Saaty [20], weights were assigned on a 1 to 9 semantic differential scale giving the relative rating of two criteria, where 9 is highest and 1 is lowest. The scale of differential scoring presumes that the row criterion is of equal or greater importance than the column criterion. The reciprocal values (1/3, 1/5, 1/7, or 1/9) were used where the row criterion is less important than the column criterion. A decision matrix was then constructed using Saaty's scale, and factor attributes were compared pairwise in terms of the importance of each criterion to that of the next level. A summary of the normalised weights derived from expert opinion is shown in Table 10.

3.4 Estimating the construction costs

The construction costs for each pipeline alternative were estimated using the economic model proposed by the Massachusetts Institute of Technology (MIT) Laboratory for Energy and the Environment (MIT-LEE) [13]. MIT applied the model to estimate the annual construction cost for a Carbon Dioxide (CO2) pipeline. The cost data used were based on natural gas pipelines, due to their relative ease of availability, and were used to estimate the pipeline construction costs. Although the rate of flow and pipeline thickness of these two types of pipelines (natural gas and oil) may differ, the land construction costs do not differ much. The costs of acquiring pipeline materials such as pipes, pump stations, diversions and support structures were not included in the analysis. Equation 1 illustrates the formula used to estimate the total construction cost (TCC) over the operating life of the pipeline in British Pounds Sterling (BPD):

TCC = LCC × CCF + OMC    (1)

Where, LCC is the land construction cost in BPD,
CCF is the Capital Charge Factor,
OMC is the annual operation & management costs in BPD.
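The pairwise comparison step of Section 3.3 can be sketched as follows: a reciprocal matrix is built on Saaty's 1-9 scale, and the priority vector is approximated by normalising each column and averaging across rows (the common approximation to the principal eigenvector). The 3x3 matrix below is a hypothetical example, not the study's actual expert judgements:

```python
import numpy as np

def ahp_weights(matrix):
    """Approximate AHP priority vector: normalise each column, average rows."""
    m = np.asarray(matrix, dtype=float)
    col_sums = m.sum(axis=0)
    return (m / col_sums).mean(axis=1)

# Hypothetical comparison of the three criteria surfaces
# (environmental vs construction vs security) on Saaty's 1-9 scale;
# entries below the diagonal are the reciprocals of those above it.
pairwise = [[1,     3,     5],
            [1 / 3, 1,     3],
            [1 / 5, 1 / 3, 1]]
w = ahp_weights(pairwise)  # weights sum to 1, largest for the first criterion
```

In practice each expert questionnaire yields one such matrix, and the normalised weights are averaged before being applied to the criteria cost surfaces.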

CCF values were defaulted to 0.15 and the OMC estimated at BPD 5,208.83 per kilometre per year irrespective of the pipeline diameter [14]. LCC was obtained from two correlation equations that relate LCC to the diameter and length of the pipeline. Equations 2 and 3 illustrate the formulae used to obtain LCC for the MIT and Carnegie Mellon University (CMU) correlation models respectively.

1. In the MIT correlation, it is assumed that the pipeline's LCC has a linear correlation with the pipeline's diameter and length:

LCC = a × D × (L × 0.62137) × i    (2)

Where, a = BPD 21,913.81 (variable value specific to the user) per inch per kilometre,
D is the pipeline diameter in inches,
L is the least-cost pipeline route length in kilometres,
i is optional: the cost fluctuation index due to increases in inflation and costs in a given year. The study used the running average for year 2007 (Table 3).

Table 3: MIT Correlation Price Index
Year   Index (i)   Running Average
2000   1.51        1.47
2001   1.20        1.48
2002   1.74        1.65
2003   2.00        2.01
2004   2.30        2.20
2005   2.31        2.30
2006   2.30        2.71
2007   3.53        2.92
Source: [13]

2. The CMU correlation model is similar to the MIT model. However, it is more recent and departs from the linearity restriction in the MIT correlation, allowing a double-log (nonlinear) relationship between pipeline LCC and pipeline diameter and length. In addition, the CMU correlation model takes into account regional differences in pipeline construction costs by using regional dummy variables. The two correlations provided comparative results for the study area.

LCC = b × D^x × (L × 0.62137)^y × z × i    (3)

Where, b = BPD 27,187.55,
D = pipeline diameter in inches and x = 1.035,
L = pipeline length in kilometres and y = 0.853,
z = regional weights = 1 (since regional weights are constant),
i is optional: the cost fluctuation index due to increases in inflation and costs in a given year (Table 4). The study used the running average index for year 2007.

Table 4: CMU Correlation Price Index
Year   Index (i)   Running Average
2000   1.09        1.05
2001   0.99        1.08
2002   1.17        1.16
2003   1.33        1.35
2004   1.56        1.47
2005   1.52        1.57
2006   1.68        1.59
2007   2.46        2.07
Source: [13]

4. Results and Discussion

This section presents the results of the various analyses carried out in the study. Maps, figures and tables make up the content, together with detailed descriptions and discussion of the results shown.

4.1 Weights used in the study

The study employed both primary and secondary data. Primary data were obtained from a sample of experts in the fields of oil and gas, environment, plus cultural and political leaders. Questionnaires were used to collect expert opinions from 20 respondents from each of the three fields. Fig. 4 shows the categories of respondents and the percentage responses obtained for each of the categories. Table 10 shows the comparative responses normalised in percentage.

Fig. 4: Respondents collated from questionnaires

4.2 Environment cost surface

An environmental cost surface (Fig. 5C) was obtained by applying equal weighting to two objective-based cost surfaces: maintaining the least degrading effect on the
environment (DEE), and protection of ground water from contamination arising from pipeline-related activities (GWP), represented in Fig. 5 (A) and (B) respectively. Additionally, studies by Secunda et al. [22] revealed that assuming constant values for missing layers in the DRASTIC Model produced the same results as when all seven layers were used. This study applied constant values to three cost layers (Net Recharge, Impact of vadose zone and Hydraulic conductivity) based on literature, because these layers have values representing a country-wide extent [23].

4.3 Construction cost surface

A construction cost surface (Fig. 6C) was obtained by applying equal weighting to two objective-based cost surfaces: maintaining the use of areas with existing right of way (ROW, Fig. 6A) and minimising areas with high terrain cost (HTC, Fig. 6B). The cost surfaces for both ROW and HTC show that the distribution of the costs covers the entire study area. Over 50% of the study area presented very low ROW, with a few areas in the West, Central and Eastern parts of the study extent recording high costs, indicating areas of urban concentration, Mount Elgon to the East, and the protected sites covering the South-Western and North-Eastern parts of the study area. Similarly, one protected site (licensed for oil drilling purposes) and all major streams (lakes and rivers) presented higher costs to the construction criteria. Much of the Central and Northern parts of the country are cheaper. Moderate construction costs are observed around areas covered by protected sites such as national parks, cultural sites, wildlife reserves and sanctuaries. This is because the importance of these protected sites is evaluated entirely in economic terms (the ROW and HTC objectives).

4.4 Security cost surface

A security cost surface was obtained from equal weighting of the QCK and PRT cost surfaces, the two objective-based cost surfaces through which the security criteria were achieved. The results are shown in Fig. 7 (A), (B) and (C) for the QCK, PRT and security criteria cost surfaces respectively. In the three maps, costs were represented as continuous surfaces.

4.5 Hybrid cost surface

The final cost surface obtained is the hybrid cost surface, where the six cost surfaces (DEE, GWP, ROW, HTC, QCK and PRT) were combined and equally weighted. A continuous surface was generated, as shown in Fig. 8 (A).

4.6 Optimal route

Table 5 shows the accumulated costs incurred by each route and the total distance traversed by the optimal routes. While the diameter of the actual pipes for the proposed pipeline has yet to be decided, a buffer of 1 kilometre was applied around the optimal routes to generate a strip accounting for the potential right-of-way. Also, there were no routing activities conducted for oil and gas pipelines in the study area prior to this study. The Government's estimated total distance for the pipeline route, determined by neutral criteria, was 205 kilometres [4]. Therefore, this study considered the optimal route with the shortest length as a baseline route for comparisons with the other optimal routes.

Table 5: Costs and lengths of the optimal routes
Optimal route alternatives   Accumulated cost distance   Pipeline length (km)   Length difference from the proposed length
Environmental                1,529,468.00                213.09                 +8.09
Construction                 1,363,801.75                205.26                 +0.26
Security                     1,393,417.50                209.52                 +4.52
Hybrid                       1,255,547.75                215.11                 +10.12

The construction criteria optimal route was the shortest, with a length of 205.26 kilometres, a 0.26 kilometre increase over the 205 km estimate proposed by the Ugandan government. From Table 5, the environmental, security and hybrid routes are respectively 8.09, 4.52 and 10.12 kilometres longer than the proposed route. The baseline route also has an accumulated cost cheaper than both the security and environmental criteria routes. However, the hybrid criteria optimal route is 1.95% cheaper than the baseline route. This suggests that the incorporation of multiple constraints and criteria in the optimal route selection minimises the resultant costs associated with routing.

4.7 The financial implications of each optimal route

Construction cost estimates from Tables 6 and 7 show that construction costs vary linearly with increases in both pipeline diameter and length across the two models. The shorter the route and the narrower the pipeline, the cheaper the construction costs. Fig. 10 shows a graphical representation of the linear relationship between pipeline construction costs and both pipeline diameter and length.
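Equations 1-3 can be implemented directly from the constants quoted in Section 3.4 (a = BPD 21,913.81; b = BPD 27,187.55; CCF = 0.15; OMC = BPD 5,208.83 per km per year; 2007 running-average indices 2.92 and 2.07). The sketch below reproduces the MIT capital cost for the 24-inch, 205.26 km construction route; treating OMC as a per-year charge is an assumption, since the paper does not state the operating life it is summed over:

```python
MIT_A = 21_913.81        # BPD per inch per kilometre (Eq. 2)
CMU_B = 27_187.55        # BPD (Eq. 3)
CCF = 0.15               # capital charge factor
OMC_PER_KM = 5_208.83    # BPD per kilometre per year
KM_TO_MILES = 0.62137

def lcc_mit(d_in, l_km, i=2.92):
    """Eq. (2): linear MIT correlation (i = 2007 running average, Table 3)."""
    return MIT_A * d_in * (l_km * KM_TO_MILES) * i

def lcc_cmu(d_in, l_km, i=2.07, x=1.035, y=0.853, z=1.0):
    """Eq. (3): double-log CMU correlation (i = 2007 running average, Table 4)."""
    return CMU_B * d_in**x * (l_km * KM_TO_MILES)**y * z * i

def tcc(lcc, l_km, years=1):
    """Eq. (1): TCC = LCC * CCF + OMC (years of OMC is an assumption)."""
    return lcc * CCF + OMC_PER_KM * l_km * years

# 24-inch pipeline on the 205.26 km construction-criteria route:
capital = lcc_mit(24, 205.26) * CCF   # ~29.4 million BPD, matching Table 6
```

The capitalised MIT term reproduces the 29.4 million BPD figure in Table 6 for the construction route; a longer or wider pipe raises the estimate in direct proportion, which is the trend both tables show.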

Table 6: TCC estimates for the optimal routes based on the MIT Model (total construction cost in millions of BPD)
Optimal Routes   Pipeline length (km)   Pipeline diameter in inches
                                        8     16    18    24    30    36    40    42
Environmental    213.09                 10.2  20.3  22.9  30.5  38.1  45.8  50.8  53.4
Construction     205.26                 9.8   19.6  22.0  29.4  36.7  44.1  49.0  51.4
Security         209.52                 10.0  20.0  22.5  30.0  37.5  45.0  50.0  52.5
Hybrid           215.11                 10.3  20.5  23.1  30.8  38.5  46.2  51.3  53.9

Table 7: TCC estimates for the optimal routes based on the CMU Model (total construction cost in millions of BPD)
Optimal Routes   Pipeline length (km)   Pipeline diameter in inches
                                        8     16    18    24    30    36    40    42
Environmental    213.09                 7.0   14.4  16.3  21.9  27.6  33.4  37.2  39.2
Construction     205.26                 6.8   14.0  15.8  21.3  26.8  32.3  36.1  37.9
Security         209.52                 6.9   14.2  16.1  21.6  27.2  32.9  36.7  38.6
Hybrid           215.11                 7.1   14.5  16.4  22.1  27.9  33.7  37.5  39.5

Considering the total construction cost for a 24-inch diameter pipeline, the total construction cost for the Government's proposed pipeline route is 29.34 million BPD, whereas the security, environmental and hybrid routes are 30.0, 30.5 and 30.8 million BPD respectively using the MIT Model. The CMU Model shows a similar trend, with the baseline route (the shortest) also doubling as the cheapest route, estimated at 21.3 million BPD, followed by security, then environmental, and finally hybrid at 21.6, 21.9 and 22.1 million BPD respectively.

Therefore, the financial implication of each optimal route shows the construction criteria optimal route as the cheapest and most feasible. The other three optimal routes (security, environmental and hybrid), although longer and more expensive, are all within 1.59 and 2.54 million BPD of the baseline under the CMU and MIT models' construction cost estimates respectively.

4.8 Effects of optimal routes on land cover and uses

Twelve different land cover types were considered in the study, seven of which (Table 9) were crossed by at least one of the four optimal routes. Woodland, grassland, small-scale farmland, wetlands and degraded tropical high forest were all crossed by the optimal routes. The environmental and hybrid optimal routes were the only routes that crossed bushland. Also, the construction and security optimal routes were the only routes that crossed stocked tropical high forest. Land uses such as roads, urban centres and protected sites were crossed by at least one of the four optimal routes. Linear features (roads, rail roads, utility lines) and minor streams were among the features most crossed by the optimal routes. No urban centres or protected sites were directly crossed by the optimal routes. However, when a spatial buffer of 200 m was applied around the urban centres, five urban centres and one protected site were crossed by the optimal routes (Table 8). Of the affected urban centres, four were crossed by the security optimal route, while the hybrid optimal route crossed one urban centre. The location of the refinery is within a 1 km buffer around one of the protected sites (Kaiso-Tonya Community Wildlife Management Area).

4.9 Monitoring and maintenance planning along the optimal routes

In order to properly monitor and maintain efficient operation of the pipeline, pipeline routes were preferred to be near linear features such as roads, rail roads and utility lines, since they provide quick and easy access to the pipeline facility. Also, locations near streams were preferred, to allow access by water navigation. For planning purposes, such as the installation of monitoring and maintenance facilities (engineering workshops and security installations), areas with a clear line of sight are recommended. The study therefore performed Viewshed analysis [24] on the topographical data to determine visible areas. Fig. 9 (B) shows the locations visible from each of the four optimal routes, as determined from ArcGIS Viewshed Analysis. Although the Viewshed analysis performed on the DEM does not consider above-ground obstructions from land cover types such as vegetation and buildings, this can be compensated for by installing monitoring facilities at an appropriate height above ground while maintaining the site location.

5. Sensitivity testing of weighting schemes

5.1 The effect of equal weighting and weights obtained from expert opinion on the optimal routes

Equal weightings were applied to combine criteria objectives and generate criteria cost surfaces as the first stage of analysis. Weights normalised from expert opinions were then used to provide comparative results of the analysis for the environmental, construction and security criteria. The hybrid criteria were not affected, because non-equal weightings were applied at the objectives evaluation level. The significant result was shown in the
environmental criteria route where the 25% weight change in the DEE objective resulted in a 7.79% (16.61 km) Table 8: Number of crossings by the optimal routes through buffer zones increase in the overall pipeline length under environmental Features Environmental Construction Security Hybrid criteria. This was the highest change in the pipeline length Roads 10 12 10 13 Lakes & followed by security criteria at (0.44 km) and lastly 0 0 0 0 construction criteria at 0.05 km. Environmental criteria Rivers Minor optimal route was also the longest route with a total length 14 9 13 16 at 229.70 km followed by hybrid at 215.11 km, security at Streams Utility 210.18 km and lastly construction criteria at 205.31 km. 2 2 2 2 Lines Although, the environmental route had the longest length, Rail roads 0 1 0 0 security criteria accumulated the highest cost while Urban 0 0 4 1 construction had the least accumulated cost distances. centres Protected 1 1 1 1 5.2 Application of non-equal weighting on criteria to sites generate hybrid route Total 27 25 30 33

Figures 5 and 11 show the location of the hybrid optimal route generated from the application of equal weighting on the three criteria (environmental, construction and security). The route passes within 1.51 kilometres south of Hoima town. By applying an un-equal weighting in which the environmental criteria accounted for 50% of the total weight, with security and construction at 25% each, the route was shifted 12 km further south of Hoima town (Fig. 11). Other urban centres such as Kitoba and Butemba that were initially close to the equal-weighted hybrid route (11.83 and 11.96 km respectively) were also shifted (50 and 20 km respectively) away from the non-equal-weighted route.

The length of the non-equal-weighted hybrid route decreased from 215.11 km to 212.94 km, representing a construction cost decrement of 0.3 BPD based on the MIT model for a 24-inch pipeline. Using the CMU model, the construction cost decrement is 0.2 BPD for the same pipeline diameter. Similarly, increasing the security and construction criteria weights to 50% respectively, while maintaining the environmental criteria weight at 25% in each case, resulted in cheaper routes but presented real risks to some urban centres. For instance, the 50% security criteria weighting resulted in the hybrid optimal route crossing the buffer zone of Ntwetwe town while avoiding Bukaya by 0.2 kilometres (Fig. 9C).

Table 9: Areal coverage (square metres) of land cover type crossed by each pipeline route

Land cover                       Environmental  Construction   Security     Hybrid
Grassland                            2,223,000       386,100     27,900  2,014,200
Bushland                               270,000             0          0    346,500
Woodland                               957,600     1,208,700    600,300    560,700
Small-scale farmland                 2,219,400     4,383,900  4,161,600  3,029,400
Wetland                                 27,900       261,000    288,000     76,500
Tropical high forest (stocked)               0        52,200    244,800          0
Tropical high forest (degraded)        253,800       231,300     15,300    278,100
Total                                5,951,700     6,523,200  5,337,900  6,305,400
Although applying un-equal weighting to the hybrid criteria optimal route had no incremental effect on the total length and cost of the pipeline, the potential effects on the other criteria routes are visible. In general, however, un-equal weighting had minimal adverse effects upon the environmental, construction and hybrid optimal routes.
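The weighting procedure described above — normalised criteria cost surfaces combined as a weighted sum, under either equal weights or an expert-derived split such as the 50/25/25 hybrid weighting — can be sketched in a few lines. This is an illustrative sketch only: the function name and the toy rasters are hypothetical stand-ins, not the study's actual data or code; only the 50/25/25 split mirrors the text.

```python
def combine_cost_surfaces(surfaces, weights):
    """Weighted-sum overlay of normalised criteria cost rasters.

    surfaces: dict criterion -> 2-D list of cell costs (same shape,
              values already normalised to a common scale).
    weights:  dict criterion -> weight; weights should sum to 1.
    """
    total = sum(weights.values())
    assert abs(total - 1.0) < 1e-9, "weights must sum to 1"
    names = list(surfaces)
    rows = len(surfaces[names[0]])
    cols = len(surfaces[names[0]][0])
    return [[sum(weights[n] * surfaces[n][r][c] for n in names)
             for c in range(cols)] for r in range(rows)]

# Toy 2x2 rasters standing in for the environmental, construction and
# security criteria cost surfaces (cell values are illustrative only).
surfaces = {
    "environmental": [[1.0, 2.0], [3.0, 4.0]],
    "construction":  [[2.0, 2.0], [2.0, 2.0]],
    "security":      [[4.0, 0.0], [0.0, 4.0]],
}
equal = combine_cost_surfaces(surfaces, dict.fromkeys(surfaces, 1 / 3))
hybrid = combine_cost_surfaces(
    surfaces,
    {"environmental": 0.50, "construction": 0.25, "security": 0.25})
```

Re-running the overlay with a different weight vector and then re-deriving the least-cost path over the new surface is essentially the sensitivity experiment reported above: only the weights change, the input surfaces stay fixed.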

Fig. 5: Location of the optimal routes


Table 10: Summary of normalised factor weights (%) used in determination of cost surface layers

1. DEE Objective (Environmental criteria): Urban centres 7.53; Land cover 50.92; Protected sites 26.30; Wetlands 15.25
2. ROW Objective (Construction criteria): Linear features 5.83; Population density 0.55; Protected sites 24.78; Cultural landmarks 14.38
3. HTC Objective (Construction criteria): Land cover 6.48; Soil 38.52; Topography 18.31; Linear features 10.88; Geology 25.18
4. QCK Objective (Security criteria): Linear features 20.16; Streams 30.62; Dense land cover 8.13; Urban centres 41.08
5. PRT Objective (Security criteria): Urban centres 20.16; Protected sites 30.62; Linear features 8.13; Cultural landmarks 41.08


Fig. 6: Cost surface maps showing DEE (A), GWP (B) objectives and combined environmental criteria cost surface (C)


Fig. 7: Cost surface maps showing ROW (A) and HTC (B) objectives and combined Construction criteria cost surface (C)



Fig. 8: Cost surface map showing the ROW objective (A) and the PRT objective (B) and combined Security criteria cost surface (C)



Fig. 9: Hybrid cost surface map (A), visible locations to optimal routes (B) and all five route alternatives (C)


6. Conclusions

This paper presented a GIS-based methodology for identifying an optimal and cost-effective route for an oil and gas pipeline, taking into consideration the environmental, economic and security concerns associated with oil and gas pipeline routing. The effects on land cover and land uses, groundwater contamination, investment costs, human and wildlife security, and emergency responses to adverse events such as oil spillage and pipeline leakage were considered in the routing activity. Given that governments and the religious affiliations of the people can change at any time, factors with long-term effects upon the installation and operation of the oil and gas pipelines were key in the decision-making process. While the analyses were successful and the objectives achieved, the study noted that community participation in pipeline routing is the most essential component of any complex multi-criteria study. Socio-political, socio-economic and religious factors, for which data are often unavailable or unreliable, are recommended for incorporation in any future studies. Similarly, where compulsory land purchases are required, land price surveys should be conducted to estimate the pre-installation market values of land.

Acknowledgments

The authors acknowledge the technical and software support obtained from the Faculty of Engineering and Science, University of Greenwich. The authors also thank the various departments of the Uganda government, the GIST-ITOS project, the Nile Basin Initiative and the USGS Earth Explorer project, to name but a few, that provided the required data. Finally, the lead author's profound gratitude goes to the Tullow Group Scholarship Scheme for providing the scholarship funding.

Fig. 10: Construction costs variation

References
[1] F.A.K. Kaliisa, "Uganda's petroleum resources increase to 6.5 billion barrels oil in place", Department of Petroleum, Uganda, 2014. Available at: http://www.petroleum.go.ug/news/17/Ugandas-petroleum-resources-increase-to-65-billion-barrels-oil-in-place. Last accessed: 15/06/2015.
[2] C.A. Mwesigye, "Why Uganda is the Best Investment Location in Africa", State House, Uganda, 2014. Available at: http://www.statehouse.go.ug/search/node/Why%20Uganda%20is%20the%20Best%20Investment%20Location%20in%20Africa. Last accessed: 15/06/2015.
[3] US EIA, "Uganda: Country Analysis Note", 2014. Available at: http://www.eia.gov/beta/international/country.cfm?iso=UGA. Last accessed: 15/04/2015.

Fig. 11: Location of visible areas to the optimal routes


[4] PEPD, "Uganda's Refinery Project Tender Progresses to the Negotiations Phase", Department of Petroleum, Uganda, 2014. Available at: http://www.petroleum.go.ug/news/13/Ugandas-Refinery-Project-Tender-Progresses-to-the-Negotiations-Phase. Last accessed: 15/06/2015.
[5] Business Week, "Oil boss says pipeline quickest option for Uganda", East African Business Week, 2014. Available at: http://www.busiweek.com/index1.php?Ctp=2&pI=2052&pLv=3&srI=53&spI=20. Last accessed: 05/11/2014.
[6] I. Yeo and J. Yee, "A proposal for a site location planning model of environmentally friendly urban energy supply plants using an environment and energy geographical information system (E-GIS) database (DB) and an artificial neural network (ANN)", Applied Energy, Vol. 119, 2014, pp. 99–117.
[7] P. Jankowski, "Integrating geographical information systems and multiple criteria decision-making methods", International Journal of Geographical Information Systems, Vol. 9, No. 3, 1995, pp. 251–273.
[8] B. Anifowose, D.M. Lawler, V.D. Horst, and L. Chapman, "Attacks on oil transport pipelines in Nigeria: A quantitative exploration and possible explanation of observed patterns", Applied Geography, Vol. 32, No. 2, 2012, pp. 636–651.
[9] S. Bagli, D. Geneletti, and F. Orsi, "Routeing of power lines through least-cost path analysis and multicriteria evaluation to minimise environmental impacts", Environmental Impact Assessment Review, Vol. 31, 2011, pp. 234–239.
[10] C.W. Baynard, "The landscape infrastructure footprint of oil development: Venezuela's heavy oil belt", Ecological Indicators, Vol. 11, 2011, pp. 789–810.
[11] UBOS, "National Population and Housing Census 2014, Provisional Results", 2014. Available at: http://www.ubos.org/onlinefiles/uploads/ubos/NPHC/NPHC%202014%20PROVISIONAL%20RESULTS%20REPORT.pdf. Last accessed: 13/02/2015.
[12] B. Bai, X. Li, and Y. Yuan, "A new cost estimate methodology for onshore pipeline transport of CO2 in China", Energy Procedia, Vol. 37, 2013, pp. 7633–7638.
[13] CCSTP, "Carbon Management GIS: CO2 Pipeline Transport Cost Estimation", Carbon Capture and Sequestration Technologies Program, Massachusetts Institute of Technology, 2009. Available at: http://sequestration.mit.edu/energylab/uploads/MIT/Transport_June_2009.doc. Last accessed: 02/05/2015.
[14] G. Heddle, H. Herzog, and M. Klett, "The Economics of CO2 Storage", MIT LFEE 2003-003 RP, 2003. Available at: http://mitei.mit.edu/system/files/2003-03-rp.pdf. Last accessed: 12/05/2014.
[15] Oil & Gas, "GIS leads to more efficient route planning", Oil & Gas Journal, Vol. 91, No. 17, 1993, pp. 81.
[16] S. Pandian, "The political economy of trans-Pakistan gas pipeline project: assessing the political and economic risks for India", Energy Policy, Vol. 33, No. 5, 2005, pp. 659–670.
[17] S.k.N. Hippu, S.K. Sanket, and R.A. Dilip, "Pipeline politics—A study of India's proposed cross border gas projects", Energy Policy, Vol. 62, 2013, pp. 145–156.
[18] G. Dietl, "Gas pipelines: politics and possibilities", in I.P. Khosla, Ed., Energy and Diplomacy, Konark Publishers, New Delhi, 2005, pp. 74–90.
[19] F. Onuoha, "Poverty, Pipeline Vandalisation/Explosion and Human Security: Integrating Disaster Management into Poverty Reduction in Nigeria", African Security Review, Vol. 16, No. 2, 2007, pp. 94–108, DOI: 10.1080/10246029.2007.9627420.
[20] T.L. Saaty, The Analytical Hierarchy Process, New York: Wiley, 1980.
[21] R.A.N. Al-Adamat, I.D.L. Foster, and S.N.J. Baban, "Groundwater vulnerability and risk mapping for the Basaltic aquifer of the Azraq basin of Jordan using GIS, Remote sensing and DRASTIC", Applied Geography, Vol. 23, 2003, pp. 303–324.
[22] S. Secunda, M.L. Collin, and A.J. Melloul, "Groundwater vulnerability assessment using a composite model combining DRASTIC with extensive agricultural land use in Israel's Sharon region", Journal of Environmental Management, Vol. 54, 1998, pp. 39–57.
[23] MWE, "Water Supply Atlas 2010", 2012. Available at: http://www.mwe.go.ug/index.php?option=com_docman&task=cat_view&gid=12&Itemid=223. Last accessed: 15/02/2015.
[24] E.E. Jones, "Using Viewshed Analysis to Explore Settlement Choice: A Case Study of the Onondaga Iroquois", American Antiquity, Vol. 71, No. 3, 2006, pp. 523–538.

Dan Abudu was awarded a BSc in Computer Science (First Class) by Gulu University, Uganda in 2010 and continued his career in data management, serving at organisations such as the Joint Clinical Research Centre and the Ministry of Finance, Planning and Economic Development in Uganda, and at Kingsway International Christian Centre, UK. He briefly served in academia as a Teaching Assistant (August 2010 – April 2011). Dan has been active in GIS and Remote Sensing research since 2013, with keen interests in GIS applications in the oil and gas sector. He is a member of the African Association of Remote Sensing of the Environment (AARSE) and was awarded an MSc in GIS with Remote Sensing from the University of Greenwich, UK on 22nd July 2015.

Dr. Meredith Williams is a Senior Lecturer in Remote Sensing and GIS at the Centre for Landscape Ecology & GIS, University of Greenwich, Medway Campus, UK, with over 23 years' experience in applied GIS and Remote Sensing. He specialises in the application of Remote Sensing and GIS to the monitoring of vegetation health, land cover change, and fluvial systems. He has supervised a wide range of PhD and MSc projects, including several in collaboration with the oil and gas industry.


Hybrid Trust-Driven Recommendation System for E-commerce Networks

Pavan Kumar K. N1, Samhita S Balekai1, Sanjana P Suryavamshi1, Sneha Sriram1, R. Bhakthavathsalam2

1 Department of Information Science and Engineering, Sir M. Visvesvaraya Institute of Technology, Bangalore, Karnataka, India
[email protected], [email protected], [email protected], [email protected]

2 Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, Karnataka, India
[email protected]

Abstract
In traditional recommendation systems, the challenging issues in adopting similarity-based approaches are sparsity, cold-start users and trustworthiness. We present a new paradigm of recommendation system which can utilize information from social networks, including user preferences, an item's general acceptance, and influence from friends. A probabilistic model, particularly for e-commerce networks, is developed in this paper to make personalized recommendations from such information. Our analysis reveals that similar friends have a tendency to select the same items and give similar ratings. We propose a trust-driven recommendation method known as HybridTrustWalker. First, a matrix factorization method is utilized to assess the degree of trust between users. Next, an extended random walk algorithm is proposed to obtain recommendation results. Experimental results show that our proposed system improves the prediction accuracy of recommendation systems, remedying the issues inherent in collaborative filtering, to lower the user's search effort by listing items of highest utility.

Keywords: Recommendation system, Trust-Driven, Social Network, e-commerce, HybridTrustWalker.

1. Introduction

Recommendation systems (RS) (sometimes "system" is replaced with a synonym such as platform or engine) are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item. RSs have changed the way people find products, information, and even other people. They study patterns of behaviour to learn what someone will prefer from among a collection of things he has never experienced. RSs are primarily directed towards individuals who lack sufficient personal experience or competence to evaluate the potentially overwhelming number of alternative items that a Web site, for example, may offer. A case in point is a book recommendation system that assists users to select a book to read. The popular Website Amazon.com employs a RS to personalize the online store for each customer. Since recommendations are usually personalized, different users or user groups receive diverse suggestions. In addition, there are also non-personalized recommendations. These are much simpler to generate and are normally featured in magazines or newspapers; typical examples include the top ten selections of books, CDs etc. While they may be useful and effective in certain situations, these types of non-personalized recommendations are not typically addressed by RS research.

1.1 Recommendation System Functions

First, we must distinguish between the role played by the RS on behalf of the service provider and that of the user of the RS. For instance, a travel recommendation system is typically introduced by a travel intermediary (e.g., Expedia.com) or a destination management organization (e.g., Visitfinland.com) to increase its turnover, i.e. sell more hotel rooms (Expedia), or to increase the number of tourists to the destination. The user's primary motivation for accessing the two systems, by contrast, is to find a suitable hotel and interesting events/attractions when visiting a destination [1]. In fact, there are various reasons why service providers may want to exploit this technology:

Increase in sales: This goal is achieved because the recommended items are likely to satisfy users' functional preferences. Presumably the user will recognize this after having tried several recommendations. From the service providers' point of view, the primary goal of introducing a RS is to increase the conversion rate, i.e. the number of users that accept the recommendation and consume an item, compared to the number of visitors that merely browse through for information.

Exposure to a wider product range: Another major function of a RS is to enable the user to select items that might be hard to find without a precise recommendation. For instance, in a movie RS such as Netflix, the service


provider is interested in renting all the DVDs in the catalogue, not just the most popular ones.

Consolidating user satisfaction and fidelity: The user will find the recommendations interesting, relevant and accurate, and when combined with a properly designed human-computer interaction she will also enjoy using the system. Personalization of recommendations improves user loyalty. Consequently, the longer the user interacts with the site, the more refined her user model becomes, i.e., the system representation effectively customizes recommendations to match the user's preferences.

Improve QoS through customer feedback: Another important function of a RS, which can be leveraged in many other applications, is the description of the user's preferences, either collected explicitly or predicted by the system. The service provider may then decide to reuse this knowledge for a number of other goals, such as improving the management of the item's stock or production.

1.2 Common Recommendation Techniques

In order to implement its core function, identifying the useful items for the user, an RS must predict that an item is worth recommending. To do this, the system must be able to predict the utility of some of the items, or at least compare the utility of some items, and then decide what items to recommend based on this comparison. The prediction step may not be explicit in the recommendation algorithm, but we can still apply this unifying model to describe the general role of a RS. Some of the common recommendation techniques are given below:

Collaborative filtering: The simplest and original implementation of this approach recommends to the target user the items that other users with similar tastes liked. The similarity of taste between two users is calculated based on the rating history of the users. Collaborative filtering is considered to be the most popular and widely implemented technique in RS. Neighbourhood methods focus on relationships between items or, alternatively, between users. An item-item approach models the preference of a user for an item based on the ratings of similar items by the same user. Nearest-neighbour methods enjoy considerable popularity due to their simplicity, efficiency, and their ability to produce accurate and personalized recommendations. The essential decisions required when implementing a neighbourhood-based recommender system, and practical information on how to make them, are addressed in [2].

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the horror genre, then the system can learn to recommend other movies from this genre.

Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches. Many Web sites adopt simple and effective personalization solutions based on demographics. For example, users are dispatched to particular Web sites based on their language or country, or suggestions may be customized according to the age of the user. While these approaches have been quite popular in the marketing literature, there has been relatively little proper RS research into demographic systems.

Knowledge-based: Recommendation is based on specific domain knowledge about how certain item features meet users' needs and preferences and, ultimately, how the item is useful for the user. In these systems a similarity function estimates how well the user's needs match the recommendation. The similarity score can be directly interpreted as the utility of the recommendation for the user. Constraint-based systems are another type of knowledge-based RS. In terms of the knowledge used, both systems are similar: user requirements are collected; repairs for inconsistent requirements are automatically proposed in situations where no solutions can be found; and recommendation results are explained. The major difference lies in the way solutions are calculated. Knowledge-based systems tend to work better than others at the beginning of their deployment, but if they are not equipped with learning components they may be surpassed by other shallow methods that can exploit the logs of the human/computer interaction (as in CF).

1.3 Problems in Existing Recommendation Systems

Sparsity problem: In addition to the extremely large volume of user-item rating data, usually only a certain number of users rate a small fraction of the available items. As a result, the density of the available user feedback data is often less than 1%. Due to this data sparsity, collaborative filtering approaches suffer significant difficulties in identifying similar users or items via common similarity measures, e.g., the cosine measure, in turn deteriorating the recommendation performance.

Cold-start problem: Apart from sparsity, the cold-start problem, e.g., users who have provided only a little feedback, items that have been rated infrequently, or even new users and new items, is a more serious challenge in recommendation research. Because of the lack of user feedback, similarity-based approaches cannot handle such cold-start cases.

Trustworthiness problem: Prediction accuracy in recommendation systems requires a great deal of


consideration, as it has such a strong impact on customer experience. Noisy information and spurious feedback with malicious intent must be disregarded in recommendation considerations. Trust-driven recommendation methods refer to a selective group of users that the target user trusts and use their ratings while making recommendations. Employing 0/1 trust relationships, where each trusted user is treated as an equal neighbour of the target user, proves to be rudimentary, as it does not encapsulate the underlying level of trust between users.

As a solution, the concept of trust relevancy [3] is introduced first, which measures the trustworthiness factor between neighbours, defining the extent to which the trusted user's rating affects the target user's predicted rating of the item. Next, the algorithm HybridTrustWalker performs a random walk on the weighted network. The result of each iteration is polymerised to predict the rating that a target user might award to an item to be recommended. Finally, we conduct experiments with a real-world dataset to evaluate the accuracy and efficiency of the proposed method.

2. Related Work

Since the first paper published in 1998, research in recommendation systems has greatly improved the reliability of recommendations, which has been attributed to several factors. Paolo Massa and Bobby Bhattacharjee, in their paper "Using Trust in Recommender Systems: An Experimental Analysis" (2004), show that any two users usually have few items rated in common. For this reason, the classic RS technique is often ineffective and is not able to compute a user similarity weight for many of the users. In 2005, John O'Donovan and Barry Smyth described a number of ways to establish profile-level and item-level trust metrics, which could be incorporated into standard collaborative filtering methods.

Shao et al. (2007) proposed a user-based CF algorithm using the Pearson Correlation Coefficient (PCC) to compute user similarities. PCC measures the strength of the association between two variables. It uses historical item ratings to classify similar users and predicts the missing QoS values of a web service by considering the QoS values of services used by users similar to her [4].

Zheng et al. furthered the collaborative filtering dimension of recommendation systems for web service QoS prediction by systematically combining both item-based PCC (IPCC) and user-based PCC (UPCC). However, the correlation methods face challenges in providing recommendations for cold-start users, as these methods consider users with similar QoS experiences for the same services to be similar [3].

The most common trust-driven recommendation approaches make users explicitly issue trust statements for other users. Golbeck proposed an extended breadth-first-search method in the trust network for prediction called TidalTrust [5]. TidalTrust finds all neighbours who have rated the to-be-recommended service/item with the shortest path distance from the given user and then aggregates their ratings, with the trust values between the given user and these neighbours as weights. MoleTrust [6] is similar to TidalTrust but only considers the raters within the limit of a given maximum depth. The maximum depth is independent of any specific user and item.

3. Proposed System

In a trust-driven recommendation [7] paradigm, the trust relations among users form a social network. Each user invokes several web services and rates them according to the interaction experiences. When a user needs recommendations, the system predicts the ratings that the user might provide and then recommends services with high predicted ratings. Hence, the target of the recommendation system is to predict users' ratings on services by analysing the social network and user-service rating records.

There is a set of users U = {u1, u2, ..., um} and a set of services S = {s1, s2, ..., sn} in a trust-driven recommendation system. The ratings expressed by users on services/items are given in a rating matrix R = [Ru,s]mxn. In this matrix, Ru,s denotes the rating of user u on service (or item) s. Ru,s can be any real number, but often ratings are integers in the range [1, 5]. In this paper, without loss of generality, we map the ratings 1, ..., 5 to the interval [0, 1] by normalizing the ratings. In a social rating network, each user u has a set Su of direct neighbours, and tu,v denotes the value of social trust u has on v, as a real number in [0, 1]. Zero means no trust, and one means full trust. Binary trust networks are the most common trust networks (Amazon, eBay, etc.). The trust values are given in a matrix T = [Tu,v]mxm. Non-zero elements Tu,v in T denote the existence of a social relation from u to v. Note that T is asymmetric in general [8].

Fig. 1. Illustration of trust-driven recommendation approach (users in an e-commerce network connected by trust relations, rating web services/items)
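The formal setup above (a rating matrix R with ratings in 1–5 normalised to [0, 1], and an asymmetric 0/1 trust matrix T) can be made concrete with a toy instance; all values below are hypothetical.

```python
# Toy instance: m = 3 users, n = 2 services. Raw ratings are in 1..5
# (0 marks an unrated entry); T is the asymmetric 0/1 trust matrix,
# where T[u][v] = 1 means user u trusts user v.
R = [
    [5, 0],   # user 0 rated service 0 with 5, never rated service 1
    [3, 4],
    [0, 1],
]
T = [
    [0, 1, 0],   # user 0 trusts user 1 ...
    [0, 0, 1],
    [1, 0, 0],   # ... but user 1 does not trust user 0 back
]

def normalize(r):
    """Map a raw rating in 1..5 onto the interval [0, 1]."""
    return (r - 1) / 4.0

# None marks missing entries after normalisation.
R_norm = [[normalize(r) if r else None for r in row] for row in R]
```

The asymmetry of T (user 0 trusts user 1, not vice versa) is exactly the property noted in the text; the normalisation sends 1 to 0.0, 3 to 0.5 and 5 to 1.0.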


Thus, the task of a trust-driven service recommendation system is as follows: given a user u0 ∈ U and a service s ∈ S for which Ru0,s is unknown, predict the rating for u0 on service s using R and T. This is done by first determining a degree of trust between users in the social network to obtain a weighted social network from the Epinions data, using the 0/1 trust relation from the input dataset and cosine similarity measures of user and service latent features. Then, a random walk performed on this weighted network yields a resultant predicted rating. Ratings over multiple iterations are polymerized to obtain the final predicted ratings.

3.1 Trust-Driven Recommendation Approach

Incorporating trust metrics in a social network does not absolutely affect the target user's ratings, because the target user and trusted users might differ in interests, preferences and perception. The concept of trust relevancy considers both the trust relations between users and the similarities between users. This section presents our approach to trust-driven service recommendations in detail. First, we define the concept of trust relevancy, on which our recommendation algorithm is based. Then, we introduce the algorithm HybridTrustWalker by extending the random walk algorithm in [7]. Lastly, the predicted ratings are returned. The methodology is summarized in Fig. 2 (user set, service set, social network and ratings feed the trust relevancy calculation; the resulting weighted network and user/service features drive the random walk, yielding one result per iteration; ratings prediction is followed by a termination-condition-based rating finalization).

Fig. 2. Proposed Methodology

Given users u and v, the trust relevancy between u and v is as follows:

tr(u,v) = simU(u,v) * t(u,v)    (1)

Here, simU(u,v) is the similarity of users u and v, and t(u,v) is the degree of trust of u towards v. By computing the trust relevancy between all connected users in a social network, we can obtain a weighted trust network (SN+), where the weight of each edge is the value of trust relevancy. The aim of calculating trust relevancy is to determine the degree of association between trusted neighbours.

In RSs, the user-item/service rating matrix is usually very large in terms of dimensionality, but most of the score data is missing. Therefore, matrix factorization (MF) has been widely utilized in recommendation research to improve efficiency by dimension reduction [9]. For an m x n user-service rating matrix R, the purpose of matrix factorization is to decompose R into two latent feature matrices of users and items with a lower dimensionality d such that

R ≈ PQ^T    (2)

where P ∈ R^(mxd) and Q ∈ R^(nxd) represent the user and item latent feature matrices, respectively. Each row of the respective matrix represents a user or service latent feature vector. After decomposing the matrix, we use the cosine similarity measure to calculate the similarity between two users. Given the latent feature vectors of two users, u and v, their similarity is calculated as follows:

simU(u,v) = cos(u,v) = (u · v) / (||u|| ||v||)    (3)

where u and v are the latent feature vectors of users u and v.

3.2 Recommendation Algorithm

The HybridTrustWalker algorithm attains a final result through multiple iterations. For each iteration, the random walk starts from the target user u0 in the weighted trust network SN+. In the kth step of the random walk in the trust network, the process will reach a certain node u. If user u has rated the to-be-recommended service s, then the rating of s from user u is directly used as the result for the iteration. Otherwise, the process has two options, one of which is: the random walk stops at the current node u with a certain probability φu,s,k. Then, a service si is selected from RSu (the set of services rated by u) based on the probability Fu(si), and the rating of si from u is the result for the iteration.

The probability that the random walk stops at user u in the kth step is affected by the similarity of the items that u has rated to the to-be-recommended service s: the more similar the rated items are to s, the greater the probability to stop. Furthermore, a larger distance between the user u and the target user u0 can introduce more noise into the prediction. Therefore, the value of the probability φu,s,k should increase as k increases [10].
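Equations (1)–(3) can be illustrated in a few lines. The user latent feature matrix P below is a hypothetical stand-in (in practice P and Q are learned by factorising R, e.g. by gradient descent, which is not shown); the point is that the edge weight tr(u, v) discounts a raw trust statement by latent-feature similarity.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two latent feature vectors (cf. Eq. 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def trust_relevancy(pu, pv, t_uv):
    """tr(u, v) = simU(u, v) * t(u, v)  (cf. Eq. 1)."""
    return cosine(pu, pv) * t_uv

# Hypothetical user latent feature matrix P with d = 2 (made-up values).
P = [
    [0.9, 0.1],   # user u
    [0.8, 0.2],   # user v: similar tastes to u
    [0.1, 0.9],   # user w: very different tastes
]
t = 1.0  # u fully trusts both v and w in the 0/1 network

tr_uv = trust_relevancy(P[0], P[1], t)
tr_uw = trust_relevancy(P[0], P[2], t)
```

Although u trusts v and w equally in the binary network, tr_uv comes out much larger than tr_uw, which is what makes the walk in Section 3.2 prefer like-minded neighbours.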


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

Thus, the calculation for φu,s,k is as follows:

φu,s,k = max(si∈RSu) simS(si, s) × 1 / (1 + e^(−k/2))    (4)

where simS(si, s) is the similarity between the services si and s. The sigmoid function of k provides a value close to 1 for large values of k and a small value for small values of k. In contrast to collaborative filtering techniques [2], this method can cope with services that do not have ratings from common users. Service similarities are calculated using matrix factorization [8]:

simS(si, sj) = cos(si, sj) = (si · sj) / (||si|| ||sj||)    (5)

When it is determined that user u is the terminating point of the walk, the method needs to select one service si from RSu; the rating of si from u is the outcome for the iteration. The probability Fu(si) of choosing service si is calculated according to the following formula:

Fu(si) = simS(si, s) / Σ(sj∈RSu) simS(sj, s)    (6)

Services are selected through roulette-wheel selection [11]; that is, services with higher values of Fu(si) are more likely to be selected. Also, adopting the "six degrees of separation" principle [12] by setting the maximum number of steps of each walk to 6 prevents infinite looping of the random walk.

The other option during the walk, if the user u has not rated the to-be-recommended service s, is:

 The walk continues with a probability of 1 − φu,s,k. In this case, a target node for the next step is selected from the set of trusted neighbours of the user u.

To distinguish different users' contributions to the recommendation prediction, we propose that the target node v for the next step from the current user u is selected according to the following probability:

Eu(v) = tr(u, v) / Σ(w∈TUu) tr(u, w)    (7)

where tr(u, v) is the trust relevancy introduced earlier. The trust relevancy guarantees that each step of the walk will choose the user that is more similar to the current user, making the recommendation more accurate and thus enhancing productivity and user acceptance.

3.3 HybridTrustWalker Algorithm

Input: U (user set), S (service set), R (rating matrix), SN+ (weighted social network), u0 (the target user), s (to-be-recommended service).
Output: r (predicted rating).

Pseudocode:

1  set k = 1 ;          // the step of the walk
2  set u = u0 ;         // set the start point of the walk as u0
3  set max-depth = 6 ;  // the max step of the walk
4  set r = 0 ;
5  while (k <= max-depth) {
6    u = selectUser(u) ;  // select v from TUu as the target of the next step based on the probability Eu(v)
7    if (u has rated s) {
8      r = ru,s ;
9      return r ;
10   }
11   else {
12     if (random(0,1) < φu,s,k || k == max-depth) {  // stop at the current node
13       si = selectService(u) ;  // service si is selected from RSu based on the probability Fu(si)
14       r = ru,si ;
15       return r ;
16     }
17     else
18       k++ ;
19   }
20 }
21 return r ;

Fig. 3. Example of HybridTrustWalker

Fig. 3 shows an example to illustrate the algorithm clearly. The weight of each edge represents the probability Eu(v). Suppose the service s3 is to be recommended for the user u1. For the first step of the walk, u2 is more likely to be selected as the target node since the value of Eu(u2) is larger. If u2 has rated s3 with the rating r, r will be returned as the result of this walk (Lines 7–9). Otherwise, if the termination condition (Line 12) is not reached, the walk continues. For the second step, u5 is selected, and again it is checked whether u5 has rated s3. If u5 has not rated s3 but the termination condition is reached, the most similar service to s3 is selected from the items u5 has rated (Line 13). Then, the rating of the selected service by u5 is returned as the result of this walk.
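One iteration of the walk can be sketched as runnable Python. This is a minimal sketch, assuming trust relevancies, ratings, and a service-similarity function are supplied as plain dictionaries and callables; all identifiers below are ours, not the paper's:

```python
import math
import random

def stop_probability(u, s, k, ratings, sim_s):
    """phi_{u,s,k} (Eq. 4): max similarity of u's rated services to s,
    damped by a sigmoid in the step number k."""
    rated = ratings.get(u, {})
    if not rated:
        return 0.0
    max_sim = max(sim_s(si, s) for si in rated)
    return max_sim * (1.0 / (1.0 + math.exp(-k / 2.0)))

def roulette(weights):
    """Roulette-wheel selection [11]: pick a key with probability
    proportional to its weight."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    acc = 0.0
    for key, w in weights.items():
        acc += w
        if acc >= r:
            return key
    return key  # numeric safety net

def single_walk(u0, s, trust_rel, ratings, sim_s, max_depth=6):
    """One iteration of HybridTrustWalker; returns a rating or None."""
    u, k = u0, 1
    while k <= max_depth:
        neighbours = trust_rel.get(u, {})
        if neighbours:
            u = roulette(neighbours)          # step chosen by Eu(v), Eq. 7
        if s in ratings.get(u, {}):
            return ratings[u][s]              # u has rated s directly
        if random.random() < stop_probability(u, s, k, ratings, sim_s) or k == max_depth:
            rated = ratings.get(u, {})
            if not rated:
                return None
            weights = {si: sim_s(si, s) for si in rated}   # Fu(si), Eq. 6
            return rated[roulette(weights)]
        k += 1
    return None
```

For example, if the target user's only trusted neighbour has rated the service, the walk reaches that neighbour in one step and returns the neighbour's rating directly, mirroring Lines 6–9 of the pseudocode.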



3.4 Ratings Prediction

The HybridTrustWalker algorithm attains a final result through multiple iterations. The final predicted rating is obtained by aggregating the results returned from every iteration:

pu0,s = (1/n) Σ(i=1..n) ri    (8)

where ri is the result of each iteration and n is the number of iterations.

To obtain a stable predicted result, the algorithm needs to perform an adequate number of random walks. We decide the termination condition of the algorithm through the calculation of the variance of the prediction values. The variance of the prediction results after a random walk is denoted and calculated as:

σi² = (1/i) Σ(j=1..i) (rj − r̄)²    (9)

where rj is the result of every iteration, i is the total number of iterations until the current walk, and σi² is the variance obtained from the last i iterations, which eventually tends to a stable value. When |σi+1² − σi²| ≤ ε, the algorithm terminates (ε = 0.0001).

4. Results and Discussion

We use the dataset of Epinions published by the authors of [11]. Its large size and characteristically sparse user-item rating matrix make it suitable for our study. It contains data of 49,290 users who have rated 139,738 items. There are a total of 664,824 ratings and 487,181 trust relations within the network.

We adopt the Root Mean Squared Error (RMSE), which is widely used in recommendation research, to measure the error in recommendations:

RMSE = sqrt( Σ(u,s) (Ru,s − R̂u,s)² / N )    (10)

where Ru,s is the actual rating the user u gave to the service s, R̂u,s is the predicted rating the user u gave to the service s, and N denotes the number of tested ratings. The smaller the value of RMSE, the more precisely the recommendation algorithm performs.

We use the coverage metric to measure the percentage of (u, s) pairs for which a predicted value can be generated:

Coverage = S / N    (11)

where S denotes the number of predicted ratings and N denotes the number of tested ratings. We have to convert RMSE into a precision metric in the range of [0, 1]. The precision is denoted as follows:

precision = 1 − RMSE / 4    (12)

To combine RMSE and coverage into a single evaluation metric, we compute the F-Measure as follows:

F-Measure = 2 × precision × coverage / (precision + coverage)    (13)

A comparison analysis of the performance measures for various recommender-system paradigms, including collaborative filtering approaches, follows.

Table 1: Comparing results for all users

Algorithms          RMSE    Coverage (%)   F-measure
Item based CF       1.345   67.58          0.6697
User based CF       1.141   70.43          0.7095
Tidal trust         1.127   84.15          0.7750
Mole trust          1.164   86.47          0.7791
Trust Walker        1.089   95.13          0.8246
HybridTrustWalker   1.012   98.21          0.8486

Fig. 4. Comparing results of all users (bar chart of precision, coverage and F-measure).

The reduction of precision of the proposed model is compensated by the increased coverage and F-measure, as shown in Table 1 and Table 2 (the latter for cold-start users).
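Equations (12) and (13) can be checked directly against Table 1. The sketch below assumes ratings lie on a 1-5 scale, so the largest possible error in Eq. (12) is 4; with that assumption, the published F-measures are reproduced:

```python
def precision_from_rmse(rmse, max_error=4.0):
    """Eq. 12: map RMSE into a precision value in [0, 1], assuming
    ratings on a 1-5 scale (maximum possible error of 4)."""
    return 1.0 - rmse / max_error

def f_measure(rmse, coverage):
    """Eq. 13: harmonic mean of precision and coverage."""
    p = precision_from_rmse(rmse)
    return 2 * p * coverage / (p + coverage)

# Reproduces Table 1, e.g. item-based CF: RMSE 1.345, coverage 67.58%.
print(round(f_measure(1.345, 0.6758), 4))   # → 0.6697
print(round(f_measure(1.012, 0.9821), 4))   # → 0.8486 (HybridTrustWalker)
```

The harmonic mean penalizes an algorithm that does well on only one of the two criteria, which is why the trust-based walkers, with far higher coverage, dominate the plain CF baselines despite comparable precision.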



Table 2: Comparing results of cold-start users

Algorithms          RMSE    Coverage (%)   F-measure
Item based CF       1.537   23.14          0.3362
User based CF       1.485   18.93          0.2910
Tidal trust         1.237   60.75          0.6463
Mole trust          1.397   58.29          0.6150
Trust Walker        1.212   74.36          0.7195
HybridTrustWalker   1.143   79.64          0.7531

Fig. 5. Comparing results of cold-start users (bar chart of precision, coverage and F-measure).

This means the ratings from the largest number of relevant users are considered during the rating prediction in each step of the walk in HybridTrustWalker. For cold-start users (Fig. 5), item-based and user-based CF perform poorly; they have the highest RMSE and the lowest coverage of all the algorithms considered in the analysis. Due to the introduction of the trust factor, TidalTrust, MoleTrust and TrustWalker have improved coverage compared to CF, whereas precision does not change much.

5. Conclusion

The proposed recommendation system has three main objectives: (1) tackling the problem of recommendations with cold-start users; (2) addressing the problem of recommendations with a large and sparse user-service rating matrix; and (3) solving the problem of incorporating trust relations in a recommendation system. Thus, the main contributions of HybridTrustWalker presented in this paper include introducing the concept of trust relevancy, which is used to obtain a weighted social network. Furthermore, for this model, we develop a hybrid random walk algorithm. Existing methods usually select the target node for each step randomly when choosing to walk. By contrast, the proposed approach selects the target node based on trust and similarity. Thus, the recommendation contribution from trusted users is more accurate. We also utilize large-scale real data sets to evaluate the accuracy of the algorithm. The experimental results show that the proposed method can be directly applied in existing e-commerce networks with improved accuracy. Personalized service recommendation systems have been heavily researched in recent years, and the proposed model provides an effective solution.

We believe that there is scope for improvement. For example, here the trust relationships between users in the social trust network are considered invariant, but in reality the trust relationship between users can change over time. In addition, user ratings are also time sensitive; as a result, ratings that are not up to date may become noise for recommendations. In large user communities, it is only natural that, besides trust, distrust also starts to emerge. Hence, the more users issue distrust statements, the more interesting it becomes to incorporate this new information as well. Therefore, we plan to include time sensitivity and the distrust factor in our future work.

Acknowledgments

The authors sincerely thank the authorities of the Education and Research Center, Indian Institute of Science, for the encouragement and support during the entire course of this work.

References
[1] F. Ricci, L. Rokach, and B. Shapira, Introduction to Recommender Systems Handbook, Springer US, 2011.
[2] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model", in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 426-434.
[3] S. Deng, L. Huang, and G. Xu, "Social network-based service recommendation with trust enhancement", Expert Systems with Applications, Vol. 41, No. 18, 2014, pp. 8075-8084.
[4] L. Shao, J. Zhang, Y. Wei, J. Zhao, B. Xie, and H. Mei, "Personalized QoS prediction for web services via collaborative filtering", in Web Services, 2007 (ICWS 2007), IEEE International Conference on, IEEE, 2007, pp. 439-446.
[5] J. A. Golbeck, "Computing and applying trust in web-based social networks", 2005.
[6] P. Massa and P. Avesani, "Trust-aware recommender systems", in Proceedings of the 2007 ACM Conference on Recommender Systems, ACM, 2007, pp. 17-24.
[7] M. Jamali and M. Ester, "TrustWalker: a random walk model for combining trust-based and item-based recommendation", in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 397-406.
[8] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Application of Dimensionality Reduction in Recommender System - A Case Study, No. TR-00-043, University of Minnesota, Minneapolis, Dept. of Computer Science, 2000.
[9] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems", Computer, Vol. 8, 2009, pp. 30-37.
[10] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization", Advances in Neural Information Processing Systems 21 (NIPS 21), Vancouver, Canada, 2008.
[11] A. Lipowski and D. Lipowska, "Roulette-wheel selection via stochastic acceptance", Physica A: Statistical Mechanics and its Applications, Vol. 391, No. 6, 2012, pp. 2193-2196.
[12] S. Milgram, "The small world problem", Psychology Today, Vol. 2, No. 1, 1967, pp. 60-67.

Mr. Pavan Kumar K N obtained his B.E. degree with distinction in Information Science and Engineering from Visvesvaraya Technological University. Presently he is taking up the position of Trainee Decision Scientist in Mu Sigma, Bangalore, India. His areas of interest include Data Analytics and Cyber Security.

Ms. Samhita S Balekai received her B.E. degree in Information Science and Engineering from Visvesvaraya Technological University. She secured an offer for the position of Software Engineer in Accenture, Bangalore, India. Her areas of interest are Data Analytics, Social Networks, Data Warehousing and Mining.

Ms. Sanjana P Suryavamshi was awarded her B.E. degree with distinction in Information Science and Engineering from Visvesvaraya Technological University. Presently she is employed as a Software Engineer in Tata Consultancy Services (TCS), Bangalore, India. Her areas of interest are Networks and Cyber Security.

Ms. Sneha Sriram earned her B.E. degree in Information Science and Engineering from Visvesvaraya Technological University. She is pursuing her M.S. degree in Information Technology Management from the University of Texas, Dallas. Her areas of interest are Enterprise Systems and Information Technology.

Dr. R. Bhakthavathsalam is presently working as a Principal Research Scientist in SERC, Indian Institute of Science, Bangalore. His areas of interest are Pervasive Computing and Communication, Wireless Networks and Electromagnetics with a special reference to exterior differential forms. He held the position of Fellow of Jawaharlal Nehru Centre for Advanced Scientific Research during 1993-1995. He is a Member of the IEEE Communication Society, ACM and CSI.



Correlated Appraisal of Big Data, Hadoop and MapReduce

Priyaneet Bhatia1, Siddarth Gupta2

1 Department of Computer Science and Engineering, Galgotias College of Engineering and Technology Uttar Pradesh Technical University Greater Noida, Uttar Pradesh 201306, India [email protected]

2Department of Computer Science and Engineering, Galgotias University Greater Noida, Uttar Pradesh 203208, India [email protected]

Abstract
Big data has become imperative globally. Gargantuan data volumes, from terabytes to petabytes, are used incessantly, but storing these datasets is an arduous task. Although conventional database mechanisms were integral for storing intricate and immeasurable datasets, it is the NoSQL approach that is able to accumulate such prodigious information in a proficient style. Furthermore, the Hadoop framework, which has numerous components, is used. One of its foremost constituents is MapReduce, the programming model with which purposive knowledge is mined. In this paper, the postulates of big data are discussed. Moreover, the Hadoop architecture is shown as a master-slave procedure that distributes jobs evenly in a parallel style. MapReduce is epitomized with the help of an algorithm, using WordCount as the criterion for mapping and reducing datasets.
Keywords: Big Data, Hadoop, MapReduce, RDBMS, NoSQL, WordCount

1. Introduction

As an immense amount of data is being generated day by day, storing this huge information efficiently becomes a painful task [1]. Therefore, exabytes or petabytes of data, known as big data, need to be scaled down to smaller datasets through an architecture called Hadoop. Apache Hadoop is an open source framework on which big data is processed with the help of MapReduce [2]. A programming model, MapReduce applies a basic divide-and-conquer technique through its map and reduce functions. On copious datasets it processes key/value pairs to generate intermediate key/value pairs and then merges the intermediate values to form smaller sets of key/value pairs [3][4]. Reduced data bytes of the massive information are thus produced.

The rest of the paper is formulated as follows. Section 2 covers the concepts of big data, its 4 Vs and its applications in real-world scenarios. Section 3 describes the comparison between RDBMS and NoSQL and why NoSQL rather than RDBMS is used in today's world. Section 4 explores Apache Hadoop in detail and its use in big data. Section 5 analyzes the MapReduce paradigm, its use in Hadoop, and its significance in enormous data reduction. Section 6 explicates the table comparisons of big data and Hadoop from various survey papers. Finally, the paper is concluded.

2. Big Data Concepts

2.1 Outline of Big Data

Let's start with big data. What is big data? Why has it created a buzz? Why is big data so essential in our daily chores of life? Where is it used? All these questions have made everyone curious. Big data is actually a collection of large and complex datasets that has become very difficult to handle using traditional relational database management tools [5].

2.2 Four Vs of Big Data

Figure 1: 4Vs of Big Data (Volume, Velocity, Variety, Veracity)

Big data has 4 characteristics, shown in Fig. 1:



i. Volume: refers to the amount of space or quantity of something. Since data is hugely scaled and complex in size, even larger than 1024 terabytes, it becomes a challenge to extract relevant information using traditional database techniques. E.g., the world produces 2.5 quintillion bytes in a year.

ii. Variety: represents the state of being varied or diversified. All types of formats are available: structured, unstructured, semi-structured data, etc. These varieties of formats need to be searched, analyzed, stored and managed to obtain the useful information. E.g., geospatial data, climate information, audio and video files, log files, mobile data, social media, etc. [6].

iii. Velocity: the rate of speed with which something happens. Dealing with bulky, spatially dimensional data streaming in at an eccentric rate is still an eminent challenge for many organizations. E.g., Google produces 24 TB/day; Twitter handles 7 TB/day.

iv. Veracity: refers to the accuracy of the information extracted. Here, data is mined for profitable purposes [7].

2.3 Big Data in Real World Scenarios

a) Facebook generates 10-20 billion photos, which is approximately equivalent to 1 petabyte.
b) Earlier, hard-copy photographs took around 10 gigabytes of space in a Canon camera. Nowadays, a digital camera produces more than 35 times the photographic data the old camera used to, and this is increasing day by day [8].
c) 72 hours of video are uploaded to YouTube every minute.
d) Data produced by Google is approximately 100 petabytes per month [9].

3. RDBMS VS NOSQL

3.1 RDBMS

For several decades, the relational database management system has been the contemporary benchmark in various database applications. It organizes data in a well-structured pattern. Unfortunately, ever since the dawn of the big data era, information comes mostly in unstructured dimensions. This has left the traditional database system unable to handle prodigious data storage. In consequence, it is not opted for as a scalable solution to meet the demands of big data [10].

3.2 NoSQL

NoSQL, commonly referring to "Not Only SQL", has become a necessary replacement for RDBMS, since its main characteristics focus on data duplication and unstructured schemas. It allows unstructured compositions to be stored and replicated across multiple servers for future needs. Consequently, no slowdown in performance occurs, unlike in RDBMS. Companies such as Facebook, Google and Twitter use NoSQL for its high performance, scalability and availability of data with regard to the expectations of their users [11].

4. Hadoop

4.1 Hadoop in Brief

Hadoop was created by Doug Cutting and Mike Cafarella in 2005 [12]. It was named after Cutting's son's toy elephant [13]. It comprises 2 components and other project libraries such as Hive, Pig, HBase, ZooKeeper, etc.:

a. HDFS: open source data storage architecture with fault-tolerant capacity.
b. MapReduce: programming model for distributed processing that works with all types of datasets [14].

4.2 Motive behind Hadoop in Big Data

One might worry that since RDBMS is a dwindling technology, it cannot be used in big data processing; however, Hadoop is not a replacement for RDBMS. Rather, it is a supplement to it. It adds characteristics to RDBMS features to improve the efficiency of database technology. Moreover, it is designed to solve the different sets of data problems that the traditional database system is unable to solve.

4.3 CAP Theorem for Hadoop

The CAP theorem, shown in Fig. 2, concerns consistency, availability and partition tolerance:

a) Consistency: simultaneous transactions need to be kept consistent, e.g. when withdrawing from an account and saving into the account.
b) Availability: flexibility in making multiple copies of data. If one copy goes down, another is still accessible for future use.
c) Partitioning: partitioning the data into multiple copies for storage on commodity hardware. By default, 3 copies are normally present. This makes access easily feasible for customers [15].
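The 3-copy default in (c) can be illustrated with a toy placement routine. This is a sketch of the idea only, not how HDFS actually chooses replica locations, and all names in it are hypothetical:

```python
from itertools import cycle

def place_replicas(blocks, nodes, replication=3):
    """Toy placement: assign each data block to `replication` distinct
    nodes, round-robin, so copies survive a single node failure."""
    placement = {}
    ring = cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

placement = place_replicas(["blk1", "blk2"], ["n1", "n2", "n3", "n4"])
# Every block lands on 3 distinct nodes; losing one node leaves 2 copies.
print(placement["blk1"])
```

With 3 replicas, any single commodity-hardware failure still leaves two live copies, which is exactly the availability property described in (b).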



Figure 2: CAP Theorem (diagram of consistency, availability and partitioning)

4.4 Hadoop Business Problems

i. Marketing analysis: market surveys are used to understand consumer behaviour and improve the quality of a product. Many companies use feedback surveys to study shopper attitudes.
ii. Purchaser analysis: it is best to understand the interests of the current customer rather than the new one. Therefore, the best approach is to collect as much information as possible to analyze what a buyer was doing before leaving the shopping mall.
iii. Customer profiling: it is essential to identify specific groups of consumers having similar interests and preferences in purchasing goods from the markets.
iv. Recommendation portals: online shopping browsers collect data not only from your own activity but also from users who match your profile, so that search engines can recommend websites that are likely to be useful to you. E.g., Flipkart, Amazon, Paytm, Myntra, etc.
v. Ads targeting: we all know ads are a great nuisance when we are doing online shopping, but they stay with us. Ad companies put their ads on popular social media sites so they can collect large amounts of data about what we are doing when we are actually shopping [16].

5. MapReduce

5.1 Understanding MapReduce

MapReduce is a programming paradigm, developed by Google, which is designed to solve a single problem. It is basically used as an implementation procedure to process large datasets by using map and reduce operations [17].

5.2 Principles of MapReduce

a. Lateral computing: provides parallel data processing across the nodes of clusters using the Java-based API. It works on commodity hardware and copes with hardware failure.
b. Programming languages: uses Java, Python and R for coding mapper and reducer executables when creating and running jobs.
c. Data locality: the ability to move the computation close to where the data is. That means Hadoop will schedule MapReduce tasks close to where the data on which a node will work exists. The idea of bringing the compute to the data, rather than bringing the data to the compute, is the key to understanding MapReduce.
d. Fault tolerance with shared nothing: the Hadoop architecture is designed so that tasks have no dependency on each other. When node failure occurs, the MapReduce jobs are retried on other healthy nodes. This prevents delays in the performance of any task. Moreover, these node failures are detected and handled automatically, and programs are restarted as needed [18].

5.3 Parallel Distributed Architecture

MapReduce is designed as a master-slave framework, shown in Fig. 3, which works with job and task trackers. The master is the JobTracker, which coordinates execution of the mappers and reducers over the set of data. On each slave node, a TaskTracker executes either the map or the reduce task. Each TaskTracker reports its status to its master.

Figure 3: Master Slave Architecture (JobTracker (master) over TaskTrackers (slaves 1-3))

5.4 Programming Model

MapReduce consists of 2 parts: the map and reduce functions.

a) Map part: this is the 1st part of MapReduce. When MapReduce runs as a job, the mapper will run on each node where the data resides. Once it gets executed, it will
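The shared-nothing retry behaviour in (d) can be mimicked with a toy scheduler. The function and node names below are hypothetical, and real Hadoop's JobTracker is far more involved; this only illustrates the "reschedule on a healthy node" idea:

```python
def run_job(tasks, nodes, failed_nodes):
    """Toy JobTracker loop: assign each task to a node; if that node has
    failed, retry the task on a healthy node. Shared-nothing means a
    retry needs no state from the failed attempt."""
    results = {}
    healthy = [n for n in nodes if n not in failed_nodes]
    for i, task in enumerate(tasks):
        node = nodes[i % len(nodes)]
        if node in failed_nodes:                  # failure detected
            node = healthy[i % len(healthy)]      # reschedule elsewhere
        results[task] = node
    return results

assignment = run_job(["map1", "map2", "map3"],
                     ["slave1", "slave2", "slave3"],
                     failed_nodes={"slave2"})
print(assignment)  # map2 is retried on a healthy slave
```

Because no task depends on another task's intermediate state, rerunning map2 elsewhere is always safe, which is the point of the shared-nothing design.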



create a set of <key, value> pairs on each node.

b) Reduce part: in the 2nd part of MapReduce, the reducer will execute on some nodes, not all the nodes. It will create aggregated sets of <key, value> pairs on these nodes. The output of this function is a single combined list.

Figure 4: MapReduce Paradigm

Figure 4 displays the MapReduce prototype, comprised of three nodes under Map. Variegated categories of data are represented through numerous colors in the Map. In essence, these nodes run on three separate machines, i.e. the commodity hardware; thus, the chunks of information are processed on discrete machines. Furthermore, in the intermediate portion of this model resides the shuffle, which is in fact quite complicated and hence a key aspect of MapReduce. How does this list come out of the Map function and get aggregated into the Reduce function? Is this an automated process, or does some code have to be written? In reality, this paradigm is a mixture of both: whenever MapReduce jobs are written, the default implementations of shuffle and sort are applied. All these mechanisms are tunable; one may accept the defaults or tune them according to one's own convenience.

5.5 Word Count

The Hello World of MapReduce programs is WordCount. It comes from Google trying to solve the data problem of counting all the words on the Web, and it is the de facto standard for starting with Hadoop programming. It takes some text as input and produces a list of words with their counts. Following is the pseudocode of WordCount [19][20]:

Mapper (filename, file contents):
  For each word in file contents:
    Emit (word, 1)

Reducer (word, values):
  Sum = 0
  For each value in values:
    Sum += value
  Emit (word, Sum)

The pseudocode contains the mapper and the reducer. The mapper takes the filename and file contents and loops over each word; each word is emitted with the value 1 (basically, splitting occurs). The reducer takes the output from the mapper and produces lists of keys and values; in this case, the keys are the words and the value is the total occurrence count of that word. The sum starts at zero as an initializer, and for each value in values, the value is added to the sum. Then, the aggregated count is emitted.

5.6 Example [21]

Consider the question of counting the occurrence of each word in a collection of large documents. Let's take 2 input files and perform the MapReduce operation on them.

File 1: bonjour sun hello moon goodbye world
File 2: bonjour hello goodbye goodluck world earth

Map:

First map:        Second map:
<bonjour, 1>      <bonjour, 1>
<sun, 1>          <hello, 1>
<hello, 1>        <goodbye, 1>
<moon, 1>         <goodluck, 1>
<goodbye, 1>      <world, 1>
<world, 1>        <earth, 1>

Reduce:

<bonjour, 2>
<sun, 1>
<hello, 2>
<moon, 1>
<goodbye, 2>
<goodluck, 1>
<world, 2>
<earth, 1>
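The WordCount pseudocode and the example of Section 5.6 can be reproduced with a small, self-contained Python version, where the `shuffle` helper stands in for the framework's automatic shuffle/sort step:

```python
from collections import defaultdict

def mapper(file_contents):
    """Emit (word, 1) for each word, as in the WordCount pseudocode."""
    return [(word, 1) for word in file_contents.split()]

def shuffle(mapped):
    """Group intermediate pairs by key (the framework's shuffle/sort)."""
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    return groups

def reducer(word, values):
    """Sum the emitted values for one word."""
    return word, sum(values)

file1 = "bonjour sun hello moon goodbye world"
file2 = "bonjour hello goodbye goodluck world earth"
mapped = mapper(file1) + mapper(file2)           # two map tasks
counts = dict(reducer(w, vs) for w, vs in shuffle(mapped).items())
print(counts["bonjour"], counts["goodluck"])     # → 2 1
```

The output matches the Reduce list in Section 5.6: words appearing in both files (bonjour, hello, goodbye, world) get count 2 and the rest get count 1.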



5.7 Algorithm Steps [22]

a) Map step: takes key/value pairs of input data and transforms them into an intermediate list of output key/value pairs:

map (k1, v1) → list(k2, v2)    (1)

b) Reduce step: after being shuffled and sorted, the intermediate key/value pairs are passed through a reduce function, where the values for each key are merged together to form a smaller set of values:

reduce (k2, list(v2)) → list(v2)    (2)

6. Tabular Comparisons on Big Data and Hadoop

Table 1: Approach on Big Data

S.No | Author's Name | Year | Approach on Big Data | Results/Conclusion
1 | Puneet Singh Duggal et al | 2013 | Big Data analysis tools, Hadoop, HDFS, MapReduce | Used for storing and managing Big Data. Helps organizations to better understand customers and markets.
2 | Min Chen et al | 2014 | Cloud computing, Hadoop | Focus on the 4 phases of the value chain of Big Data, i.e., data generation, data acquisition, data storage and data analysis.
3 | P. Sarada Devi et al | 2014 | Hadoop, extract transform load (ETL) tools like ELT, ELTL | Introduces the ETL process in taking business intelligence decisions in Hadoop.
4 | Poonam S. Patil et al | 2014 | RDBMS, NoSQL, Hadoop, MapReduce | Studies challenges in dealing with big data. Gives flexibility to use any language to write algorithms.
5 | K. Arun et al | 2014 | Mining techniques like association rule learning, clustering, classification | Studies big data classifications according to business needs. Helps in decision making in a business environment by implementing data mining techniques.

Table 2: Approach on Hadoop

S.No | Author's Name | Year | Approach on Hadoop | Results/Conclusion
1 | Mahesh Maurya et al | 2011 | MapReduce, LinkCount, WordCount | Experimental setup to count the number of words and links available in a Wikipedia file. Results depend on data size and Hadoop cluster.
2 | Puneet Duggal et al | 2013 | HDFS, MapReduce, joins, indexing, clustering, classification | Studied MapReduce techniques implemented for Big Data analysis using HDFS.
3 | Shreyas Kudale et al | 2013 | HDFS, MapReduce, ETL, Associative Rule Mining | Hadoop is not an ETL tool but a platform that supports ETL processes in parallel.
4 | Poonam S. Patil et al | 2014 | HDFS, MapReduce, HBase, Pig, Hive, Yarn | Parallelized, distributed and fault-tolerant computations.
5 | Prajesh P. Anchalia et al | 2014 | k-means clustering algorithms | Experimental setup for the MapReduce technique on the k-means clustering algorithm, which clustered over 10 million data points.
6 | Radhika M. Kharode et al | 2015 | HDFS, MapReduce, k-means algorithms, cloud computing | Combination of data mining and the k-means clustering algorithm makes data management easier and quicker in a cloud computing model.

7. Conclusion

To summarize, recent literature on various architectures has been surveyed, which helped in the analysis of the reduction of big data, mainly composed of immense knowledge, to simple data in gigabytes or megabytes. The concept of Hadoop and its use in big data have been analyzed, and its major components, HDFS and MapReduce, have been exemplified in detail. Overall, the MapReduce model is illustrated with its algorithm and an example for the readers to understand it clearly. Applications of big data in real-world scenarios have also been elucidated.

Acknowledgements

Priyaneet Bhatia and Siddarth Gupta thank Mr. Deepak Kumar, Assistant Professor, Department of Information Technology, and Rajkumar Singh Rathore, Assistant Professor, Department of Computer Science and Engineering, Galgotia
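The signatures in Eqs. (1) and (2) can be exercised with a generic harness. The link-count example below is ours (loosely inspired by the LinkCount entry in Table 2), and the record format is hypothetical:

```python
from itertools import groupby

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Generic harness for Eqs. (1) and (2):
    map: (k1, v1) -> list(k2, v2); reduce: (k2, list(v2)) -> output."""
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))       # Eq. (1)
    intermediate.sort(key=lambda kv: kv[0])       # shuffle/sort
    output = []
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.append((k2, reduce_fn(k2, [v for _, v in group])))  # Eq. (2)
    return output

# Example: count hyperlinks per page from (page, link) records.
records = [("p1", "a"), ("p2", "b"), ("p1", "c")]
result = run_mapreduce(records,
                       map_fn=lambda page, link: [(page, 1)],
                       reduce_fn=lambda page, ones: sum(ones))
print(result)  # → [('p1', 2), ('p1', 1)] sorted by key: [('p1', 2), ('p2', 1)]
```

Note that `groupby` requires the intermediate list to be sorted by key, which is exactly the role the shuffle/sort phase plays between the map and reduce steps.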


Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org

College of Engineering and Technology, Greater Noida, for their constant support and guidance throughout the course of the whole survey.

References

[1] Shreyas Kudale, Advait Kulkarni and Leena A. Deshpande, "Predictive Analysis Using Hadoop: A Survey", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, Issue 8, 2013, pp 1868-1873.
[2] P. Sarada Devi, V. Visweswara Rao and K. Raghavender, "Emerging Technology Big Data Hadoop Over Datawarehousing, ETL", in International Conference (IRF), 2014, pp 30-34.
[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google, Inc., in USENIX Association OSDI '04: 6th Symposium on Operating Systems Design and Implementation, 2004, pp 137-149.
[4] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Cluster", in OSDI '04: Sixth Symposium on Operating System Design and Implementation, CS 739 Review Blog, 2004, http://pages.cs.wisc.edu/~swift/classes/cs739-sp10/blog/2010/04/mapreduce_simplified_data_proc.html
[5] Munesh Kataria and Ms. Pooja Mittal, "Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql", International Journal of Computer Science and Mobile Computing (IJCSMC), Vol. 3, Issue 7, 2014, pp 759-765.
[6] Jaseena K.U. and Julie M. David, "Issues, Challenges, and Solutions: Big Data Mining", NeTCoM, CSIT, GRAPH-HOC, SPTM - 2014, 2014, pp 131-140.
[7] K. Arun and Dr. L. Jabasheela, "Big Data: Review, Classification and Analysis Survey", International Journal of Innovative Research in Information Security (IJIRIS), Vol. 1, Issue 3, 2014, pp 17-23.
[8] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Yahoo! Press, 2009.
[9] Min Chen, Shiwen Mao and Yunhao Liu, "Big Data: A Survey", Springer, New York, 2014, pp 171-209.
[10] Poonam S. Patil and Rajesh N. Phursule, "Survey Paper on Big Data Processing and Hadoop Components", International Journal of Science and Research (IJSR), Vol. 3, Issue 10, 2014, pp 585-590.
[11] Leonardo Rocha, Fernando Vale, Elder Cirilo, Dárlinton Barbosa and Fernando Mourão, "A Framework for Migrating Relational Datasets to NoSQL", in International Conference on Computational Science, Elsevier, Vol. 51, 2015, pp 2593-2602.
[12] Apache Hadoop, Wikipedia, https://en.wikipedia.org/wiki/Apache_Hadoop
[13] Ronald C. Taylor, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics", in Bioinformatics Open Source Conference (BOSC), 2010, pp 1-6.
[14] Puneet Singh Duggal and Sanchita Paul, "Big Data Analysis: Challenges and Solutions", in International Conference on Cloud, Big Data and Trust, RGPV, 2013, pp 269-276.
[15] Lynn Langit, Hadoop Fundamentals, ACM, 2015, Lynda.com, http://www.lynda.com/Hadoop-tutorials/Hadoop-Fundamentals/191942-2.html
[16] Shaokun Fan, Raymond Y.K. Lau and J. Leon Zhao, "Demystifying Big Data Analytics for Business Intelligence through the Lens of Marketing Mix", Elsevier, ScienceDirect, 2015, pp 28-32.
[17] Mahesh Maurya and Sunita Mahajan, "Comparative analysis of MapReduce job by keeping data constant and varying cluster size technique", Elsevier, 2011, pp 696-701.
[18] Dhole Poonam and Gunjal Baisa, "Survey Paper on Traditional Hadoop and Pipelined Map Reduce", International Journal of Computational Engineering Research (IJCER), Vol. 3, Issue 12, 2013, pp 32-36.
[19] MapReduce, Apache Hadoop, Yahoo Developer Network, https://developer.yahoo.com/hadoop/tutorial/module4.html
[20] Mahesh Maurya and Sunita Mahajan, "Performance analysis of MapReduce programs on Hadoop Cluster", IEEE, World Congress on Information and Communication Technologies (WICT 2012), 2012, pp 505-510.
[21] Ms. Vibhavari Chavan and Prof. Rajesh N. Phursule, "Survey Paper on Big Data", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5, No. 6, 2014, pp 7932-7939.
[22] Radhika M. Kharode and Anuradha R. Deshmukh, "Study of Hadoop Distributed File System in Cloud Computing", International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), Vol. 5, Issue 1, 2015, pp 990-993.

First Author: Priyaneet Bhatia has done her B.Tech in IT from RTU, Jaipur, Rajasthan, India in 2012. Currently, she is pursuing an M.Tech in CSE from Galgotia College of Engineering and Technology, UPTU, Greater Noida, Uttar Pradesh, India. She is working on the project "Big Data in Hadoop MapReduce".

Second Author: Siddarth Gupta has done a B.Tech in CSE from UPTU, Lucknow, Uttar Pradesh, India in 2012. He has completed an M.Tech in CSE from Galgotias University, Greater Noida, Uttar Pradesh, India in May 2015. He is currently working on "Big Data optimization in Hadoop".


Combination of PSO Algorithm and Naive Bayesian Classification for Parkinson Disease Diagnosis

Navid Khozein Ghanad1,Saheb Ahmadi 2

1 Islamic Azad University of Mashhad, Faculty of Engineering, Department of Computer, Mashhad, Iran [email protected]

2 Islamic Azad University of Mashhad, Faculty of Engineering, Department of Computer, Mashhad, Iran [email protected]

Abstract
Parkinson is a neurological disease which quickly affects the human motor organs. Early diagnosis of this disease is very important for its prevention. Using optimum training data and omitting noisy training data will increase the classification accuracy. In this paper, a new model based on the combination of the PSO algorithm and Naive Bayesian Classification is presented for diagnosing Parkinson disease, in which optimum training data are selected by the PSO algorithm for Naive Bayesian Classification. According to the obtained results, Parkinson disease diagnosis accuracy reaches 97.95% using the presented method, which is indicative of the superiority of this method over previous models of Parkinson disease diagnosis.

Keywords: Parkinson disease diagnosis, Naive Bayesian Classification, PSO algorithm

1. Introduction

Parkinson disease is one of the nervous system diseases, which causes quivering and loss of motor skills. Usually this disease occurs in people over 60 years old, and 1 out of 100 individuals suffers from it. However, it is also observed in younger people: about 5 to 10% of patients are at younger ages. After Alzheimer, Parkinson is the second most destructive disease of the nerves. Its cause has not been recognized yet. In the first stages, this disease has significantly low symptoms [1]. It is claimed that 90% of Parkinson patients can be recognized through voice disorders [2]. Parkinson patients have a set of voice disorders by which their disease can be diagnosed. These voice disorders have indices whose measurement can be used for diagnosing the disease [3][4]. In previous studies, the problems of Parkinson disease diagnosis were considered. Using SVM Classification with a Gaussian kernel, the obtained result was 91.4% at best [4]. In order to diagnose the Parkinson disease, a new non-linear model based on Dirichlet process mixtures was presented and compared with SVM Classification and decision trees; at best, the obtained result was 87.7% [5]. In [6], different methods have been used to diagnose the Parkinson disease, in which the best result pertained to the classification using the neural network with 92.9% accuracy. In [7], the best features were selected for SVM Classification, through which 92.7% accuracy could be obtained at best. In [8], using a sampling strategy and an improvement of the multi-class multi-kernel relevance vector machine method, 89.47% accuracy could be achieved. In [9], a combination involving the Expectation Maximization Algorithm could bring 93.01% accuracy for Parkinson disease diagnosis. In [10], using fuzzy entropy measures, the best feature was selected for classification and thereby 85.03% accuracy could be achieved for


classification. In [11], the combination of a non-linear fuzzy method and SVM Classification could detect the speaker's gender with 93.47% accuracy. In [12], the combination of the RF and CFS algorithms could diagnose the Parkinson disease with 87.01% accuracy. In [13], using a parallel forward neural network, Parkinson disease was diagnosed with 91.20% accuracy. In [14], with improvements in OPF Classification, Parkinson disease was diagnosed with 84.01% accuracy. In [15], a fuzzy combination with the Nearest Neighbor Algorithm could achieve 96.07% accuracy. In [16] and [17], by focusing on voice analysis, they attempted to gain 94% accuracy. In the previously presented methods, attempts have been made to offer the best classification methods, and no attention has been paid to the quality of the training data. In this paper, we present a new model based on the combination of the PSO algorithm and Naive Bayesian Classification for diagnosing the Parkinson disease. This algorithm selects the best training data for Naive Bayesian Classification, so that non-optimal training data are not used. Due to using optimum training data and avoiding non-optimal training data, the new model presented in this paper increases the classification accuracy of Parkinson disease diagnosis to 97.95%.

First we consider Naive Bayesian Classification and the PSO algorithm. Then, the presented algorithm, the results and the references will be given.

1.1. Naive Bayesian Classification

One very practical Bayesian learning method is the naive Bayesian learner, generally called the Naive Bayesian Classification method. In some contexts, it has been shown that its efficiency is comparable to that of methods such as neural networks and decision trees.

Naive Bayesian classification can be applied in problems in which each sample x is described by a set of trait values and the objective function f(x) takes values from a set V. The Bayesian method for classifying a new sample is to detect the most probable class, i.e. the target value v_MAP, given the trait values a_1, a_2, ..., a_n that describe the new sample:

v_MAP = argmax_{v_j in V} P(v_j | a_1, a_2, ..., a_n)    (1)

Using Bayes' theorem, term (1) can be rewritten as term (2):

v_MAP = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
      = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j)    (2)

Now, using the training data, we attempt to estimate the two terms of the above equation. Computing the repetition rate of v_j in the training data is easy. However, the computation of the different terms P(a_1, a_2, ..., a_n | v_j) by this method will not be acceptable unless we have a huge amount of training data available. The problem is that the number of these terms is equal to the number of possible samples multiplied by the number of objective function values. Therefore, we would have to observe each sample many times in order to obtain an appropriate estimate.

The naive assumption is that the probability of observing the traits a_1, a_2, ..., a_n together is equal to the product of the separate probabilities of each trait. If we substitute this into Eq. 2, it yields the Naive Bayesian Classification, i.e. Eq. 3:

v_NB = argmax_{v_j in V} P(v_j) * prod_i P(a_i | v_j)    (3)

where v_NB is the Naive Bayesian Classification output for the objective function. Note that the number of terms P(a_i | v_j) that should be computed in this method is equal to the number of traits multiplied by the number of output classes of the objective function, which is much lower than the number of the terms P(a_1, a_2, ..., a_n | v_j).
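As an illustration only (this sketch is ours, not the authors' code: the function names are invented, and Laplace smoothing is added to avoid zero probabilities), the frequency-based estimation of P(v_j) and P(a_i | v_j) and the decision rule of Eq. 3 can be written in Python as:

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate P(v_j) and P(a_i | v_j) from their repetition rates in the training data."""
    class_counts = Counter(labels)        # counts for each class v_j
    cond_counts = Counter()               # counts for (trait index i, trait value a, class v)
    values_per_trait = defaultdict(set)   # distinct values seen for each trait
    for traits, v in zip(samples, labels):
        for i, a in enumerate(traits):
            cond_counts[(i, a, v)] += 1
            values_per_trait[i].add(a)
    return class_counts, cond_counts, values_per_trait

def classify_nb(traits, class_counts, cond_counts, values_per_trait):
    """Return v_NB = argmax_{v_j} P(v_j) * prod_i P(a_i | v_j), cf. Eq. 3.

    The Laplace smoothing (+1 in the numerator, +|values| in the
    denominator) is our addition to handle unseen trait values.
    """
    n = sum(class_counts.values())
    best_v, best_score = None, -1.0
    for v, cv in class_counts.items():
        score = cv / n                    # estimate of P(v_j)
        for i, a in enumerate(traits):
            score *= (cond_counts[(i, a, v)] + 1) / (cv + len(values_per_trait[i]))
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```

For example, trained on a handful of two-trait samples, classify_nb returns the class whose prior-times-likelihood product is largest.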


We conclude that naive Bayesian learning attempts to estimate the different values of P(v_j) and P(a_i | v_j) using their repetition rates in the training data. This set of estimations corresponds to the learnt hypothesis, which is then used for classifying new samples through the above formula. When the conditional independence assumption of the Naive Bayesian Classification method holds, the naive Bayesian class will be equal to the MAP class.

1.2. PSO algorithm

Each particle is searching for the optimum point. Each particle is moving, and thus has a speed. PSO is based on the particles' motion and intelligence. Each particle in every stage remembers the status that has had the best result.

A particle's motion depends on 3 factors:
1- The particle's current location
2- The particle's best location so far (pbest)
3- The best location which the whole set of particles has reached so far (gbest)

In the classical PSO algorithm, each particle i has two main parts: its current location X_i and its current speed V_i. In each repetition, the particle's change of location in the search space is based on the particle's current location and its updated speed. The particles' speeds are updated according to three main factors: the particle's current speed, the particle's best experienced location (individual knowledge), and the location of the best particle in the group (social knowledge), as in Eq. 4:

V_{i+1} = K(wV_i + C_{1i}(pbest_i - X_i) + C_{2i}(gbest_i - X_i))    (4)

where w is the i-th particle's inertia coefficient for moving with the previous speed, and C_{1i} and C_{2i} are respectively the individual and group learning coefficients of the i-th particle, which are selected randomly in the range [0, 2] for the sake of maintaining the algorithm's probabilistic property. Each particle's next location is obtained by Eq. 5:

X_{i+1} = X_i + V_{i+1}    (5)

2. Considering the presented algorithm

In the introduction, we noted that different methods have been presented for Parkinson disease diagnosis, but no attention has been paid to the quality of the training data. In this paper, we attempt to select the best training data for Naive Bayesian Classification using the PSO algorithm. The selection of the best training data is the most important part of training the Naive Bayesian Classification. This is due to the fact that we observed in our studies that adding or omitting two training data in the whole set of training data changed the disease diagnosis accuracy by 4 to 5%. The suggested method will be introduced in detail in the following. Fig. 1 shows the general procedure of the newly presented algorithm.
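As an illustration only (a generic sketch, not the authors' implementation: the function name is ours, and C1 and C2 are drawn fresh from [0, 2] at each step, following the description above), one iteration of Eqs. 4 and 5 for a single particle can be written in Python as:

```python
import random

def pso_step(x, v, pbest, gbest, w=0.4, k=1.0):
    """One particle update: Eq. 4 for the velocity, then Eq. 5 for the position.

    x, v, pbest and gbest are equal-length lists of floats; w is the inertia
    weight and k the constriction factor. C1 and C2 are drawn uniformly from
    [0, 2], as the text describes, to keep the update probabilistic.
    """
    c1 = random.uniform(0.0, 2.0)   # individual (cognitive) learning coefficient
    c2 = random.uniform(0.0, 2.0)   # group (social) learning coefficient
    v_next = [k * (w * vi + c1 * (pb - xi) + c2 * (gb - xi))   # Eq. 4
              for xi, vi, pb, gb in zip(x, v, pbest, gbest)]
    x_next = [xi + vn for xi, vn in zip(x, v_next)]            # Eq. 5
    return x_next, v_next
```

In a full swarm loop, each particle keeps its own pbest, gbest is the best pbest found by the whole swarm, and both are refreshed after every step by evaluating the fitness function.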


Fig. 1 (flowchart): Start -> Selecting the best training data and the intended parameters for naive Bayesian training using the PSO algorithm -> Naive Bayesian Classification training using the best training data and forming the Parkinson disease diagnosis model -> Parkinson disease diagnosis through the formed model -> End.

Fig 1. The procedure of the suggested method for Parkinson disease diagnosis

The general procedure is very simple. First, the best training data for Naive Bayesian Classification are selected using the PSO algorithm, and the Naive Bayesian Classification is trained with the optimum training data. Thereby, the Parkinson disease diagnosis model is formed. After the formation of the intended model, Parkinson disease is diagnosed and identified. The PSO algorithm's fitness function for the selection of the optimum training data is expressed in Eq. 6:

Fitness = sum_i | y_i - y'_i |    (6)

where y_i is the real value of the i-th test datum, and y'_i is the value that has been determined for it using Naive Bayesian Classification.

The values of the primary parameters of the PSO algorithm for the selection of the optimum training data are presented in Table 1.

Table 1. Primary values given to PSO algorithm parameters

No. | Parameter title | Parameter value used
1 | Bird in swarm | 50
2 | Number of Variable | 1
3 | Min and Max Range | 2-46
4 | Availability type | Min
5 | Velocity clamping factor | 2
6 | Cognitive constant | 2
7 | Social constant | 2
8 | min of inertia weight | 0.4
9 | max of inertia weight | 0.4

3. Experiments and results

3.1. Dataset descriptions

In this article, we used the Parkinson disease dataset from UCI, which is accessible through the link in [18]. The number of items in this dataset is 197, and its features are 23. The features used in Parkinson disease diagnosis are presented in Table 2:


Table 2. Features used in Parkinson disease diagnosis

No. | Feature(s) | Description
1 | MDVP:Fo(Hz) | Average vocal fundamental frequency
2 | MDVP:Fhi(Hz) | Maximum vocal fundamental frequency
3 | MDVP:Flo(Hz) | Minimum vocal fundamental frequency
4-8 | MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP | Several measures of variation in fundamental frequency
9-14 | MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA | Several measures of variation in amplitude
15-16 | NHR, HNR | Two measures of the ratio of noise to tonal components in the voice
17, 21 | RPDE, D2 | Two nonlinear dynamical complexity measures
18 | DFA | Signal fractal scaling exponent
19, 20, 22 | spread1, spread2, PPE | Nonlinear measures of fundamental frequency variation

3.2. The optimum training data selected for Naive Bayesian Classification using PSO algorithm

As stated in the previous sections, selecting the best training data is the most important part of Naive Bayesian Classification for increasing the accuracy of Parkinson disease diagnosis. In Table 3, the numbers of optimum training data selected by the PSO algorithm can be observed:

Table 3. The accuracy of Parkinson disease diagnosis using the optimum training data selected by the PSO algorithm

No. | Number of optimum training data selected for Naive Bayesian Classification using PSO | Classification accuracy
1 | 8 | 97.95%
2 | 10 | 96.93%
3 | 12 | 97.95%

In Table 3, some of the optimum training data selected using the PSO algorithm, along with the classification accuracy obtained through them, can be found. As can be seen in row No. 2 of Table 3, by adding two training data, the classification accuracy decreased by 1.02%. Therefore, it can be concluded that increasing the training data gives no guarantee that the classification accuracy will increase. The important point in increasing the classification accuracy is the use of optimum training data and the avoidance of noisy training data, which decrease the classification accuracy. We increased the number of training data to 50, 60, 70, 80 and 90, respectively. The accuracy obtained with these larger numbers of training data can be observed in Table 4.

Table 4. The relationship between Parkinson disease diagnosis accuracy and training data increase

No. | Number of training data | Classification accuracy
1 | 50 | 88.79%
2 | 60 | 77.55%
3 | 70 | 76.53%
4 | 80 | 69.38%
5 | 90 | 67.54%


In Table 4, we can see that with Naive Bayesian Classification, increasing the training data decreases the classification accuracy. According to the optimum training data selected by the PSO algorithm, it is concluded that with only 8 training data, the highest possible classification accuracy can be obtained for Parkinson disease diagnosis.

In Table 5, the result of the algorithm presented in this paper is compared with the results of the previous works:

Table 5. Comparison of the suggested method's accuracy and previous models of Parkinson disease diagnosis

No. | Presented works | Result and accuracy of the presented model
1 | [9] | 93.01%
2 | [11] | 93.01%
3 | [13] | 91.20%
4 | [15] | 96.01%
5 | [16][17] | 94%
6 | Proposed Algorithm | 97.95%

According to the comparison made in Table 5 between the suggested method and the previous models of Parkinson disease diagnosis, the suggested method is shown to be superior to the previous models. Based on the comparison, it can be concluded that in order to increase the classification accuracy, it is not always necessary to present a new classification method; rather, by selecting the best training data and omitting the inappropriate training data, the classification accuracy can be significantly increased.

4. Conclusion

In this paper, we suggested a new model for Parkinson disease diagnosis based on the combination of the PSO algorithm and Naive Bayesian Classification. Using the PSO algorithm, the best training data were selected for Naive Bayesian Classification. Because the presented algorithm selects the best training data and avoids choosing those that cause a drop in classification accuracy, it raises the classification accuracy of Parkinson disease diagnosis to 97.95%. This classification accuracy shows the superiority of the suggested method over the previous models of Parkinson disease diagnosis. Also, according to the results obtained in the paper, it is worth repeating that in order to increase the classification accuracy, it is not always necessary to present a new classification method; rather, by selecting the best training data and omitting the inappropriate training data, the classification accuracy can be significantly increased.

References

[1] Singh, N., Pillay, V., & Choonara, Y. E. (2007). Advances in the treatment of Parkinson's disease. Progress in Neurobiology, 81, 29-44.
[2] Ho, A. K., Iansek, R., Marigliani, C., Bradshaw, J. L., & Gates, S. (1998). Speech impairment in a large sample of patients with Parkinson's disease. Behavioural Neurology, 11, 131-138.
[3] Little, M. A., McSharry, P. E., Hunter, E. J., Spielman, J., & Ramig, L. O. (2009). Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. IEEE Transactions on Biomedical Engineering, 56(4), 1015-1022.
[4] Rahn, D. A., Chou, M., Jiang, J. J., & Zhang, Y. (2007). Phonatory impairment in Parkinson's disease: evidence from nonlinear dynamic analysis and perturbation analysis. Journal of Voice, 21, 64-71.
[5] Shahbaba, B., & Neal, R. (2009). Nonlinear models using Dirichlet process mixtures. The Journal of Machine Learning Research, 10, 1829-1850.
[6] Das, R. (2010). A comparison of multiple classification methods for diagnosis of Parkinson disease. Expert Systems with Applications, 37, 1568-1572.
[7] Sakar, C. O., & Kursun, O. (2010). Telediagnosis of Parkinson's disease using measurements of dysphonia. Journal of Medical Systems, 34, 1-9.
[8] Psorakis, I., Damoulas, T., & Girolami, M. A. (2010). Multiclass relevance vector machines: sparsity and accuracy. IEEE Transactions on Neural Networks, 21, 1588-1598.
[9] Guo, P. F., Bhattacharya, P., & Kharma, N. (2010). Advances in detecting Parkinson's disease. Medical Biometrics, 306-314.


[10] Luukka, P. (2011). Feature selection using fuzzy entropy measures with similarity classifier. Expert Systems with Applications, 38, 4600-4607.
[11] Li, D. C., Liu, C. W., & Hu, S. C. (2011). A fuzzy-based data transformation for feature extraction to increase classification performance with small medical data sets. Artificial Intelligence in Medicine, 52, 45-52.
[12] Ozcift, A., & Gulten, A. (2011). Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms. Computer Methods and Programs in Biomedicine, 104, 443-451.
[13] Åström, F., & Koker, R. (2011). A parallel neural network approach to prediction of Parkinson's disease. Expert Systems with Applications, 38, 12470-12474.
[14] Spadoto, A. A., Guido, R. C., Carnevali, F. L., Pagnin, A. F., Falcao, A. X., & Papa, J. P. (2011). Improving Parkinson's disease identification through evolutionary-based feature selection. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE (pp. 7857-7860).
[15] Chen, H.-L., Huang, C.-C., Yu, X.-G., Xu, X., Sun, X., Wang, G., & Wang, S.-J. (2013). An efficient diagnosis system for detection of Parkinson's disease using fuzzy k-nearest neighbor approach. Expert Systems with Applications, 40, 263-271.
[16] Yoneyama, M., Kurihara, Y., Watanabe, K., & Mitoma, H. (2014). Accelerometry-based gait analysis and its application to Parkinson's disease assessment, Part 1: Detection of stride event. Vol. 22, Issue 3, pp. 613-622, May 2014.
[17] Yoneyama, M., Kurihara, Y., Watanabe, K., & Mitoma, H. (2013). Accelerometry-based gait analysis and its application to Parkinson's disease assessment, Part 2: A new measure for quantifying walking behavior. Vol. 21, Issue 6, pp. 999-1005, Nov. 2013.
[18] UCI machine learning repository. http://archive.ics.uci.edu/ml/datasets/Parkinsons


Automatic Classification for Vietnamese News

Phan Thi Ha 1, Nguyen Quynh Chi 2

1 Posts and Telecommunications Institute of Technology Hanoi, Vietnam [email protected]

2 Posts and Telecommunications Institute of Technology Hanoi, Vietnam [email protected]

Abstract
This paper proposes an automatic framework to classify Vietnamese news from news sites on the Internet. In this proposed framework, the extraction of the main content of Vietnamese news is performed automatically by applying the improved extraction method from [1]. This information is then classified using two machine learning methods: the support vector machine and the naïve Bayesian method. Our experiments, implemented with Vietnamese news extracted from several sites, showed that the proposed classification framework gives acceptable results with a rather high accuracy, making it applicable to real information systems.

Keywords: news classification; automatic extraction; support vector machine; naïve bayesian networks

1. Introduction

In modern life, the need to update and use information is essential for human activities, and we can clearly see the role of information in work, education, business and research. In Vietnam, with the explosion of information technology in recent years, reading newspapers and searching for information on the Internet has become a routine for many people. Because of the many advantages of Vietnamese documents on the Internet, such as compact and long-time storage, handiness in exchange (especially through the Internet) and easy modification, the number of documents has been increasing dramatically. On the other hand, communication via books has gradually become obsolete, and the storage time of paper documents can be limited.

From that fact comes the requirement of building a system to store electronic documents to meet the needs of academic research based on the rich Vietnamese data sources on the web. However, to use and search the massive amounts of data, and to filter the texts or the parts of texts containing the desired data without losing the complexity of natural language, we cannot manually classify texts by reading and sorting each topic. An urgent issue to solve is how to automatically classify documents on Vietnamese sites. Basically, the sites contain pure text information, and processing and classifying documents by topic has been of interest and researched worldwide [2]; methods of text classification have been built to strongly support Internet users in finding information.

This paper proposes an automatic framework to classify Vietnamese news from electronic newspapers on the Internet under the Technology, Education, Business, Law and Sports fields, in order to build archives which serve the construction of an Internet electronic library of Vietnam. In this proposed framework, the extraction of the main content of Vietnamese news is performed automatically by applying the improved extraction method from [1]. This information is then classified using two machine learning methods: the support vector machine and the naïve Bayesian method. Our experiments implemented with Vietnamese news extracted from several sites showed that the proposed classification framework gives acceptable results with a rather high accuracy, making it applicable to real information systems.

The rest of the paper is presented as follows. In section 2, the methods of news classification based on automatically extracted contents of Web pages on the Internet are considered. The main content of the automatic classification method is presented in section 3. Our experiments and the results are analyzed and evaluated in section 4. The conclusions and references are in the last sections.

2. Related works and motivation

In recent years, natural language processing and document content classification have produced a lot of works with encouraging results from the research community inside and outside Vietnam.

The relevant works outside Vietnam have been published widely. In [3], a clustering algorithm was used to generate the sample data, focusing on optimization for active machine learning. An author at the University of Dortmund,


Germany presented that the use and improvement of the support vector machine (SVM) technique has been highly effective in text classification [4]. The stages in a text classification system, including indexing text documents using Latent Semantic Indexing (LSI), learning text classification using SVM, boosting, and evaluating text categorization, have been shown in [5]. "Text categorization based on regularized linear classification methods" [6] focused on methods based on linear least squares fit, logistic regression and the support vector machine (SVM). Most researchers have focused on machine learning processing for foreign languages, English in particular. In the case of applying these methods to Vietnamese documents, the results may not reach the desired accuracy.

Among the works on Vietnamese text categorization, Pham Tran Vu et al. discussed how to compute the similarity of texts based on three aspects: the text, the user, and the association with any other person or not [7]. The authors applied this technique to compute the similarity of a text compared with a training dataset. Their subsequent work referred to a profile-matching method based on latent semantic analysis (LSA). The method presented in [8] did not use an ontology but still had the ability to compare semantic relations based on statistical methods. The research works in Vietnam mentioned above have certain advantages, but the scope of their text processing is too wide and barely dedicated to a particular kind of text. Moreover, the document content from the Internet is not extracted automatically by the method proposed in [1]. Therefore, the precision of classification is not consistent and is difficult to evaluate in real settings.

To extract document content on the Internet, we must mention the field of natural language processing, a key field in science and technology. This field includes a series of Internet-related applications such as extracting information on the Web, text mining, the semantic web, text summarization and text classification. Effective exploitation of information sources on the Web has spurred the development of applications in natural language processing. The majority of sites are encoded in the Hyper Text Mark-up Language (HTML) format, in which each Web page's HTML file contains a lot of extra information apart from the main content, such as pop-up advertisements, links to other pages, sponsors, developers, copyright notices and warnings. Cleaning the text input here is considered as the process of determining the content of the sites and removing unrelated parts. This process is also known as dissection or web content extraction (WCE). However, structured websites change frequently, so extracting content from the sites becomes more and more difficult [9]. A lot of works on web content extraction have been published, with many different applications [10, 11, 12, 13, 14, 15].

In recent years, extracting the contents of sites has been researched by many groups in various countries, and their results were rather good [16, 17, 18, 19, 1]. These approaches include: HTML code analysis, pattern framework comparison, and natural language processing. The pattern framework method extracts information from two sites; this information is then aligned based on pattern recognition, as applied by Tran Nhat Quang [18], who extracted content from web sites aiming to provide information on administration web sites. The natural language processing method considers the dependence of syntax and semantics to identify relevant information and extract the needed information for other processing steps; this method is used for extracting information from web pages containing text that follows the rules of grammar. The HTML method accesses the content of the web page displayed as HTML directly and then performs the dissection in one of two ways. The first is based on the Document Object Model (DOM) tree structure of each HTML page; a data specification is then built automatically based on the dissected content. The second is based on statistical density in web documents; after dissecting the content, the data obtained become independent from the source sites and can be stored and reused for different purposes. To automatically extract text content from the web from various sources, across multiple sites with different layouts, the authors of [1] studied a method to extract web page content based on HTML tag density statistics. Given the research stated above, we would like to propose a framework for automatic classification of news in the Technology, Education, Business, Law and Sports fields. We use the method in [1], which is presented in the next section.

3. Automatic News Classification

3.1 Vietnamese web content extraction for classification

The authors have automatically collected news sites under the 5 fields from the Internet and used a content dissection method based on word density and tag density statistics of the site. The text extraction algorithm was improved from the algorithm proposed by Aidan Finn [11], and the results were rather good.

Aidan Finn proposed the main idea of the BTE algorithm as follows: identify two points i and j such that the number of HTML tag tokens before i and after j is maximum and the number of text tokens between i and j is maximum. The extraction result is the text tokens in the interval [i, j], which are then separated out.

Aidan Finn did experiments using the BTE algorithm to extract text content for textual content classification in


digital libraries, mainly collecting news articles in the fields of sports and politics on news websites. This algorithm has the advantage that the dissection does not depend on a given threshold or language, but it is not suitable for some Vietnamese news sites containing advanced HTML tags.

By observing different Vietnamese news sites, the paper [1] showed that news sites in general share a main characteristic: in each page's HTML code, the text body part contains fewer tags and many text tokens. The authors improved the BTE algorithm (by adding step 0) to extract the text body from Vietnamese news sites in order to build a Vietnamese vocabulary research corpus.

Construction of the algorithm: experimental observations show that the text body of a Web page always belongs to a parent tag located in the pair ( … ), in which HTML tags like or scripts are embedded in tags like

encode[] array, which significantly reduces the size of the binary_tokens[] array. The complexity of this algorithm is O(n).

Step 2: Locate two points i, j in the binary_tokens[] array obtained in step 1 so that the total number of elements with value -1 inside [i, j] plus the number of elements with value 1 outside [i, j] is the largest. Perform the data dissection within the range [i, j] and remove the HTML tags. The complexity of this algorithm is O(n³).

The improved BTE algorithm was tested and compared with the original algorithm proposed by Aidan Finn on the same number of sites in the test set. The experiments and results are as follows:

First time: run Aidan Finn's BTE algorithm on the HTML file obtained from each URL.
Second time: run the improved BTE algorithm on the HTML file obtained from each URL.
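The token encoding and the step 2 search described above can be sketched in code. The following is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the encoding (1 for a tag token, -1 for each word of text) follows the binary_tokens[] convention used in step 2, and the exhaustive search over all pairs i, j mirrors the stated O(n³) complexity.

```python
import re

def binary_tokens(html):
    """Encode an HTML string as BTE tokens: 1 for an HTML tag,
    -1 for each word of text (the binary_tokens[] convention above)."""
    tokens = []
    # The capturing group makes re.split keep the tags in the result.
    for part in re.split(r'(<[^>]+>)', html):
        if not part.strip():
            continue
        if part.startswith('<'):
            tokens.append(1)                         # tag token
        else:
            tokens.extend([-1] * len(part.split()))  # one token per word
    return tokens

def locate_body(tokens):
    """Brute-force step 2: find the span [i, j] maximising the count of
    text tokens (-1) inside it plus tag tokens (1) outside it.
    O(n^3), matching the complexity stated in the text."""
    n = len(tokens)
    best_score, best = float('-inf'), (0, 0)
    for i in range(n):
        for j in range(i, n):
            inside = sum(1 for t in tokens[i:j + 1] if t == -1)
            outside = sum(1 for t in tokens[:i] if t == 1) + \
                      sum(1 for t in tokens[j + 1:] if t == 1)
            if inside + outside > best_score:
                best_score, best = inside + outside, (i, j)
    return best

toks = binary_tokens('<div><a>menu</a><p>Main article text here</p><hr></div>')
print(toks)               # [1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
print(locate_body(toks))  # (5, 8)
```

On this sample the search selects the span covering the long run of text inside the paragraph and excludes the tag-heavy navigation around it, which is the body-extraction behaviour BTE aims for. The step 0 and step 1 improvements discussed in the text (shrinking the token array before searching) are omitted here.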