Applied Computing Review

Mar. 2013, Vol. 13, No. 1

Frontmatter

Editors (p. 3)
SIGAPP FY’13 Quarterly Report, S. Shin (p. 4)
A Message from the Editor, S. Shin (p. 5)

Selected Research Articles

A Flexible Approach for Abstracting and Personalizing Large Business Process Models, J. Kolb and M. Reichert (p. 6)

An Experimental Evaluation of IP4-IPV6 IVI Translation, A. Tsetse, A. Wijesinha, R. Karne, A. Loukili, and P. Appiah-Kubi (p. 19)

Fairness Scheduler for Virtual Machines on Heterogeneous Multi-Core Platforms, C. Shih, J. Wei, S. Hung, J. Chen, and N. Chang (p. 28)

Robust and Flexible Tunnel Management for Secure Private Cloud, Y. Lu and C. Kuo (p. 41)

An On-line Hot Data Identification for Flash-based Storage using Sampling Mechanism, D. Park, Y. Nam, B. Debnath, D. Du, Y. Kim, and Y. Kim (p. 51)

Applied Computing Review

Editor in Chief: Sung Y. Shin

Designer/Technical Editor: John Kim

Associate Editors: Richard Chbeir, Hisham Haddad, Lorie M. Liebrock, W. Eric Wong

Editorial Board Members

Alan Wood, Carlos Duarte, Shih-Hsi Liu, Eliot Rich, Davide Ancona, Mario Freire, Rui Lopes, Pedro Rodrigues, Apostolos Papadopoulos, Lorenz Froihofer, Marco Mamei, Raúl Giráldez Rojo, João Gama, Eduardo Marques, Alexander Romanovsky, Alessio Bechini, Xiao-Shan Gao, Paulo Martins, Agostinho Rosa, Giampaolo Bella, Karl M. Goeschka, Stan Matwin, Davide Rossi, Umesh Bellur, Claudio Guidi, Manuel Mazzara, Corrado Santoro, Ateet Bhalla, Svein Hallsteinsen, Ronaldo Menezes, Rodrigo Santos, Stefano Bistarelli, Jason Hallstrom, Marjan Mernik, Guido Schryen, Olivier Boissier, Hyoil Han, Michael Schumacher, Madhu Kumar SD, Gloria Bordogna, Ramzi A. Haraty, Fabien Michel, Jean-Marc Seigneur, Brent Auernheimer, Jean Hennebert, Dominique Michelucci, Alessandro Sorniotti, Barrett Bryant, Jiman Hong, Ambra Molesini, Nigamanth Sridhar, Alex Buckley, Andreas Humm, Eric Monfroy, Yan Lindsay Sun, Artur Caetano, Robert L. Hutchinson, Barry O'Sullivan, Junping Sun, Luís Carriço, Maria-Eugenia Iacob, Rui Oliveira, Chang Oan Sung, Rogério Carvalho, Francois Jacquenet, Andrea Omicini, Emiliano Tramontana, Andre Carvalho, Hasan Jamil, Fernando Osorio, Dan Tulpan, Matteo Casadei, Robert Joan-Arinyo, Edmundo Monteiro, Valeria Seidita, Jan Cederquist, Andy Kellens, Ethan V. Munson, Teresa Vazão, Alvin Chan, Tei-Wei Kuo, Peter Otto, Mirko Viroli, Richard Chbeir, Martin Op 't Land, Brajendra Panda, Fabio Vitali, Jian-Jia Chen, Ivan Lanese, Gabriella Pasi, Giuseppe Vizzari, Christophe Claramunt, Paola Lecca, Manuela Pereira, Denis Wolf, Yvonne Coady, Maria Lencastre, Maria G. Pimentel, W. Eric Wong, Luca Compagna, Hong Va Leong, Cosimo Antonio Prete, Kokou Yetongnon, Wm. Arthur Conklin, Ki-Joune Li, Rajaraman Kanagasabai, Osmar R. Zaiane, Massimo Cossentino, Lorie Liebrock, Rajiv Ramnath, Federico Divina, Giuseppe Lipari, Chandan Reddy

SIGAPP FY’13 Quarterly Report

January 2013 – March 2013
Sung Shin

Mission

To further the interests of the computing professionals engaged in the development of new computing applications and to transfer the capabilities of computing technology to new problem domains.

Officers

Chair: Sung Shin, South Dakota State University, USA
Vice Chair: Richard Chbeir, Bourgogne University, Dijon, France
Secretary: W. Eric Wong, University of Texas at Dallas, USA
Treasurer: Lorie M. Liebrock, New Mexico Institute of Mining and Technology, USA
Web Master: Hisham Haddad, Kennesaw State University, USA
ACM Program Coordinator: Irene Frawley, ACM HQ

Notice to Contributing Authors

By submitting your article for distribution in this Special Interest Group publication, you hereby grant to ACM the following non-exclusive, perpetual, worldwide rights:

• to publish in print on condition of acceptance by the editor
• to digitize and post your article in the electronic version of this publication
• to include the article in the ACM Digital Library and in any Digital Library related services
• to allow users to make a personal copy of the article for noncommercial, educational or research purposes

However, as a contributing author, you retain copyright to your article and ACM will refer requests for republication directly to you.

A Message from the Editor

I am happy to release the spring issue of Applied Computing Review. This issue includes two selected papers presented at the 2012 ACM Symposium on Applied Computing (SAC) and three from the 2012 ACM Research in Applied Computation Symposium (RACS). All the selected papers have been revised and expanded for inclusion in ACR. I am proud to say that all of them are of high quality, and I would like to express my appreciation to the authors for contributing state-of-the-art work in their research areas.

ACR is available to everyone interested in modern applied computing research trends. Our goal is to provide you with a platform for sharing innovative thoughts among professionals in various fields of applied computing. We are working with the ACM SIG Governing Board to further expand SIGAPP by increasing membership and developing a new journal on applied computing in the near future.

I would like to take this opportunity to announce that the first ACM European Computing Research Congress (ECRC) will be held at the Palais des Congrès in Paris from May 2nd to May 4th, 2013. More information about ECRC can be found below. In addition, I am delighted to remind you once again that the 28th SAC will be held in Coimbra, Portugal, from March 18th to March 22nd. Please join us and enjoy the conference. Your kind support and cooperation will be highly appreciated. Thank you.

Sincerely,

Sung Shin
Editor in Chief & Chair of ACM SIGAPP

ACM ECRC

Registration is now open for ECRC 2013 (http://ecrc.acm.org/), the first ACM European Computing Research Congress, held in collaboration with CHI’13 “Changing Perspectives”, May 2-4, 2013 (early registration deadline: April 3, 2013).

Organized by ACM Europe, this event co-locates multiple research conferences, workshops, and meetings to create a significant gathering of European computing researchers. The ACM ECRC plenary sessions will address important issues in European computing research. The general reception will foster an enhanced level of networking among European researchers, a tangible contribution of ACM Europe to the computing community. ACM ECRC is built around SIGCHI's CHI 2013 conference, which will be held at the Palais des Congrès in Paris. CHI'13 begins Saturday, April 27, 2013 and runs through Thursday, May 2, 2013.

Next Issue

The planned release for the next issue of ACR is June 2013.

A Flexible Approach for Abstracting and Personalizing Large Business Process Models

Jens Kolb, Ulm University, Germany, [email protected]
Manfred Reichert, Ulm University, Germany, [email protected]

ABSTRACT

In process-aware information systems (PAISs), usually, different user groups have distinguished perspectives on the business processes supported and on related business data. Hence, personalized views and proper abstractions on these business processes are needed. However, existing PAISs do not provide adequate mechanisms for creating and visualizing process views and process model abstractions. Usually, process models are displayed to users in exactly the same way as originally modeled. This paper presents a flexible approach for creating personalized views based on parameterizable operations. Respective view creation operations can be flexibly composed to either hide non-relevant process information or to abstract it. Depending on the parameterization of the selected view creation operations, one obtains process views with more or less relaxed properties, e.g., regarding the degree of information loss or the soundness of the resulting model abstractions. Altogether, the realized view concept allows for a more flexible abstraction and visualization of large business process models satisfying the needs of different user groups.

Categories and Subject Descriptors

D.2.2 [Design Tools and Techniques]: Computer-aided software engineering (CASE); H.1.2 [User/Machine Systems]: Human factors

General Terms

Management, Design, Human Factors

Keywords

Process Model Abstraction, Process View, View Update, Process Visualization, Human-oriented Business Process Management, Process Change

1. INTRODUCTION

Process-aware information systems (PAISs) provide support for business processes at the operational level [1]. A PAIS strictly separates process logic from application code, relying on explicit process models. This enables a separation of concerns, which is a well-established principle in computer science to increase maintainability and to reduce costs [2]. The increasing adoption of PAISs has resulted in large process model collections (cf. Figure 1). In turn, each process model may refer to different domains, organizational units, and user groups, and may comprise dozens or even hundreds of activities (i.e., process steps) [3]. Usually, different user groups need customized views on the process models relevant for them, enabling a personalized process model abstraction and visualization [4, 5, 6, 7]. For example, managers rather prefer an abstract process overview, whereas process participants need a more detailed view of the process parts they are involved in.

Figure 1. Complex Process Model (Partial View)

Hence, providing personalized process views is a much needed PAIS feature. Several approaches for creating process model abstractions based on process views have been proposed [8, 9]. However, none of them provides parameterizable operations to assist users in easily creating or changing process views. Furthermore, existing approaches do not consider another fundamental aspect of flexible PAISs: change and evolution [1, 2]. More precisely, it is not possible to change a large process model through editing or updating one of its view-based abstractions.

Copyright is held by the authors. This work is based on an earlier work: SAC'12 Proceedings of the 2012 ACM Symposium on Applied Computing, Copyright 2012 ACM 978-1-4503-0857-1/12/03. http://doi.acm.org/10.1145/2245276.2232043

In the proView project (http://www.dbis.info/proView), we address these challenges in an integrated and consistent way by supporting the creation and visualization of process views as well as enabling users to change a process model through updates of a related process view. In this context, all other views associated with the changed process model need to be migrated to the new model version as well. Besides view-based abstractions and changes, proView allows for alternative process model appearances (e.g., tree-, form-, and diagram-based representations) as well as interaction techniques (e.g., gesture- vs. menu-based) [10, 11, 12, 13]. Note that the proView project extends previous results from our Proviado project [14, 15] by providing sophisticated change features and process visualizations. Our overall goal is to enable domain experts to "understand" and "interact" with the (executable) process models they are involved in.

Figure 2. The proView Framework

Figure 2 gives an overview of the proView framework: a business process is captured and represented through a Central Process Model (CPM). In addition, for a particular CPM, so-called creation sets (CS) are defined. Each creation set specifies the schema and appearance of a particular process view. For defining, visualizing, and updating process views, the proView framework provides engines for visualization, change, and execution & monitoring.

The visualization engine generates a process view based on a given CPM and the information maintained in a creation set CS, i.e., the CPM schema is transformed to the view schema by applying the corresponding view creation operations specified in CS (Step 5). Afterwards, the resulting view schema is simplified by applying well-defined refactoring operations (Step 6). Finally, Step 7 customizes the visual appearance of the view, e.g., creating a tree-, form-, or activity-based appearance [10, 14].

When a user updates a view schema, the change engine is triggered (Steps 1-4), which updates the process schema of the underlying CPM and updates all associated process views. [16] gives detailed insights into the proView architecture and the view update operations based on which business process models can be changed through updating process views.

This paper focuses on the parameterizable visualization engine component (cf. Figure 2), i.e., on the provision of a flexible and parameterizable component for creating process views and process model abstractions, respectively. Such a component must cover a variety of use cases. For example, it should be possible to create views only containing activities the current user is involved in or only showing non-completed process regions. As another example, consider executable process models, which often contain technical activities (e.g., data transformation steps) to be excluded from visualization. Finally, selected process nodes may have to be hidden or aggregated to meet confidentiality needs [17].

The proView framework allows creating respective process views based on well-defined, parameterizable view operations. These rely on both graph reduction and graph aggregation techniques. While the former can be used to remove nodes from a process model, the latter are applied to abstract from certain process information (e.g., aggregating several activities into one abstract node). Additionally, proView supports the flexible composition of basic view operations to realize more sophisticated process model abstractions. The basic idea of creating process views has already been sketched in the context of Proviado [15, 18]. In this paper, we introduce more advanced view operations and their formal properties. Further, we outline their implementation. Finally, we combine elementary view creation operations, enabling parameterizable high-level view operations.

Section 2 gives background information required to understand this paper. Section 3 introduces the formal foundations of parameterizable process views. Section 4 gives insights into practical issues and presents more complex examples for defining and creating process views. Section 5 presents the proof-of-concept prototype and a first validation we conducted. Section 6 discusses related work and Section 7 concludes with a summary.
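To make the Figure 2 pipeline concrete, the following minimal sketch models a creation set as a list of view operations that a visualization engine applies to a copy of the CPM, followed by refactoring and appearance customization (Steps 5-7). All names (CreationSet, visualize, etc.) are illustrative assumptions for this sketch, not the actual proView API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-ins for the concepts of Figure 2; the real
# proView interfaces are not published in this paper.
ProcessSchema = dict          # e.g., {"nodes": [...], "edges": [...]}
ViewOperation = Callable[[ProcessSchema], ProcessSchema]

@dataclass
class CreationSet:
    """A creation set (CS): view operations plus an appearance choice."""
    operations: List[ViewOperation] = field(default_factory=list)
    appearance: str = "diagram"   # e.g., "tree", "form", "diagram"

def visualize(cpm: ProcessSchema, cs: CreationSet) -> ProcessSchema:
    view = dict(cpm)                      # never mutate the CPM itself
    for op in cs.operations:              # Step 5: apply view creation ops
        view = op(view)
    view = refactor(view)                 # Step 6: simplify the schema
    return render(view, cs.appearance)    # Step 7: customize appearance

def refactor(schema: ProcessSchema) -> ProcessSchema:
    # Placeholder for behaviour-preserving refactoring rules [21].
    return schema

def render(schema: ProcessSchema, appearance: str) -> ProcessSchema:
    # Placeholder: attach the requested visual appearance.
    return {**schema, "appearance": appearance}
```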

2. BACKGROUND

Each process is represented by a process schema consisting of process nodes and the control flow between them (cf. Figure 3). For control flow modeling, control gateways (e.g., ANDsplit, XORsplit) and control edges are used.

Definition 1 (Process Schema): A process schema is defined by a tuple P = (N, E, EC, NT, ET) where:
• N is a set of process nodes,
• E ⊂ N × N is a precedence relation (notation: e = (n_src, n_dest) ∈ E),
• EC : E → Conds ∪ {True} optionally assigns transition conditions to control edges,
• NT : N → {Activity, ANDsplit, ANDjoin, ORsplit, ORjoin, XORsplit, XORjoin} assigns to each n ∈ N a node type NT(n); N is divided into disjoint sets of activity nodes A (NT = Activity) and gateways S (NT ≠ Activity),
• ET : E → {ControlEdge, LoopEdge} assigns a type ET(e) to each edge e ∈ E.
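Definition 1 translates directly into a small data structure. The sketch below is one possible encoding in Python, assuming string node identifiers; it is illustrative only and not taken from proView.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set, Tuple

class NodeType(Enum):
    ACTIVITY = "Activity"
    AND_SPLIT = "ANDsplit"
    AND_JOIN = "ANDjoin"
    OR_SPLIT = "ORsplit"
    OR_JOIN = "ORjoin"
    XOR_SPLIT = "XORsplit"
    XOR_JOIN = "XORjoin"

class EdgeType(Enum):
    CONTROL = "ControlEdge"
    LOOP = "LoopEdge"

Edge = Tuple[str, str]  # (n_src, n_dest)

@dataclass
class ProcessSchema:
    """P = (N, E, EC, NT, ET) from Definition 1."""
    N: Set[str]
    E: Set[Edge]
    NT: Dict[str, NodeType]
    ET: Dict[Edge, EdgeType]
    EC: Dict[Edge, Optional[str]] = field(default_factory=dict)  # None = True

    def activities(self) -> Set[str]:
        # A: nodes of type Activity; the gateways S are all other nodes.
        return {n for n in self.N if self.NT[n] == NodeType.ACTIVITY}
```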

Figure 3. Example of a Process Instance (activity states: Activated, Running, Completed, Skipped; a SESE block has a single entry and a single exit)

Note that this definition focuses on the control flow perspective. In particular, it can be applied to existing activity-oriented modeling languages (e.g., BPMN). Additional view creation operations specifically addressing the data flow are presented in [19]. Furthermore, the handling of loop structures is described in [20].

We assume that a process schema has one start and one end node. Further, it has to be connected; i.e., each activity can be reached from the start node, and from each activity the end node is reachable. Finally, branches may be arbitrarily nested, but must be safe (e.g., a branch following an XORsplit must not merge with an ANDjoin).

Definition 2 (SESE): Let P = (N, E, EC, NT, ET) be a process schema and let X ⊆ N be a subset of activity nodes. The subgraph P' induced by X is called SESE (Single Entry Single Exit) fragment iff P' is connected and has exactly one incoming and one outgoing edge connecting it with P. If P' has no preceding (succeeding) nodes, P' has only one outgoing (incoming) edge.

Based on a process schema P, related process instances can be created and executed at run-time. Regarding the process instance from Figure 3, for example, activities A and B are completed, C is activated (i.e., offered as a work item in user worklists), H is running, and K is skipped (i.e., is not executed). Generally, a large number of process instances might run on a given process schema.

Definition 3 (Process Instance): A process instance I is defined by a tuple (P, NS, H) where
• P denotes the process schema on which I is running,
• NS : N → ExecutionStates := {NotActivated, Activated, Running, Skipped, Completed} describes the execution state of each node n ∈ N,
• H = ⟨e_1, ..., e_n⟩ denotes the execution history of I, where each entry e_k is related either to the start or completion of a particular process activity.

For an activity n ∈ N with NS(n) ∈ {Activated, Running}, all preceding activities must be in state Completed or Skipped, and all succeeding activities must be in state NotActivated. Further, there is a path π from the start node to n with NS(n') = Completed ∀ n' ∈ π.

3. FUNDAMENTALS ON VIEW CREATION

We first introduce basic view creation operations and reason about the properties of the resulting process view schemas. As a first example, consider the process schema from Figure 4a. Assume that each of the activity sets {A, B}, {H, I, J, K, L, M}, and {T, U, V} shall be aggregated, i.e., the process fragment induced by each activity set shall be replaced by one abstract activity. Further, assume that activity sets {E, F, G} and {R, S} shall be hidden from the user. Figure 4b shows a possible process view resulting from respective process model aggregations and reductions.

Figure 4. Example of a Process View (a: central process model with operations Aggregate(A,B), Reduce(E,F,G), Aggregate(H,I,J,K,L,M), Reduce(R,S), Aggregate(T,U,V); b: related process view with abstracted nodes AB, HIJKLM, TUV)

Generally, process views exhibit an information loss when compared to the original process (i.e., the central process model (CPM)). As an important requirement, view creation operations should have precise semantics and be applicable to both process schemas and instances. Further, it should be possible to remove process nodes (i.e., reduction) or to replace them by abstracted ones (i.e., aggregation). When creating process views, it is fundamental to preserve the structure of non-affected process regions. Finally, the effects of view creation operations should be parameterizable to meet application needs best and to be able to control the degree of information loss in a flexible manner.

We first give an abstract definition of a process view. Note that the concrete properties of such a view depend on the view operations applied and their parameterization as specified in a respective creation set (CS).

Definition 4 (Process View): Let P = (N, E, EC, NT, ET) be a process schema (i.e., central process model) with activity set A ⊆ N. Then: a process view on P is a process schema V(P) = (N', E', EC', NT', ET') whose activity set A' ⊆ N' can be derived from P by reducing and aggregating activities from A ⊆ N. Formally:
• A_U = A ∩ A' denotes the set of activities present in both P and V(P),
• A_D = A \ A' denotes the set of activities present in P, but not in V(P), i.e., reduced or aggregated activities: A_D ≡ AggrNodes ∪ RedNodes,
• A_N = A' \ A denotes the set of activities present in V(P), but not in P. Each a ∈ A_N is an abstract activity aggregating a set of activities from A:
  1. ∃ AggrNodes_i, i = 1, ..., n with AggrNodes = ∪_{i=1,...,n} AggrNodes_i,
  2. there exists a bijective function aggr with aggr : {AggrNodes_i | i = 1, ..., n} → A_N.

Using the notions from Definition 4 for a given central process model P and related view V(P), we introduce function VNode : A → A'. This function maps each process activity c ∈ A_U ∪ AggrNodes to a corresponding activity in the respective process view:

VNode(c) = c, if c ∈ A_U;
VNode(c) = aggr(AggrNodes_i), if ∃ i ∈ {1, ..., n}: c ∈ AggrNodes_i;
VNode(c) is undefined, if c ∉ A_U ∪ AggrNodes.

For each view activity c' ∈ A', VNode^-1(c') denotes the corresponding activity or the set of activities aggregated by c' in the central process model.

Finally, more complex process views are created by composing a set of view operations, which also define the semantics of the process view (cf. Section 4).
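The mapping VNode of Definition 4 can be sketched as follows; the dictionary-based representation is an assumption made for illustration, not the proView data model.

```python
from typing import Dict, FrozenSet, Optional, Set

def make_vnode(a_u: Set[str],
               aggr: Dict[FrozenSet[str], str]):
    """Build VNode for a view defined by the retained activities A_U
    and the aggregation function aggr: AggrNodes_i -> abstract activity."""
    def vnode(c: str) -> Optional[str]:
        if c in a_u:                      # c in A_U: kept as-is
            return c
        for nodes, abstract in aggr.items():
            if c in nodes:                # c in AggrNodes_i: mapped by aggr
                return abstract
        return None                       # reduced activity: undefined
    return vnode

# Example from Figure 4: {A, B} is aggregated to AB, {E, F, G} is reduced.
vnode = make_vnode(a_u={"C", "D"},
                   aggr={frozenset({"A", "B"}): "AB"})
assert vnode("A") == "AB" and vnode("C") == "C" and vnode("E") is None
```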

3.1 Creating Process Views Based on Schema Reduction

Any view management component should be able to remove activities in a process schema. For example, this is required to hide irrelevant or confidential process details from a particular user group. For this purpose, proView provides an elementary reduction operation (cf. Figure 5b). Based on it, higher-level reduction operations for hiding a set of activities are realized. Reduction of an activity (i.e., RedActivity) is realized by removing its node together with its incoming/outgoing edges from the process schema. Then, a new control edge is inserted between the predecessor and successor of the removed activity (cf. Figure 5b). For reducing activity sets, the single-aspect view operation ReduceCF is provided. Single-aspect operations focus on elements of one particular process aspect. Reduction is performed stepwise, i.e., for all activities to be removed, operation RedActivity is applied (cf. Figure 5a).

Figure 5. Reduction and Refactoring Operations (a: stepwise ReduceCF({B,C,D}); b: RedActivity within a branch; c: recalculation of transition conditions; d, e: refactoring of empty branches)

Note that the algorithms we apply are context-free and may introduce unnecessary process elements (e.g., empty branches). Respective elements are purged afterwards by applying well-defined, behaviour-preserving refactoring rules to the created view schema [21]. For example, when reducing a complete branch of a parallel branching, the resulting control edge may be removed as well (cf. Figure 5d). In case of an XOR-/OR-branching, however, the empty path (i.e., control edge) needs to be preserved; otherwise an inconsistent schema would result (cf. Figure 5b). Similarly, when applying simplification rules to XOR-/OR-branches, respective transition conditions must be recalculated (cf. Figure 5c).

Note that any reduction of activities is accompanied by an information loss, while the structure of the non-reduced schema parts, i.e., the activities present in both the process schema and the process view schema, is preserved. The latter can be expressed using the notion of order preservation. For this, we introduce a partial order relation ≼ (⊆ N × N) on a process schema P with n_1 ≼ n_2 ⇔ ∃ path π in P from n_1 to n_2, reflecting all direct and transitive control flow dependencies between any two activities.

Definition 5 (Order-Preserving Views): Let P = (N, E, EC, NT, ET) be a process schema with activity set A ⊆ N and let V(P) = (N', E', EC', NT', ET') be a view on P with activity set A' ⊆ N'. Then: V(P) is called order-preserving iff ∀ n_1, n_2 ∈ A with n_1 ≠ n_2 and n_1 ≼ n_2: n_1' = VNode(n_1) ∧ n_2' = VNode(n_2) ⇒ ¬(n_2' ≼ n_1').

This property expresses that the order of two activities in a process schema must not be reversed in a corresponding view. Obviously, the reduction operations depicted in Figures 5a and 5b are order-preserving. Generally, this property is fundamental for ensuring the integrity of process schemas and related view schemas. A stronger notion is provided by Definition 6. As we will see later, in comparison to Figures 5a and 5b there are view operations which do not comply with Definition 6.

Definition 6 (Strong Order-Preserving Views): Let P = (N, E, EC, NT, ET) be a process schema with activity set A ⊆ N and let V(P) = (N', E', EC', NT', ET') be a corresponding view with A' ⊆ N'. Then: V(P) is strong order-preserving iff ∀ n_1, n_2 ∈ A with n_1 ≠ n_2 and n_1 ≼ n_2: n_1' = VNode(n_1) ∧ n_2' = VNode(n_2) ⇒ n_1' ≼ n_2'.

3.2 Creating Process Views Based on Schema Aggregation

The aggregation operation allows merging a set of activities into one abstracted activity. Depending on the structure of the subgraph induced by the respective activities, different schema transformations have to be applied. In particular, the aggregation of non-connected activities necessitates a more complex restructuring of the original process schema. Figure 6 shows the elementary operations provided for creating aggregated views. The depicted operations follow the policy to substitute the activities in-place by an abstract node (if possible), while ensuring proper order preservation (cf. Definition 5). Note that in-place substitution is always possible when aggregating a SESE fragment (cf. Definition 2). If none of the operations from Figures 6a, 6d, and 6e can be applied, in turn, AggrAddBranch (cf. Figure 6b) is used. It identifies the nearest common ancestor and successor of all the activities to be aggregated and adds a new branch between them (cf. Figure 6b). Alternatively, aggregation of non-connected activities can be handled by applying elementary operations of type AggrSESE (cf. Section 4). Finally, when aggregating activities directly following a split node, there exist two alternatives (cf. Figure 6e): the first one aggregates activities applying AggrAddBranch, the second one shifts activities to the position preceding the split node (i.e., AggrShiftOut).

Except for AggrAddBranch, the presented operations are strongly order-preserving (cf. Definition 6). AggrAddBranch violates this property: for example, in Figure 6b, order relation D ≼ E cannot be preserved when applying this operation.

Definition 7 (Dependency Set): Let P = (N, E, EC, NT, ET) be a process schema with activity set A ⊆ N. Then: D_P = {(n_1, n_2) ∈ A × A | n_1 ≼ n_2} is denoted as the dependency set of process schema P.

We are interested in the relation between the dependency set of a process schema and a related view schema. For this purpose, let D_P be the dependency set of P and D_V(P) be the dependency set of view V(P).
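As an illustration of Definitions 5-7, the following sketch computes the dependency set ≼ via transitive reachability and checks strong order preservation for a given VNode mapping. It reuses the toy edge-list encoding from the earlier sketches and is not proView code.

```python
from typing import Callable, Optional, Set, Tuple

Edge = Tuple[str, str]

def dependency_set(nodes: Set[str], edges: Set[Edge]) -> Set[Edge]:
    """D_P = {(n1, n2) | n1 precedes n2}: direct and transitive deps."""
    reach = {n: {m for (a, m) in edges if a == n} for n in nodes}
    changed = True
    while changed:                      # naive transitive closure
        changed = False
        for n in nodes:
            extra = set()
            for m in reach[n]:
                extra |= reach[m] - reach[n]
            if extra:
                reach[n] |= extra
                changed = True
    return {(a, b) for a in nodes for b in reach[a]}

def is_strong_order_preserving(dep_p: Set[Edge], dep_v: Set[Edge],
                               vnode: Callable[[str], Optional[str]]) -> bool:
    """Definition 6: every dependency of P must survive in V(P)."""
    for (n1, n2) in dep_p:
        v1, v2 = vnode(n1), vnode(n2)
        if v1 is None or v2 is None or v1 == v2:
            continue                    # reduced, or merged into one node
        if (v1, v2) not in dep_v:
            return False
    return True
```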

Figure 6. Elementary Aggregation Operations (a: AggrSESE; b, c: AggrAddBranch; d: AggrComplBranches; e: alternatives AggrAddBranch and AggrShiftOut for activities directly succeeding a split)

We further introduce a projection of the dependencies from V(P) on P, denoted as D'_V(P). The latter can be derived by substituting the dependencies of the abstract activities by the ones of the original activities. As an example, consider AggrShiftOut in Figure 6e. We obtain D_P = {(A,B), (B,C), (C,F), (A,D), (D,E), (E,F)} and D_V(P) = {(A,BD), (BD,C), (BD,E), (C,F), (E,F)}. Further, D'_V(P) = {(A,B), (B,C), (D,C), (C,F), (A,D), (B,E), (D,E), (E,F)} holds. As one can see, D'_V(P) contains additional dependencies. We denote this property as dependency-generating.

Calculation of D'_V(P): For n_1 ∈ A_N, n_2 ∈ A': remove all (n_1, n_2) ∈ D_V(P) and insert {(n, n_2) | n ∈ aggr^-1(n_1)} instead (analogously for n_2 ∈ A_N); finally, insert the dependencies between the aggregated activities themselves, i.e., D_P[AggrNodes] = {d = (n_1, n_2) ∈ D_P | n_1 ∈ AggrNodes ∧ n_2 ∈ AggrNodes}.

Now we can classify the effects on the dependencies between activities when building a view.

Definition 8 (Dependency Relations): Let P = (N, E, EC, NT, ET) be a process schema and V(P) be a corresponding view schema. Let further D_P and D'_V(P) be the dependency sets as defined above. Then:
• V(P) is denoted as dependency-erasing iff there are dependency relations in D_P not existing in D'_V(P) anymore.
• V(P) is denoted as dependency-generating iff D'_V(P) contains dependency relations not existing in D_P.
• V(P) is denoted as dependency-preserving iff it is neither dependency-erasing nor dependency-generating.

Generally, reduction operations are dependency-erasing. When aggregating activities, however, there exist elementary operations of all three types. In Figure 6, for example, AggrSESE is dependency-preserving, while AggrAddBranch is dependency-erasing since (B,C) ∉ D'_V(P) holds. Finally, AggrShiftOut is dependency-generating: (B,E) ∈ D'_V(P).

Figure 7. Aggregation of Data Elements (data edges connecting activities with data elements D1-D3 are re-linked when aggregating {B,C,D,E})

Theorem 1 expresses the relation between the dependency properties (cf. Definition 8) and the order-preservation property (cf. Definition 5).

Theorem 1: Let P = (N, E, EC, NT, ET) be a process schema and let V(P) be a corresponding view. Then:
i) V(P) is dependency-erasing ⇒ V(P) is not strong order-preserving.
ii) V(P) is dependency-preserving ⇒ V(P) is strong order-preserving.

The proof of Theorem 1 is based on the definition of the properties and dependency sets.

Proof: Let P = (N, E, EC, NT, ET) be a process schema and V(P) be a corresponding process view. Then:
i) V(P) is dependency-erasing, i.e., a relation d = (a, b) ∈ D_P exists with d ∉ D'_V(P) (w.l.o.g., a ∈ AggrNodes and b ∉ AggrNodes). Hence, ¬(a ≼ b) with respect to the projected dependencies. Let a' = VNode(a) and b' = VNode(b), i.e., a' ∈ A_N and b' ∈ A_U. Then ¬(a' ≼ b') based on the calculation of D'_V(P). If V(P) were strong order-preserving, the dependency relation a' ≼ b' would have to exist; a contradiction.
ii) V(P) is dependency-preserving, i.e., for all dependency relations d = (a, b) ∈ D_P: a ≼ b. The calculation of D'_V(P) results in a' ≼ b' with a' = VNode(a) and b' = VNode(b). The claim follows from the transitivity of the relation ≼.

To conclude, we have presented a set of elementary aggregation operations. Each of them fits a specific ordering of the activities to be aggregated. In Section 4.1 we combine these operations into more complex ones utilizing the discussed properties. Furthermore, this section has focused on the control flow schema of a process view. Generally, additional process aspects must be covered, including data elements, data flow, and process attributes (cf. Section 3.4) as defined for the different process elements (e.g., activities or data elements). When creating a process view, proView considers these aspects as well. Regarding data elements, for instance, Figure 7a shows an example of a simple aggregation; in this example, data edges connecting activities with data elements are re-linked when aggregating {B, C, D, E} in order to preserve a valid model. More details about view operations covering the data flow perspective and corresponding correctness issues are discussed in [19].
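The classification of Definition 8 can be tested mechanically. The sketch below projects D_V(P) back onto the original activities (the D'_V(P) calculation) and labels the view; the encoding follows the toy helpers above and is only a sketch, not the authors' implementation.

```python
from typing import Dict, FrozenSet, Set, Tuple

Dep = Set[Tuple[str, str]]

def project_dependencies(dep_v: Dep,
                         aggr: Dict[str, FrozenSet[str]],
                         dep_p: Dep) -> Dep:
    """D'_V(P): replace abstract activities by their member activities,
    then re-insert the original dependencies among aggregated activities."""
    projected: Dep = set()
    for (n1, n2) in dep_v:
        sources = aggr.get(n1, frozenset({n1}))  # aggr: abstract -> members
        targets = aggr.get(n2, frozenset({n2}))
        projected |= {(s, t) for s in sources for t in targets}
    members = set().union(*aggr.values()) if aggr else set()
    projected |= {(a, b) for (a, b) in dep_p if a in members and b in members}
    return projected

def classify(dep_p: Dep, dep_v_projected: Dep) -> str:
    erasing = bool(dep_p - dep_v_projected)
    generating = bool(dep_v_projected - dep_p)
    if not erasing and not generating:
        return "dependency-preserving"
    if erasing and generating:
        return "dependency-erasing and dependency-generating"
    return "dependency-erasing" if erasing else "dependency-generating"

# AggrShiftOut example from Figure 6e: BD aggregates {B, D}.
dep_p = {("A","B"),("B","C"),("C","F"),("A","D"),("D","E"),("E","F")}
dep_v = {("A","BD"),("BD","C"),("BD","E"),("C","F"),("E","F")}
dprime = project_dependencies(dep_v, {"BD": frozenset({"B","D"})}, dep_p)
print(classify(dep_p, dprime))  # -> "dependency-generating"
```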

3.3 Applying Views to Process Instances

So far, we have only considered views on process schemas. This section additionally introduces views on process instances (cf. Definition 3). When creating respective instance views, their execution state (i.e., the states of concrete and abstract activities) must be determined, and the trace relating to the instance view logically needs to be adapted. Examples are depicted in Figure 8a (reduction) and Figure 8b (aggregation). Function VNS then calculates the state of an abstract activity based on the state set X of the aggregated activities:

VNS(X) = NotActivated, if ∀ x ∈ X: NS(x) ∉ {Activated, Running, Completed} ∧ ∃ x ∈ X: NS(x) = NotActivated;
VNS(X) = Activated, if ∃ x ∈ X: NS(x) = Activated ∧ ∀ x ∈ X: NS(x) ∉ {Running, Completed};
VNS(X) = Running, if ∃ x ∈ X: NS(x) = Running ∨ ∃ x_1, x_2 ∈ X: NS(x_1) = Completed ∧ (NS(x_2) = NotActivated ∨ NS(x_2) = Activated);
VNS(X) = Completed, if ∀ x ∈ X: NS(x) ∉ {NotActivated, Running, Activated} ∧ ∃ x ∈ X: NS(x) = Completed;
VNS(X) = Skipped, if ∀ x ∈ X: NS(x) = Skipped.

Definition 9 (View on a Process Instance): Let I = (P, NS, H) be an instance of schema P = (N, E, EC, NT, ET) and let V(P) = (N', E', EC', NT', ET') be a view on P with corresponding aggregation and reduction sets AggrNodes and RedNodes (cf. Definition 4). Then: view V(I) on I is a tuple (V(P), NS', H') with:
• NS' : N' → ExecutionStates with NS'(n') = VNS(VNode^-1(n')) assigns to each view activity a corresponding execution state.
• H' is the reduced/aggregated history of the instance. It is derived from H by (1) removing all entries e_i related to activities in RedNodes and (2) for all j: replacing the first (last) occurrence of a start event (end event) of activities in AggrNodes_j and removing the remaining start (end) events.

Examples are depicted in Figure 8. Figure 8c shows the scenario from Figure 6e with execution states added. Note that applying AggrShiftOut yields an inconsistent state, as two subsequent activities are in state Running.

Definition 10 (State Consistency): Let I = (P, NS, H) be a process instance and let V(I) be a corresponding view on I. Then:
• V(I) is strong state-consistent iff for all paths π (cf. Section 2) in V(I) from start to end not containing activities in state Skipped, there exists exactly one activity in state Activated or Running.
• V(I) is state-consistent iff for all paths π in V(I) from start to end not containing activities in state Skipped, there does not exist more than one activity in state Activated or Running.

As indicated in Figure 8a, reducing activities from a process instance may result in a "gap" during instance execution if no activity is in state Running or Activated. Hence, reduction is not strong state-consistent. Theorem 2 shows how state inconsistency is correlated with the dependency relations (cf. Definition 8):

Theorem 2: Let I be a process instance and V(I) a corresponding view. V(I) is dependency-generating ⇒ V(I) is not state-consistent.

Proof: Let I = (P, NS, H) be a process instance of process P = (N, E, EC, NT, ET) and V(I) = (V(P), NS', H') be a view on I with V(P) = (N', E', EC', NT', ET') being dependency-generating. Then there exist n_1', n_2' ∈ N' in V(P) which generate a dependency, i.e., n_1' ≼ n_2'. Let n_1 = VNode^-1(n_1') ∈ N and n_2 = VNode^-1(n_2') ∈ N. Then: ¬(n_1 ≼ n_2). Further, there are two paths in P from start to end containing n_1 and n_2, respectively. Therefore, there exists a state of instance I with NS(n_1) = Running and NS(n_2) = Running. Thus, NS'(n_1') = Running and NS'(n_2') = Running. Since there is a path from start to end in V(P) containing both n_1' and n_2', the claim follows.

Theorem 2 shows that AggrShiftOut always causes an inconsistent execution state, whereas the application of operation AggrAddBranch maintains a consistent state.

Figure 8. View Operations for Process Instances (a: RedSESE with history H reduced to H'; b: AggrSESE with aggregated history; c: alternatives AggrAddBranch and AggrShiftOut with execution states)

Concerning process instances, proView also allows aggregating collections of them; i.e., multiple instances of the same process schema may be condensed to an aggregated one in order to provide abstracted information regarding their progress or key performance data.

3.4 View Operations Affecting Attributes of Process Elements

Process nodes are not elementary, but constitute complex objects comprising various process attributes (e.g., attributes of an activity may be cost, start time, and end time). In Figure 9, activities A, B, and C are aggregated into an abstracted activity ABC. When applying this aggregation, related attributes must be aggregated as well. For this purpose, proView provides transformation functions, which may be applied to aggregate attributes; e.g., function CONCAT allows concatenating the names of aggregated activities.
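Function VNS is straightforward to implement; the following sketch mirrors the case distinction above (the order of the checks matters: Running takes precedence over Completed). It is a minimal illustration, not the proView code.

```python
from enum import Enum
from typing import Iterable

class State(Enum):
    NOT_ACTIVATED = 1
    ACTIVATED = 2
    RUNNING = 3
    SKIPPED = 4
    COMPLETED = 5

def vns(states: Iterable[State]) -> State:
    """State of an abstract activity from the states of its members."""
    xs = list(states)
    if all(s == State.SKIPPED for s in xs):
        return State.SKIPPED
    if State.RUNNING in xs or (
            State.COMPLETED in xs and
            any(s in (State.NOT_ACTIVATED, State.ACTIVATED) for s in xs)):
        return State.RUNNING          # partially executed fragment
    if State.COMPLETED in xs:
        return State.COMPLETED        # rest are Completed or Skipped
    if State.ACTIVATED in xs:
        return State.ACTIVATED        # no Running/Completed present
    return State.NOT_ACTIVATED

assert vns([State.COMPLETED, State.RUNNING]) == State.RUNNING
assert vns([State.COMPLETED, State.SKIPPED]) == State.COMPLETED
```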

Table 1. Properties of View Creation Operations (matrix listing, for the operations RedActivity, AggrSESE, AggrComplBranches, AggrShiftOut, AggrAddBranch, and AggrAttr, whether they are order-preserving, strong order-preserving, dependency-preserving, dependency-erasing, dependency-generating, state-consistent, and strong state-consistent)

In the context of a process schema, attributes may be grouped into four categories:

• C1 (Activity State): This category includes process attribute VNS representing the execution state of a process node (cf. Section 3.3).
• C2 (Default Attributes): This category represents default attributes that are common to all process nodes. For example, each process node comprises attributes describing its name, start, and end time.
• C3 (Type-specific Attributes): This category comprises all process attributes which are only available for a specific type of process node (e.g., activity or data element). Attribute cost, for example, is only available for activities, but is undefined for gateways.
• C4 (Other Attributes): This category comprises all process attributes not available for all occurrences of a specific type of process node. For example, process attribute personnel cost may only be available for activities executed by a human resource.

Figure 9. Aggregation of Attributes (activities A, B, C with name, start, end, and cost attributes are merged into ABC using the transformation functions name = concat(n_i.name), start = min(n_i.start), end = max(n_i.end), cost = sum(n_i.cost))

Generally, there exist two application scenarios in which transformation functions are applied to process attributes:

• AS1 (Integrated): Transformation functions are integrated with view creation operations (as indicated in Figure 9). This scenario is particularly relevant for aggregation operations.
• AS2 (Stand-Alone): Transformation functions are applied to aggregate or reduce selected process attributes in the respective process view.

To discuss the application of transformation functions, Definition 1 has to be enriched with attributes.

Definition 11 (Process Schema with Attributes): A process schema is defined by a tuple P = (N, E, EC, NT, ET, attr, val) with:
• N, E, EC, NT, ET as defined in Definition 1 and A the set of supported attributes,
• attr : N ∪ E → AS assigns to each process element a corresponding attribute set AS ⊆ A,
• val : (N ∪ E) × AS → valueDomain(AS) assigns to an attribute a ∈ AS of process element n ∈ (N ∪ E) a respective value: val(n, a) = value of a, if a ∈ attr(n); val(n, a) = null, if a ∉ attr(n).

Accordingly, a process view with attributes is denoted as V(P) = (N', E', EC', NT', ET', attr', val'). Further, A.x denotes process attribute x of process node A.

In the following, view creation operations considering process attributes are presented:

Attribute Reduction Operation: View creation operation ReduceAttr(A.x) removes attribute A.x in the corresponding process view. For example, in Figure 10, attribute A.z is reduced when creating the respective process view. Generally, function ReduceAttr(A.x) may be applied to hide specific attributes from a user for privacy reasons [17].

Figure 10. Combination of Attribute Operations (AggrSESE({B,C}) combined with AggrAttr({A.x, A.y}, SUM) and ReduceAttr(A.z))

Attribute Aggregation Operation: View creation operation AggrAttr(AS', func) combines a set of process attributes AS' into one attribute using transformation function func.

Definition 12 (Transformation Function): Based on a set of attribute values, transformation function func calculates a new attribute value. Let W_i with i = 1, ..., n and W be the value domains of the selected attributes as well as of the abstracted attribute: func : W_1 × ... × W_n → W.

Depending on the type of attributes to be aggregated, various transformation functions can be applied. For example, a transformation function aggregating activity execution states (i.e., VNS) has been introduced in Section 3.3. Table 2 summarizes the transformation functions supported by proView. For example, function SUM adds numerical values of attributes, returning their sum (cf. Figure 10).
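A few of the transformation functions from Table 2, applied the way Figure 9 suggests, can be sketched as follows (plain Python with illustrative names; the real proView function catalogue is richer).

```python
from datetime import date

# A small catalogue of transformation functions (cf. Table 2).
TRANSFORMATIONS = {
    "SUM": sum,
    "MIN": min,
    "MAX": max,
    "AVG": lambda vs: sum(vs) / len(vs),
    "COUNT": len,
    "CONCAT": lambda vs: "".join(str(v) for v in vs),
    "FIRST": lambda vs: vs[0],
    "LAST": lambda vs: vs[-1],
}

def aggr_attr(nodes: list, attribute: str, func_name: str):
    """AggrAttr(AS', func): fold one attribute over the aggregated nodes,
    skipping null values (attributes a node does not carry)."""
    values = [n[attribute] for n in nodes if n.get(attribute) is not None]
    return TRANSFORMATIONS[func_name](values)

# Example of Figure 9: aggregating A, B, C into ABC.
a = {"name": "A", "start": date(2012, 8, 1), "end": date(2012, 8, 5), "cost": 2500}
b = {"name": "B", "start": date(2012, 8, 7), "end": date(2012, 8, 14), "cost": 4100}
c = {"name": "C", "start": date(2012, 8, 6), "end": date(2012, 8, 7), "cost": 600}
abc = {
    "name": aggr_attr([a, b, c], "name", "CONCAT"),   # "ABC"
    "start": aggr_attr([a, b, c], "start", "MIN"),    # 2012-08-01
    "end": aggr_attr([a, b, c], "end", "MAX"),        # 2012-08-14
    "cost": aggr_attr([a, b, c], "cost", "SUM"),      # 7200
}
```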

Control Flow Aggregation Operations: As aforementioned, when aggregating activities (e.g., using AggrSESE), their attributes need to be aggregated as well (cf. Figure 10). For this purpose, transformation functions, as introduced in Definition 12, are applied. To automatically aggregate the attributes of different activities, proView provides default transformation functions. For example, attribute start (end) time may be aggregated using the MIN (MAX) transformation function (cf. Table 2). For aggregating activity names, function CONCAT can be applied. Generally, it is possible to override such standard behaviour in a given context.

Finally, when aggregating XOR/OR branches, we must take into account that not all branches will be executed in all cases. For example, summing up all available attribute values in such a situation might result in erroneous attribute aggregation. Regarding numerical transformation functions, execution probabilities of the different branches are used to calculate the expected attribute values. If such values are not available, one might initially assume that all branches are selected with the same probability. Note that for textual transformation functions, this is not required.

Table 2. Transformation Functions for Attribute Values

Transformation functions for numeric values:
SUM: sum of attribute values
AVG: average of attribute values
MIN: minimum of attribute values
MAX: maximum of attribute values
COUNT: number of attributes

Transformation functions for textual values:
CONCAT: concatenation of attribute values
FIRST: first value of attribute list
LAST: last value of attribute list
MAXFREQ: most frequently used attribute value
MINFREQ: least frequently used attribute value
RANDOM: random attribute value

4. ADVANCED VIEW CREATION CONCEPTS

To enable more complex view creation operations, the elementary operations presented in Section 3 may be combined. Table 1 summarizes the view properties we can guarantee for the elementary operations described. Based on this, we may reason about the properties of the views resulting from the combined use of elementary operations. For any process visualization component, however, manual selection of the elementary operations applied in the given context is inconvenient for users. Note that this would require in-depth knowledge of the different operations and their semantics. To tackle this challenge, proView additionally provides single-aspect view operations on top of elementary operations. In turn, this allows us to cope with more complex use cases as well. Single-aspect operations analyze the context of the activities to be reduced or aggregated in a process schema, and they automatically determine the appropriate elementary operations required to build the view with the desired properties.

4.1 Parameterizable View Creation Operations

One major use case of our process view framework is process visualization. However, other use cases (e.g., process modeling) are covered as well. Thus, different requirements regarding the properties of the resulting view schema exist. In one of our case studies in the automotive domain, for example, we have shown that for the visualization of large process models, minor inconsistencies or information loss will be tolerated by users as long as an appropriate visualization can be obtained for them. Opposed to this, inconsistencies will not be accepted if process updates based on views shall be enabled [16].

To deal with these varying requirements, proView extends single-aspect operations with a parameter set. This allows specifying the properties of the resulting view schema. The parameters of operation AggregateCF, for example, are summarized in Table 3; e.g., when aggregating activities directly succeeding an ANDsplit (cf. Figure 6e) and requiring state consistency of the resulting view, AggrAddBranch has to be chosen (cf. Figure 11a).

Table 3. Overview of Parameters for AggregateCF

dependencies (values: preserving, non-erasing, non-generating, any): the applied view operation should be dependency-preserving, not dependency-erasing, or not dependency-generating; otherwise, no restrictions regarding dependencies are considered.
exec. states (values: inconsistent, consistent): the applied view operation should be state-consistent or may be state-inconsistent.
strategy (values: as-is, subdivide, expand): the activity set should be aggregated as-is, or may be subdivided or expanded.

In certain cases, the specified parameters might be too strict, i.e., no elementary view operations exist to realize the desired properties. We provide two strategies for addressing such scenarios. The first one subdivides the set of activities until elementary operations can be applied and the desired properties can be ensured. The second strategy expands the activity set to achieve this. Regarding reduction, in turn, view generation is always possible due to the way activity sets are split into single activities and the application of RedActivity.

Figure 11 illustrates the use of the single-aspect operation AggregateCF. It depicts a process schema together with the set of activities to be aggregated. The view operation analyzes the structure of the activities and determines which elementary operation shall be applied. If the application of these operations results in a view schema complying with the properties defined by the desired parameters (dependencies, execution states), it is applied as shown in Figure 11a. If parameter strategy forces us to process the set of activities as-is and an appropriate operation cannot be found, view generation is aborted with an error message. Figure 11b shows the result we obtain when expanding the activity set to be aggregated to the minimum SESE block that contains all activities to be aggregated. Note that this strategy has been proposed in the literature as well [22, 23]. Generally, for visualization purposes it is not always acceptable to aggregate activities originally not contained in the aggregation set.

Figures 11c and 11d can be derived by subdividing the set of activities to be aggregated. This is done stepwise: first, all connected fragments are identified (cf. Figure 11c). If the aggregation of these fragments does not meet the required properties, the fragments are further subdivided until each subset constitutes a SESE (cf. Figure 11d).

Figure 11. Views Depending on Quality Parameters (a: strategy = as-is with default parameters, applying AggrAddBranch({B,F,G,J,M,N,P,Q,R}); b: strategy = expand with dependencies = preserving and states = consistent, applying AggrSESE to the minimum enclosing SESE block {B,...,S}; c: strategy = subdivide with default parameters, applying AggrShiftOut({B,J,M,N}), AggrSESE({F,G}), and AggrSESE({P,Q,R}); d: strategy = subdivide with dependencies = preserving and states = consistent, applying AggrSESE to {B}, {J}, {M,N}, {F,G}, and {P,Q,R})

Altogether, parameterization of view operations significantly increases the flexibility of our view creation approach. Further, it allows defining exactly the view properties to be preserved. Considering reduction, a parameterization at the level of single-aspect operations is not useful, since reduction of a complex set of activities can be realized by calling RedActivity repeatedly, as explained in Section 3.1.
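The parameterized selection behind AggregateCF can be pictured as a lookup against a per-operation property table. The sketch below hard-codes only the properties stated in Section 3 (dependency behaviour per Definition 8, state consistency per Theorem 2) and returns the first structurally applicable candidate compatible with the requested parameters; it is an assumption-laden illustration, not the proView dispatcher.

```python
from typing import List, Optional

# Properties stated in Section 3 for three elementary aggregation
# operations. Simplified: real applicability also depends on the
# structure of the selected activity set.
OPERATION_PROPERTIES = {
    "AggrSESE":      {"dependencies": "preserving", "state_consistent": True},
    "AggrAddBranch": {"dependencies": "erasing",    "state_consistent": True},
    "AggrShiftOut":  {"dependencies": "generating", "state_consistent": False},
}

def select_operation(candidates: List[str],
                     dependencies: str = "any",
                     states: str = "inconsistent") -> Optional[str]:
    """Pick the first candidate satisfying the AggregateCF parameters
    (cf. Table 3); None models 'abort with an error message' under
    strategy = as-is."""
    for op in candidates:
        props = OPERATION_PROPERTIES[op]
        if dependencies == "preserving" and props["dependencies"] != "preserving":
            continue
        if dependencies == "non-erasing" and props["dependencies"] == "erasing":
            continue
        if dependencies == "non-generating" and props["dependencies"] == "generating":
            continue
        if states == "consistent" and not props["state_consistent"]:
            continue
        return op
    return None

# Aggregating activities directly succeeding an ANDsplit (Figure 6e):
# requiring state consistency rules out AggrShiftOut.
print(select_operation(["AggrShiftOut", "AggrAddBranch"], states="consistent"))
# -> AggrAddBranch
```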

APPLIED COMPUTING REVIEW MAR. 2013, VOL. 13, NO. 1 13 Central Process Model AGGREGATE({B,F,G,J,M,N,P,Q,R}) 4.2 A Leveled Operational Approach for Re- (Single-Aspect Operation) s F G C D E alizing Views t H I A B J K L T P Q R S So far, we have presented a set of elementary and single- M N O aspect view creation operations. Additionally, proView of- s F G fers high-level operations hiding as much complexity from a C D E t H I end-users as possible. As configuration parameter, the single- A B J K L T P Q R S aspect view operations take the sets of activities to be re- View Parameterization: M N O strategy = as-is AggrAddBranch states = default duced or aggregated, and then determine appropriate ele- dependencies = default ({B,F,G,J,M,N,P,Q,R}) (Elementary Operation) mentary operations. What is still needed are view opera- BFGJMNPQR C D E s tions allowing for a predicate-based specification of the re- t H I A K L T spective activity sets. Besides, operations with built-in in- S O telligence are useful, e.g. ”show only the activities of a par- ticular user role”. s F G b C D E t H I A B J K L T P Q R S To meet these requirements, proView organizes view oper- View Parameterization: M N O strategy = expand AggrSESE({B,C,D,E,F,G,H, ations into four layers (cf. Figure 12). Thereby, high-level states = consistent I,J,K,L,M,N,O,P,Q,R,S}) dependencies = preserving operations may access lower-level ones. For defining a view, (Elementary Operation) A BCDEFGHIJKLMNOPQRS T operations from all layers can be used.

s F G c C D E High-level Operations ShowMyActivities AggrExecutedPart ... t H I A B J K L T Multi-Aspect Operations Aggregate Reduce ... P Q R S AggrAddBranch({B,J,M,N}) M N O Single-Aspect Operations AGGREGATECF REDUCECF ... AggrSESE({F,G}) AggrShiftOut({B,J,M,N}) Attribute Op .

AggrSESE({P,Q,R}) AggrSESE({F,G}) AggrSESE AggrAddBranch AggrComplBranch Data Flow Op . FG Elementary Operations s AggrSESE({P,Q,R}) AggrShiftOut RedActivity ... C D E t H I A K L T Process PQR S View Parameterization: Control Flow Attributes Application Data O strategy = subdivide states = default BJMN dependencies = default View Parameterization: Figure 12. Multi-Layer View Operations strategy = subdivide s FG states = consistent C D E dependencies = default t H I A BJMN K L T PQR S Elementary operations are designed for a specific ordering of O the selected activity set within the process schema (cf. Sec- s F G tion 3). Single-aspect operations receive a set of activities d C D E t H I as input to be processed. They analyze the structure of the A B J K L T P Q R S View Parameterization: M N O activities in the process schema and select the appropriate el- strategy = subdivide AggrSESE({B}) states = consistent AggrSESE({J}) ementary operations based on the chosen parameterization. dependencies = preserving AggrSESE({M,N}) In turn, multi-aspect operations consider elements of differ- s FG AggrSESE({F,G}) C D E AggrSESE({P,Q,R}) ent type (e.g., activities, data elements) and delegate their t H I A B J K L T processing to single-aspect operations. High-level operations PQR S MN O abstract from the aggregations or reductions neccessary to build a particular view: AggrExecutedPart only shows those parts of the process schema, that still may be executed, and Figure 11. Views Depending on Quality Parameters aggregates already finished activities. Figure 13 shows an example illustrating the way high-level operation AggrExe- cutedPart is translated into a combination of single-aspect been proposed in literature as well [22, 23]. Generally, for vi- and elementary operations, respectively. In a first step, all sualization purposes it is not always acceptable to aggregate activities with completed and skipped execution states are activities originally not contained in the aggregation set. determined (i.e., activites A, B, C, E, H, and I). Based on Figure 11cd can be derived by subdividing the set of ac- this activity set AggregateCF with the following parameter tivities to be aggregated. This is done stepwise: First, all settings is applied: strategy=as-is, states=default, and connected fragments are identified (cf. Figure 11c). If the dependencies=default (cf. Section 4.1). Finally, in the aggregation of these fragments does not meet the required resulting process view the execution states are set. properties, the fragments are further subdivided until each subset constitutes a SESE (cf. Figure 11d). Initial Process B C D 1.High-level Operation: AggrExecutedPart A E F G Altogether, parameterization of view operations significantly K 2.Multi-Aspect Operation: H I J Aggregate(A,B,C,E,H,I) increases the flexibility of our view creation approach. Fur- Result D 3.Single-Aspect Operation: ther, it allows defining exactly the view properties to be AGGREGATECF(A,B,C,E,H,I) preserved. Considering reduction, a parameterization at the ABCEHI F G K Elementary Operation: level of single-aspect operations is not useful since reduc- J 4.AggrShiftOut(A,B,C,E,H,I) tion of a complex set of activities can be realized by calling RedActivity repeatedly as explained in Section 3.1. Figure 13. Example of AggrExecutedPart

Another high-level operation, ShowMyActivities, extracts ex- actly those parts of a process model the user is involved in.

Figure 14 shows an example of applying this operation to create a personalized process view for user U1. Operation ShowMyActivities first determines all activities the selected user U1 is not involved in and reduces them. Finally, refactoring operations are applied to simplify the control flow.

Figure 14. Example of ShowMyActivities (steps shown in the figure: 1. high-level operation ShowMyActivities(U1), which determines all activities user U1 is not involved in; 2. multi-aspect operation Reduce on this activity set; 3. single-aspect operation ReduceCF; 4. elementary operation RedActivity(x) for each activity x in the set; 5. refactoring operation removing unnecessary control-flow structures; 6. final process view for user U1)

Note that proView supports additional operations at the different levels to cover data flow and attributes as well, e.g., to handle adjacent data elements when aggregating activities (remove, aggregate, or maintain) [16, 20].

5. EVALUATION
The proView framework presented in this paper has been implemented as a proof-of-concept prototype in a client-server application. This prototype enables users to simultaneously create and change process models based on process views [24, 25]. Overall, the proView prototype demonstrates the applicability of our framework (cf. Figure 15).

Figure 15. Proof-of-Concept Prototype

We further applied this prototype in an industry project, e.g., to visualize the order handling process of a mid-sized company for different user groups. This process consists of 56 activities and involves six different user roles. In the top right, Figure 15 shows this process; on the bottom right, an automatically generated view for an involved electrical-electronic engineer is displayed. This view is generated through the high-level operation ShowMyActivities.

We automatically created views for each involved user, measuring the complexity of the resulting view schemas based on well-known process metrics, i.e., the number of activities, the number of gateways, and the McCabe metric measuring the complexity of the control-flow schema [26, 27]. The results are depicted in Table 4. The first row shows the metrics for the initial process model Order Process. The rows below list the calculated metrics of the automatically generated views for the user roles clerk, accountant, electrical-electronic engineer, mechanical engineer, and project manager.

Table 4. Abstraction through High-Level Operation

  Process               #Activities   #Gateways   McCabe
  CPM: Order Process    56            14          8
  V1: Clerk             2             0           0
  V2: Accountant        17            2           1
  V3: EE Engineer       17            2           2
  V4: Mech. Engineer    11            0           0
  V5: Proj. Manager     9             2           1

As can be seen, providing personalized process views to users reduces process complexity for them. In the given scenario, the number of activities is reduced to 2-17 depending on the respective view, i.e., to 3%-30% of the initial process model. Furthermore, the number of gateways decreases from initially 14 to 0-2 gateways, i.e., the resulting personalized process views have at most one branching (i.e., one split and one join gateway). Finally, the McCabe metric for control-flow complexity is between 0 and 2 in the process views, which is a significant decrease. The calculated metrics show that both process model size and complexity are reduced in the process views.

Overall, this evaluation has shown promising results. In particular, it becomes easier for process participants to understand those process aspects relevant for them.
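For concreteness, the three metrics reported in Table 4 can be computed directly on a view's control-flow graph. The sketch below assumes a plain node/edge-list representation rather than the proView data structures, and computes the McCabe (cyclomatic) number in the common form |E| - |N| + 2 for a connected control-flow graph; the paper does not spell out the exact variant it uses [26, 27]:

```python
def view_metrics(nodes, edges):
    """nodes: list of (node_id, kind) pairs, with kind in {"activity",
    "gateway", "start", "end"}; edges: list of (src, dst) control-flow arcs."""
    activities = sum(1 for _, kind in nodes if kind == "activity")
    gateways = sum(1 for _, kind in nodes if kind == "gateway")
    mccabe = len(edges) - len(nodes) + 2  # cyclomatic complexity
    return {"#Activities": activities, "#Gateways": gateways, "McCabe": mccabe}
```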

6. RELATED WORK
IEEE 1471 recommends user-specific viewpoints for software architectures [28]. These viewpoints are templates from which individual views are created for a concrete software architecture. Since this standard does not define any methods, tools, or processes, proView could provide a powerful framework in the context of PAIS. [29] introduces a meta model for views and gives a general overview of process view patterns. However, no implementation is provided.

Some view creation approaches deal with inter-organizational processes and apply views to create abstractions of private processes hiding sensitive process parts [30, 9, 31, 32]. In particular, views are specified by the designer. [8] presents an approach with predefined view types (i.e., human tasks, collaboration view). As opposed to proView, it is limited to the specified view types, and it is not possible to define user-specific views.

[33] applies graph reduction to verify structural properties of process schemas. proView accomplishes this via aggregation and respective high-level operations. [34] uses SPQR-tree decomposition for abstracting process models. This approach neither provides high-level abstractions nor does it take other process aspects (e.g., data flow) into account.

[35] determines semantic similarity between activities by analysing the structural information of a process model. The discovered similarity is used to abstract the given process model. However, the approach neither distinguishes between different user perspectives on a process model nor provides concepts to manually create process views.

An approach for creating aggregated views is presented in [36]. It proposes a two-phase procedure for aggregating parts of a process model that must not be shown to the public. However, this approach focuses on block-structured graphs and considers neither data flow nor attributes.

Implementations of process views focusing on process monitoring are presented in [37, 38]. These approaches focus on the mapping of run-time information to process views. Respective views have to be pre-specified manually by the designer.

Several approaches exist that align business process models with technical workflow models [39, 40, 41, 42] and synchronize them with changes. However, neither an automated synchronization of changes nor high-level operations are provided. In this context, proView supports high-level operations to automatically create and update both business and technical process models [25].

An approach enabling abstractions of large, object-centric process structures is presented in [43]. In particular, state abstractions and coordination components are used to visualize (and execute) process structures.

The proView project provides a holistic framework for user-centric view creation based on elementary as well as high-level operations. Thereby, the behaviour and information perspectives are taken into account. Additionally, it considers run-time information. None of the existing approaches covers all these aspects. Furthermore, existing approaches for creating views are based on rigid constraints not taking practical requirements into account. For example, our first design of a visualization-oriented view mechanism was based on reduction and aggregation techniques for block-structured process graphs [20]. Presenting this solution to business users, however, we found that block-structured aggregation does not always meet the practical requirements coming with the visualization of large processes. For this reason, proView allows users to flexibly specify the acceptable degree of imprecision. A validation of proView was conducted in which users were confronted with complex, long-running development processes [20].

7. SUMMARY AND OUTLOOK
We introduced the proView framework and its formal foundation. Further, high-level view creation operations provide the required flexibility since process schemas can be adapted to specific user groups. Reduction operations provide techniques for hiding irrelevant parts of the process, whereas aggregation operations allow abstracting from process details by aggregating arbitrary sets of activities in one node. Finally, parameterization of the respective operations allows specifying the quality level the resulting view schema must comply with. This enables adaptable process visualization not feasible with existing approaches. We have implemented large parts of the described view mechanism in a prototype and evaluated it in the context of an industry project. For usability reasons, it is important to provide appropriate methods for defining and maintaining process views. Therefore, we have designed and implemented a comprehensive set of user-oriented, high-level operations as well as a view definition language. Both will be evaluated in a user experiment.

8. REFERENCES
[1] Reichert, M., Weber, B.: Enabling Flexibility in Process-Aware Information Systems - Challenges, Methods, Technologies. Springer (2012)
[2] Weber, B., Sadiq, S., Reichert, M.: Beyond Rigidity - Dynamic Process Lifecycle Support: A Survey on Dynamic Changes in Process-Aware Information Systems. Computer Science - Research and Development 23 (2009) 47–65
[3] Weber, B., Reichert, M., Mendling, J., Reijers, H.A.: Refactoring Large Process Model Repositories. Computers in Industry 62 (2011) 467–486
[4] Reichert, M.: Visualizing Large Business Process Models: Challenges, Techniques, Applications. In: 1st Int'l Workshop on Theory and Applications of Process Visualization (TAProViz'12), BPM'12 Workshops (Invited Keynote), Tallinn, Estonia (2012)
[5] Reichert, M., Kolb, J., Bobrik, R., Bauer, T.: Enabling Personalized Visualization of Large Business Processes through Parameterizable Views. In: Proc 26th Symposium On Applied Computing (SAC'12), Riva del Garda (Trento), Italy (2012) 1653–1660
[6] Streit, A., Pham, B., Brown, R.: Visualization Support for Managing Large Business Process Specifications. In: Proc 3rd Int'l Conf. Business Process Management (BPM'05) (2005) 205–219
[7] Kabicher-Fuchs, S., Rinderle-Ma, S., Recker, J., Indulska, M., Charoy, F., Christiaanse, R., Dunkl, R., Grambow, G., Kolb, J., Leopold, H., Mendling, J.: Human-Centric Process-Aware Information Systems (HC-PAIS). CoRR abs/1211.4 (2012)
[8] Tran, H.: View-Based and Model-Driven Approach for Process-Driven, Service-Oriented Architectures. TU Wien, Dissertation (2009)
[9] Chiu, D.K., Cheung, S., Till, S., Karlapalem, K., Li, Q., Kafeza, E.: Workflow View Driven Cross-Organizational Interoperability in a Web Service Environment. Inf. Techn. and Mgmt. 5 (2004) 221–250
[10] Kolb, J., Reichert, M.: Using Concurrent Task Trees for Stakeholder-centered Modeling and Visualization of Business Processes. In: Proc S-BPM ONE 2012, CCIS 284 (2012) 237–251
[11] Kolb, J., Rudner, B., Reichert, M.: Towards Gesture-based Process Modeling on Multi-Touch Devices. In: Proc 1st Int'l Workshop on Human-Centric Process-Aware Information Systems (HC-PAIS'12), Gdansk, Poland (2012) 280–293
[12] Kolb, J., Hübner, P., Reichert, M.: Automatically Generating and Updating User Interface Components in Process-Aware Information Systems. In: Proc 10th Int'l Conf on Cooperative Information Systems (CoopIS 2012) (2012) 444–454

[13] Kolb, J., Hübner, P., Reichert, M.: Model-Driven User Interface Generation and Adaptation in Process-Aware Information Systems. Technical Report UIB 2012-04, Ulm University (2012)
[14] Bobrik, R., Bauer, T., Reichert, M.: Proviado - Personalized and Configurable Visualizations of Business Processes. In: Proc 7th Int'l Conf Electronic Commerce & Web Technology (EC-WEB'06), Krakow, Poland (2006) 61–71
[15] Bobrik, R., Reichert, M., Bauer, T.: View-Based Process Visualization. In: Proc 5th Int'l Conf. on Business Process Management, Brisbane, Australia (2007) 88–95
[16] Kolb, J., Kammerer, K., Reichert, M.: Updatable Process Views for User-centered Adaption of Large Process Models. In: Proc 10th Int'l Conf. on Service Oriented Computing (ICSOC'12), Shanghai, China (2012)
[17] Reichert, M., Bassil, S., Bobrik, R., Bauer, T.: The Proviado Access Control Model for Business Process Monitoring Components. Enterprise Modelling and Inf. Sys. Arch. - An Int'l Journal 5 (2010) 64–88
[18] Rinderle, S., Bobrik, R., Reichert, M., Bauer, T.: Business Process Visualization - Use Cases, Challenges, Solutions. In: Proc 8th Int'l Conf on Enterprise Information Systems (ICEIS'06), Paphos, Cyprus (2006) 204–211
[19] Kolb, J., Reichert, M.: Data Flow Abstractions and Adaptations through Updatable Process Views. In: Proc 27th Symposium On Applied Computing (SAC'13), Coimbra, Portugal (2013) (to appear)
[20] Bobrik, R.: Konfigurierbare Visualisierung komplexer Prozessmodelle. PhD thesis, Ulm University (2008)
[21] Weber, B., Reichert, M.: Refactoring Process Models in Large Process Repositories. In: Proc 20th CAiSE, Montpellier, France (2008) 124–139
[22] Liu, D.R., Shen, M.: Workflow Modeling for Virtual Processes: an Order-Preserving Process-View Approach. Information Systems 28 (2003) 505–532
[23] Shen, M., Liu, D.R.: Discovering Role-Relevant Process-Views for Recommending Workflow Information. In: Proc 14th DEXA'03, Prague, Czech Republic (2003) 836–845
[24] Kolb, J., Kammerer, K., Reichert, M.: Updatable Process Views for Adapting Large Process Models: The proView Demonstrator. In: Proc Business Process Management 2012 Demonstration Track, Tallinn, Estonia (2012)
[25] Kolb, J., Reichert, M.: Supporting Business and IT through Updatable Process Views: The proView Demonstrator. In: Demo Track of the 10th Int'l Conference on Service Oriented Computing (ICSOC'12), Shanghai, China (2013) 460–464
[26] Cardoso, J., Mendling, J., Neumann, G., Reijers, H.A.: A Discourse on Complexity of Process Models. In: Proc Workshops 4th Int'l Conf on Business Process Management (BPM'06), Vienna, Austria (2006) 117–128
[27] Laue, R., Gruhn, V.: Complexity Metrics for Business Process Models. In: Proc 9th Int'l Conference on Business Information Systems (BIS'2006), Klagenfurt, Austria (2006) 1–12
[28] IEEE: IEEE 1471-2000 - Recommended Practice for Architectural Description for Software-Intensive Systems
[29] Schumm, D., Leymann, F., Streule, A.: Process Viewing Patterns. In: Proc 14th IEEE International EDOC Conference (EDOC 2010), IEEE Computer Society (2010) 89–98
[30] Chebbi, I., Dustdar, S., Tata, S.: The View-based Approach to Dynamic Inter-Organizational Workflow Cooperation. Data & Know. Eng. 56 (2006) 139–173
[31] Kafeza, E., Chiu, D.K.W., Kafeza, I.: View-Based Contracts in an E-Service Cross-Organizational Workflow Environment. In: Techn. E-Services (2001) 74–88
[32] Schulz, K.A., Orlowska, M.E.: Facilitating Cross-Organisational Workflows with a Workflow View Approach. Data & Knowledge Engineering 51 (2004) 109–147
[33] Sadiq, W., Orlowska, M.E.: Analyzing Process Models Using Graph Reduction Techniques. Information Systems 25 (2000) 117–134
[34] Polyvyanyy, A., Smirnov, S., Weske, M.: The Triconnected Abstraction of Process Models. In: Proc 7th Int'l Conf. Business Process Management (2009)
[35] Smirnov, S., Reijers, H.A., Weske, M.: A Semantic Approach for Business Process Model Abstraction. In: Advanced Information Systems Engineering, Springer Berlin / Heidelberg (2011) 497–511
[36] Eshuis, R., Grefen, P.: Constructing Customized Process Views. Data & Know. Engineering 64 (2008)
[37] Shan, Z., Yang, Y., Li, Q., Luo, Y., Peng, Z.: A Light-Weighted Approach to Workflow View. APWeb 2006 (2006) 1059–1070
[38] Schumm, D., Latuske, G., Leymann, F., Mietzner, R., Scheibler, T.: State Propagation for Business Process Monitoring on Different Levels of Abstraction. In: Proc. 19th ECIS, Helsinki, Finland (2011)
[39] Weidlich, M., Barros, A., Mendling, J., Weske, M.: Vertical Alignment of Process Models - How Can We Get There? In: Proc Ent., Bus. Proc. & Inf. Sys. Mod., Volume 29 of Lecture Notes in Business Information Processing, Springer (2009) 71–84
[40] Branco, M.C., Troya, J., Czarnecki, K., Küster, J., Völzer, H.: Matching Business Process Workflows Across Abstraction Levels. In: Proc 15th Int'l Conf Model Driven Engineering Languages and Systems (MODELS'12), Innsbruck, Italy (2012)
[41] Favre, C., Küster, J., Völzer, H.: The Shared Process Model. In: Demo Track 10th Int'l Conf on Business Process Management (BPM'12) (2012) 12–16
[42] Buchwald, S., Bauer, T., Reichert, M.: Bridging the Gap Between Business Process Models and Service Composition Specifications. In: Service Life Cycle Tools and Technologies: Methods, Trends and Advances, IGI Global (2011) 124–153
[43] Künzle, V., Reichert, M.: PHILharmonicFlows: Towards a Framework for Object-Aware Process Management. Journal of Software Maintenance and Evolution: Research and Practice 23 (2011) 205–244

ABOUT THE AUTHORS:

Jens Kolb obtained a Diploma in Computer Science at Ulm University, Germany, in 2007. After his graduation, he spent a couple of months working in the U.S. Afterwards, he started working as a research assistant in the Institute for Databases and Information Systems (DBIS) at Ulm University. His research focuses on human-oriented Business Process Management to support end-users in understanding, interacting with, and evolving large and complex business process models.

Manfred Reichert holds a PhD in Computer Science and a Diploma in Mathematics. Since January 2008 he has been a full professor at the University of Ulm, where he is co-director of the Institute of Databases and Information Systems as well as chairman of the examination board for bachelor and master studies in Computer Science, Media Informatics, and Software Engineering. Before that, Manfred was an associate professor in the Information Systems Group at the University of Twente, where he was also leader of the strategic research orientations on E-health and Applied Science of Services for Information Society Technologies, and a member of the management board of the Centre for Telematics and Information Technology (CTIT), one of the largest academic ICT research institutes in Europe with 475 researchers.

An Experimental Evaluation of IP4-IPV6 IVI Translation

Anthony K. Tsetse
Computer Information Systems, Livingstone College, Salisbury, NC 28147
[email protected]

Alexander L. Wijesinha, Ramesh Karne, Alae Loukili, Patrick Appiah-Kubi
Department of Computer & Information Sciences, Towson University, Towson, MD 21252
{awijesinha, rkarne, aloukili, appiahkubi}@towson.edu

ABSTRACT
While IPv6 deployment in the Internet continues to grow slowly at present, the imminent exhaustion of IPv4 addresses will encourage its increased use over the next several years. However, due to the predominance of IPv4 in the Internet, the transition to IPv6 is likely to take a long time. During the transition period, translation mechanisms will enable IPv6 hosts and IPv4 hosts to communicate with each other. For example, translation can be used when a server or application works with IPv4 but not with IPv6, and the effort or cost to modify the code is large. Stateless and stateful translation is the subject of several recent IETF RFCs. We evaluate the performance of the new IVI translator, which is viewed as a design for stateless translation, by conducting experiments in both LAN and Internet environments using a freely available implementation of IVI. To study the impact of operating system overhead on IVI translation, we implemented the IVI translator on a bare PC that runs applications without an operating system or kernel. Our results, based on internal timings in each system, show that translating IPv4 packets into IPv6 packets is more expensive than the reverse, and that address mapping is the most expensive IVI operation. We also measured packets per second in the LAN, round trip times in the LAN and Internet, IVI overhead for various prefix sizes, TCP connection time, and the delay and throughput over the Internet for various file sizes. While both the Linux and bare PC implementations of IVI have low overhead, a modest performance gain is obtained by using a bare PC.

Categories and Subject Descriptors
C.2.2 [Computer-Communication Networks]: Network Protocols – Applications, Protocol architecture; C.2.6 [Computer-Communication Networks]: Internetworking – Routers, Standards; C.4 [Performance of Systems]: Measurement techniques; D.4.4 [Operating Systems]: Communications Management – Network communication.

General Terms
Measurement and Performance.

Keywords
IPv6; IVI Translation; IPv4-IPv6 transition; bare PC; Linux.

1. INTRODUCTION
Internet Protocol version 6 (IPv6) [1] is the next generation IP protocol for the Internet. IPv6 was introduced to replace the existing Internet Protocol version 4 (IPv4). Although IPv6 provides the benefit of increased address space (16-byte addresses), its deployment in the Internet has been slow, primarily due to the established and familiar base of IPv4. Interim measures such as NAT and CIDR have also served to conserve the limited IPv4 address space. However, it is expected that IPv6 usage will increase as the number of networks and devices (such as low-power sensors) that require IP addresses continues to grow. IPv6 and IPv4 are not compatible since they use different header sizes and formats.

To enable IPv4 and IPv6 to co-exist during the transition period, several mechanisms have been proposed. Dual stack and tunneling are described in RFC 2893 [2], with an update in RFC 4213 [3] to reflect their usage in practice. Dual stack devices have implementations of both IPv4 and IPv6 protocol stacks running independently. This makes it possible for such devices to process both IPv4 and IPv6 packets. Tunneling allows IPv6 packets to be carried on IPv4 networks and vice versa. Address translation is a transition mechanism that allows communication between IPv6 devices and IPv4 devices by converting between packets of each type. Several RFCs dealing with stateless and stateful translation have recently been published by the IETF. The new IVI translator [4] is an implementation of stateless IPv4-IPv6 translation that can also support stateful packet translation (the abbreviation IVI is derived from the Roman numerals IV for 4 and VI for 6).

We evaluate the performance of IVI translation experimentally in a LAN and on the Internet. We use a freely available implementation on Linux, and our own implementation on a bare PC that has no operating system or kernel. This enables us to compare the overhead due to only the IVI translation process with the overhead measured using the Linux IVI translator.

The remainder of this paper is organized as follows. In Section 2, we briefly discuss related work. In Section 3, we describe IVI translation. In Section 4, we discuss the Linux and bare PC implementations of an IVI translator. In Section 5, we present the results of LAN and Internet experiments to evaluate IVI translation. In Section 6, we give the conclusion.

2. RELATED WORK
Several techniques have been proposed for translating between IPv4 and IPv6 packets. One such technique, Network Address Translation–Protocol Translation (NAT-PT and NAPT-PT) [5], was defined in RFC 2766. It allowed a set of IPv6 hosts to share a single IPv4 address for IPv6 packets destined for IPv4 hosts, and also allowed mapping of transport identifiers of the IPv6 hosts.

1 Copyright is held by the authors. This work is based on an earlier work: RACS'12 Proceedings of the 2012 ACM Research in Applied Computation Symposium, Copyright 2012 ACM 978-1-4503-1492-3/12/10. http://doi.acm.org/10.1145/2401603.2401646

RFC 2766 has since been recommended for historic status (in RFC 4966) [6] because of several issues. In RFC 3142 [7], the operation of IPv6-to-IPv4 transport relay translators (TRT) is discussed. TRT enables IPv6-only hosts to exchange two-way TCP or UDP traffic.

Recent work on IPv4-IPv6 translation is detailed in a group of several related RFCs. A mechanism enabling an IPv6 client to communicate with an IPv4 server (stateful NAT64) is described in RFC 6146 [8]. Translation of IP/ICMP headers in NAT64 is done based on the Stateless IP/ICMP Translation Algorithm (SIIT) of RFC 6145 [9]. NAT64 uses an address mapping algorithm discussed in RFC 6052 [10] for IPv4-IPv6 address mapping. RFC 6147 [11] specifies the DNS64 mechanism, which can be used with NAT64 to enable IPv6 clients to communicate with IPv4 servers using the fully qualified domain name of the server. Translation as a tool for IPv6/IPv4 coexistence, along with eight translation scenarios, is discussed in RFC 6144 [12]. These RFCs provide the background for the IVI translator and translation approach given in [4], whose performance is the focus of this study. Details concerning an IVI implementation for stateless and stateful packet translation, and its deployment, are provided in [13].

3. IVI TRANSLATION
IVI translation employs a prefix-specific and stateless address mapping scheme that enables IPv4 hosts to communicate with IPv6 hosts as described in RFC 6219 [4]. In this scheme, which is based on that given in [10], subsets of an ISP's IPv4 addresses are embedded in the ISP's IPv6 addresses. During translation, an address of one type (v4 or v6) is converted to the other type. To represent IPv4 addresses in IPv6, the ISP inserts its IPv4 address prefix after a unique IPv6 prefix consisting of PL bits, where PL = 32 (4), 40 (5), 48 (6), 56 (7), 64 (8), or 96 (12); the number in parentheses is the number of bytes p = PL/8. The representation for p < 12 is shown in Fig. 1, where v4p and v4s respectively denote the prefix and suffix of the IPv4 address and u is a byte of all 0s (the length of each field in bytes appears in parentheses). For example, if p = 5 bytes (i.e., PL = 40 bits), the PREFIX, v4p, u, v4s, and SUFFIX fields consist of 5, 3, 1, 1, and 6 bytes respectively. If p = 12, the PREFIX field is followed immediately by the entire IPv4 address (u is not present).

  PREFIX (p) | v4p (8-p) | u (1) | v4s (p-4) | SUFFIX (11-p)

  Figure 1. Representing IPv4 ISP addresses in IPv6 (field lengths in bytes)

Thus, an ISP with an IPv6 /32 address can have a prefix of /40, where bits 32 through 39 are set to all ones. Similarly, /24 IPv4 addresses are translated into /64 IPv6 addresses. The suffixes of these IPv6 addresses are normally set to all zeros. So with this mapping scheme [4], an ISP with the /32 IPv6 address 2001:dba:: will translate the IPv4 address 202.38.97.205/24 into the equivalent IPv6 address 2001:dba:ffca:2661:cd::.
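To make the layout of Fig. 1 concrete, the following self-contained sketch (ours, not the translator's code) builds the IVI-mapped IPv6 address for the prefix lengths above: for PL < 96, the leading IPv4 bits are placed directly after the IVI prefix up to bit 63, the u byte at bits 64-71 stays zero, and the remaining IPv4 bits follow it.

```python
import ipaddress

def ivi_map(ipv4_addr, ivi_prefix, pl_bits):
    """Embed an IPv4 address into an IVI IPv6 address (suffix all zeros)."""
    v4 = int(ipaddress.IPv4Address(ipv4_addr))
    value = int(ipaddress.IPv6Address(ivi_prefix))  # PREFIX bits already set
    if pl_bits == 96:
        return ipaddress.IPv6Address(value | v4)    # u byte is not present
    head = 64 - pl_bits                             # IPv4 bits placed before u
    v4p, v4s = v4 >> (32 - head), v4 & ((1 << (32 - head)) - 1)
    value |= v4p << (128 - pl_bits - head)          # v4p ends at bit 63
    value |= v4s << (56 - (32 - head))              # v4s starts at bit 72
    return ipaddress.IPv6Address(value)

# Reproduces the example above: with the /40 IVI prefix 2001:dba:ff00::,
# 202.38.97.205 maps to 2001:dba:ffca:2661:cd::.
print(ivi_map("202.38.97.205", "2001:dba:ff00::", 40))
```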
Translation of IPv4 and IPv6 headers is done as prescribed in [9], with the source and destination addresses translated according to the address mapping scheme described above. After translating the transport layer (TCP or UDP) and ICMP headers, their checksums are recomputed to reflect the changes in the IP header. The IP header mapping scheme specified in [4] enables an IPv6 header to be constructed from an IPv4 header, and conversely, in the following manner (in addition to converting source and destination IPv4 or IPv6 addresses into their IVI mapped equivalents). To translate an IPv4 header into an IPv6 header, the IHL, identification, flags, offset, header checksum, and options are discarded; the version (0x4), TOS, protocol, and TTL fields are mapped into version (0x6), traffic class, next header, and hop limit respectively; and the value of total length is decremented by 20 and assigned to payload length. Conversely, to translate an IPv6 header into an IPv4 header, the flow label is discarded; the version (0x6), traffic class, next header, and hop limit fields are mapped into version (0x4), TOS, protocol, and TTL respectively; the value of payload length is incremented by 20 and assigned to total length; IHL is set to 5; and the remaining fields in the IPv4 header are assigned in the usual way (such as generating a value for the identification field and computing the IPv4 checksum). For convenience, we include Tables 1 and 2, taken from [4], which summarize the key elements of header translation from IPv4 to IPv6 and from IPv6 to IPv4 respectively.

Table 1. IPv4-to-IPv6 Header Translation (from [4])

  IPv4 Field            IPv6 Equivalent
  Version (0x4)         Version (0x6)
  IHL                   Discarded
  Type of Service       Traffic Class
  Total Length          Payload Length = Total Length - 20
  Identification        Discarded
  Flags                 Discarded
  Offset                Discarded
  TTL                   Hop Limit
  Protocol              Next Header
  Header Checksum       Discarded
  Source Address        IVI address mapping
  Destination Address   IVI address mapping
  Options               Discarded

Table 2. IPv6-to-IPv4 Header Translation (from [4])

  IPv6 Field            IPv4 Equivalent
  Version (0x6)         Version (0x4)
  Traffic Class         Type of Service
  Flow Label            Discarded
  Payload Length        Total Length = Payload Length + 20
  Next Header           Protocol
  Hop Limit             TTL
  Source Address        IVI address mapping
  Destination Address   IVI address mapping
  (none)                IHL = 5
  (none)                Header Checksum recalculated
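The field mappings in Tables 1 and 2 are mechanical. A sketch of the IPv4-to-IPv6 direction over a parsed header is shown below; the dictionary keys and function name are our own, not those of the IVI implementations, and ivi_map is the address-mapping sketch above.

```python
def ipv4_to_ipv6_header(h4, ivi_prefix, pl_bits):
    """Apply Table 1 to an IPv4 header given as a dict. IHL, identification,
    flags, offset, options, and the IPv4 header checksum are discarded."""
    return {
        "version": 6,                               # 0x4 -> 0x6
        "traffic_class": h4["tos"],                 # TOS -> traffic class
        "flow_label": 0,                            # no IPv4 counterpart
        "payload_length": h4["total_length"] - 20,  # drop 20-byte IPv4 header
        "next_header": h4["protocol"],              # protocol -> next header
        "hop_limit": h4["ttl"],                     # TTL -> hop limit
        "src": ivi_map(h4["src"], ivi_prefix, pl_bits),
        "dst": ivi_map(h4["dst"], ivi_prefix, pl_bits),
    }
# The reverse direction (Table 2) adds 20 to the payload length, sets IHL
# to 5, discards the flow label, and regenerates the identification field
# and the IPv4 header checksum.
```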

4. IMPLEMENTATION
4.1 Linux Implementation
The Linux implementation of the IVI translator [14] has two main components: a configuration utility and a translation utility. The configuration utility is used to set up the routes using the mroute or mroute6 commands. The translation utility is responsible for translating the packets. To run the IVI translator on Linux, a patch has to be applied to the Linux kernel source. Once the patch is applied, the Linux kernel can be configured for IVI translation after recompiling the kernel. Patching the kernel results in some kernel source files being changed, and four other files (mapping.c, proto_trans.c, mroute.c and mroute6.c) being added for configuration and translation.

The functions required for translation referred to in the discussion below are all in the file proto_trans.c. When an IPv4 packet is received, the function mapping_address4to6 is invoked to map the IPv4 address to an IPv6 address. Based on the value of the protocol field in the received IPv4 header, icmp4to6_trans, tcp4to6_trans or udp4to6_trans is invoked to translate the appropriate protocol. Each of these methods calls a function that computes the appropriate higher layer protocol (ICMP, TCP, or UDP) checksum as well. Once the protocol translation is done, iphdr6to4_trans is called to translate the IP header. For translating IPv6 to IPv4 packets, a similar process is used.
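The checksum recomputation mentioned above follows the standard Internet checksum (RFC 1071); a generic reference implementation is shown below for illustration (this is not the code in proto_trans.c).

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over a byte string, as needed when the
    IPv4 header checksum (and the TCP/UDP/ICMP checksums over the new
    pseudo-headers) must be regenerated after translation."""
    if len(data) % 2:                       # pad odd-length data with a zero
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                      # fold carries into low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                  # ones' complement
```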

4.2 Bare PC Implementation
In a bare PC, applications run without the support of any operating system or kernel. Bare PC applications are built based on the bare machine computing (BMC) paradigm, earlier called the dispersed operating system computing (DOSC) paradigm [15]. Bare PC applications include Web servers [16], email servers [17], SIP servers [18], and split protocol servers [19]. The first bare PC application with IPv6 and IPv4 capability was a softphone used for studying VoIP performance [20]. This softphone was built by adding IPv6 capability to the bare PC softphone application described in [21].

The IVI translator application that runs on a bare PC was built by adding IVI translation capability on top of lean bare PC versions of the IPv6 and IPv4 protocols [22, 23]. The bare PC translator application includes two Ethernet objects, ETH0 and ETH1, that directly communicate with its two network interface cards (NICs), each connecting to one of the two IP networks. A key difference between the Linux and bare PC implementations is that the IP protocols, address mappings, and protocol mappings in the latter are implemented as part of a single object called the XLATEOBJ. The architecture of the bare PC IVI translator is shown in Fig. 2.

Figure 2. Bare PC IVI architecture (ETHOBJ with xto6/xto4 handlers, XLATEOBJ, and the RCV, GW, and MAIN components)

The MAIN and RECEIVE (RCV) tasks are the only tasks running in the bare PC IVI translator application. The MAIN task runs when the system is started and whenever the RCV task is not running. The RCV task is invoked from the MAIN task when a packet (Ethernet frame) arrives on either interface. The capability to send out router advertisements on the IPv6 interface of the bare PC IVI translator is implemented in the GW object as part of the MAIN task. In the XLATEOBJ, the xto4handler and xto6handler methods respectively translate between IPv6 packets and IPv4 packets. To map the source and destination IPv6 addresses to the corresponding IPv4 addresses, the xto4handler invokes the addr6to4 method.
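The MAIN/RCV task structure can be pictured as a simple polling loop. The sketch below is our schematic reading of that description; every helper is an injected parameter, not a bare PC API.

```python
def main_task(interfaces, rcv_task, send_ra_if_due):
    """MAIN runs whenever RCV is not running; RCV is invoked when an
    Ethernet frame arrives on either interface. Router advertisements on
    the IPv6 side (the GW object's duty) are sent from MAIN."""
    while True:
        frame = next((f for f in (nic.poll() for nic in interfaces) if f), None)
        if frame is not None:
            rcv_task(frame)      # RCV translates and forwards the frame
        else:
            send_ra_if_due()     # periodic IPv6 router advertisements
```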

The processing logic of the bare PC IVI translator is shown in Fig. 3. The translator initially checks the arriving packet to determine whether it is an IPv6 or IPv4 packet. If the packet is an IPv6 packet, the payload protocol type is determined from the next header field, and the appropriate protocol checksum is computed. If the checksum is valid, the IPv6 packet is translated into an IPv4 packet and forwarded. If the packet is an IPv4 packet, the IP checksum is computed. If the checksum is valid, the packet is translated into an IPv6 packet based on the protocol type in the packet and forwarded after validating the higher layer protocol checksum as needed. If any checksum fails, the packet is dropped.

Figure 3. Bare PC IVI packet processing logic
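Restated as code, the Fig. 3 decision logic looks as follows; the translate and checksum helpers stand in for the XLATEOBJ methods, and their names and signatures here are assumptions:

```python
def process_packet(pkt, h):
    """h bundles helper callbacks: is_ipv6, ip_checksum_ok,
    protocol_checksum_ok, xto4, xto6, forward, and drop."""
    if h.is_ipv6(pkt):
        # IPv6 packet: validate the payload (TCP/UDP/ICMP) checksum,
        # then translate to IPv4 and forward.
        if not h.protocol_checksum_ok(pkt):
            return h.drop(pkt)
        return h.forward(h.xto4(pkt))
    # IPv4 packet: validate the IP header checksum and, as needed, the
    # higher layer checksum before translating to IPv6 and forwarding.
    if not (h.ip_checksum_ok(pkt) and h.protocol_checksum_ok(pkt)):
        return h.drop(pkt)
    return h.forward(h.xto6(pkt))
```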

5. EXPERIMENTAL RESULTS
5.1 LAN Experiments
5.1.1 LAN Environment
Fig. 4 shows the LAN used for testing. This LAN was used to generate TCP data; a similar LAN was used to generate UDP and ICMP data. In the figure, S1, S2, S3 and S4 are 100 Mbps Ethernet switches, and R1 and R2 are IPv4 and IPv6 routers respectively. Router R1 serves the IPv4 routing domain, which consists of an IPv4 client C4 and an IPv4 server SV4. On the IPv6 side, C6 is an IPv6 client and SV6 is an IPv6 server. C6 and SV6 obtain the necessary routing information from the IPv6 router R2. The IVI translator is represented by XT.

Figure 4. Test network for LAN experiments (HTTP/TCP)

The translator runs on a Dell Optiplex GX270 desktop with an Intel Pentium IV 2.4 GHz processor, 512 MB RAM, and Intel PRO 10/100 and 3Com 10/100 NICs. The translator either runs with Fedora Linux (Fedora 12, Linux kernel 2.6.31) as the operating system, or as a bare PC application with no operating system or kernel. The client and server systems run on Dell Optiplex GX520s with a 2.6 GHz processor, 1 GB RAM, and an Intel PRO/1000 NIC, with Fedora Linux as the operating system. Apache HTTP Server 2.2.16 and Mozilla Firefox v3.6.7 were used respectively as the Web server and Web browser.

To capture timings during translation with HTTP/TCP data, connections were made as follows. For IPv4 to IPv6 translation, the IPv4 client (C4) initiated a connection via the browser to the IPv6 Web server (SV6) in the IPv6 network. Router R1 forwarded the IPv4 request for a 320 KB file to the IVI translator XT, which translated the received packets to IPv6 packets and sent them to the IPv6 Web server (SV6) via the IPv6 router R2. For IPv6 to IPv4 translation, the IPv6 client (C6) similarly initiated a connection via the browser to the IPv4 Web server (SV4). Router R2 forwarded this request to the IVI translator XT. Each packet was translated to an IPv4 packet and forwarded to the IPv4 router R1, which delivered it to the IPv4 server (SV4).

5.1.2 LAN Results
The internal timing data presented in this section was captured by inserting timing points in the Linux (Fedora) and bare PC source code. The TCP traffic was generated by sending HTTP requests from clients to servers using the LAN in Fig. 4 as described earlier. Using a similar LAN, Mgen [24] was used to generate UDP traffic, and the ping/ping6 utility was used to generate ICMP packets. Each experiment was run three times and the average data is reported (we omit the deviation since the differences in the values for each experiment were small).

Fig. 5 shows the time it takes internally to translate packets using the Linux implementation of the IVI translator. It is seen that IP header translation is the most expensive operation for both IPv6-IPv4 and IPv4-IPv6 translation, while UDP, TCP, and ICMP translation take almost the same translation time. The header translation time is larger since it includes both address and protocol (TCP, UDP or ICMP) translation.

Figure 5. Linux translation overhead

It is also evident that the IP address translation time is larger than the UDP, TCP, and ICMP header translation time since the higher layer translations involve very little processing. Comparing translations between the IP versions in each direction, the IP address translation time is 3.9 µs and 0.5 µs higher than the TCP, UDP, and ICMP translation time for the IPv4 to IPv6 and IPv6 to IPv4 translations respectively. The larger time for the IPv4 to IPv6 address translation compared to the reverse is because the IPv6 address needs to be constructed by extending the IPv4 address, whereas the IPv4 address is simply extracted from the IPv6 address.

Fig. 6 shows the corresponding IVI translation times on a bare PC. The time for IP header translation from IPv4 to IPv6 is 15.8 µs compared to 12 µs for translation from IPv6 to IPv4. TCP and ICMP translation take almost the same time: about 2.5 µs for IPv4 to IPv6 translation and 2.0 µs for IPv6 to IPv4 translation. For IP address mapping, the processing time is 5 µs from IPv4 to IPv6, and 2 µs from IPv6 to IPv4. For UDP, translating from IPv4 to IPv6 and from IPv6 to IPv4 takes 2 µs and 1.5 µs respectively.

Figure 6. Bare PC translation overhead

Comparing the times for IVI translation from IPv4 to IPv6 for the Linux and bare PC implementations (using Figs. 5 and 6), it can be seen that 1) for TCP, UDP and ICMP translations, the time for the bare PC is about 2.3 µs less than for the Linux implementation; 2) for IP address translation, the processing time for the bare PC is about 3.9 µs less than for the Linux implementation; and 3) for IP header translation, the processing time for the bare PC is 15 µs less than for the Linux implementation.

Similarly, comparing the times for IVI translation from IPv6 to IPv4 for the Linux and bare PC implementations, it can also be seen that 4) for TCP and UDP traffic, the improvement in processing time due to using a bare PC is 2.2 µs; 5) for ICMP, the improvement in processing time is 3.2 µs; and 6) for IP address mapping, it is 2.6 µs.

To compare packets per second (pps) processed by the IVI translator for IPv4 to IPv6 translation and IPv6 to IPv4 translation, the Mgen generator was used to make TCP connections between source and sink. The results when transferring 1024-byte packets, for which the throughput is at a maximum, are shown in Fig. 7. As expected, pps for IPv4 to IPv6 translation is less than for IPv6 to IPv4 translation, which reflects the decrease in processing time for the latter; the drop in pps for IPv4 to IPv6 translation compared to IPv6 to IPv4 translation is about 18% for Linux and about 13% for the bare PC. The improvement due to using a bare PC instead of Linux is about 17% for IPv4 to IPv6 translation and about 9% for IPv6 to IPv4 translation.

Figure 7. Packets per second (pps)

We also compared the round trip times for 1024-byte ICMP ping packets. Request packets undergo IPv4 to IPv6 translation and the reply packets undergo IPv6 to IPv4 translation. The results are shown in Fig. 8 and indicate that the round trip time is reduced by about 17% on a bare PC.

Figure 8. Ping round trip time (RTT)

These results show that the overhead of IVI translation for both the Linux and bare PC implementations is small. However, there is a small improvement in processing time (and hence, packets processed per second) when using a bare PC.

5.2 Internet Experiments
5.2.1 Internet Environment
Fig. 9 shows the network used to enable a client to connect to servers on the Internet via the IVI translator. One end of the test network, with the client and translator, was set up at Livingstone College (LC) in Salisbury, North Carolina. LC's network only allows IPv4 connectivity.

As shown in the figure, our IPv6 client (CL) connected to the LC network communicates with three different IPv4 Web servers (S1, S2, and S3) on the Internet. In accessing the Web servers, their IP addresses were used instead of the domain names since most ISPs on the Internet cannot resolve IVI mapped addresses. An HTTP request from the IPv6 client CL is initially routed to our default IPv6 router (RT6), which forwards the request to the IVI translator (XT). The IVI translator then translates the IPv6 packets to their equivalent IPv4 packets, and forwards the packets out on its IPv4 interface to LC's IPv4 border router LC4. These packets are seen as ordinary IPv4 packets, and the border router forwards them to the Internet. When LC4 receives the response from the Internet, it forwards the IPv4 packets to the IVI translator. The translator maps each packet to its IPv6 equivalent, and forwards it to the IPv6 router RT6, which delivers it to the IPv6 client CL. The client CL and the router RT6 run on Fedora Linux 18 (kernel 3.6.7). The Linux IVI translator runs Fedora Linux 6. The same experiments were repeated with the bare PC IVI translator.

Figure 9. Test network for Internet experiments

5.2.2 Internet Results
Figs. 10 and 11 respectively show the IVI translation overhead using the Linux and bare PC IVI implementations when IPv4 addresses are mapped to IPv6 addresses with varying prefix lengths. We used prefix lengths ranging from 32 to 64 bits in increments of 8 bits, with IPv4 addresses represented in IPv6 as shown in Fig. 1. It can be seen that the translation overhead is lowest for a prefix length of 32 bits. With prefix lengths of 40 bits or more, the translation overhead is slightly higher for both the Linux and bare PC implementations. This can be explained by the fact that when mapping prefix lengths less than 40 bits as specified in [10], the IPv4 addresses appear within the IPv6 addresses in a contiguous manner, which results in relatively less processing overhead. Mapping IPv4 addresses to IPv6 addresses is more expensive than the reverse irrespective of whether the Linux or bare PC implementation is used.

Figure 10. Linux translation with different prefix lengths

Figure 11. Bare PC translation with different prefix lengths

Figs. 12 and 13 show the respective round trip times (RTTs) for the Linux and bare translators obtained by pinging three different Internet Web sites with different hop counts: www.itu.dk (Site 1), www.fh-offenburg.de (Site 2), and www.knust.edu.gh (Site 3). We compare RTTs with IVI translation and for a direct request using IPv4. Requests to Site 1 have the highest RTT, followed by Site 2 and Site 3. This is because the hop count for Site 1 is higher than for Site 2 and Site 3. It is also evident that the RTT when using IVI can be between a few milliseconds and about 15 milliseconds higher compared to a direct request to a site using IPv4. RTTs for Linux are higher than for the bare PC, as expected. These results provide an estimate of the extra overhead due to IVI translation.

Figure 12. Linux RTTs for three Web sites

Figure 13. Bare PC RTTs for three Web sites

Figs. 14 and 15 show the respective connection times for the Linux and bare translators using HTTP requests to the same three Web sites. Connection time is defined as the time for a TCP connection to be established. The connection times for Linux range from 42-121 ms for IPv4-IPv4 requests and from 46-126 ms for IVI translated requests. The corresponding connection times for the bare PC range from 40-118 ms for IPv4-IPv4 requests and from 44-124 ms for IVI translated requests.

Figure 14. Linux connection time for three Web sites

Figs. 16 and 17 show the internal timings for the various IVI translation functions using the Linux and bare PC implementations respectively, with traffic to and from the Internet. It is seen that irrespective of the translation function and implementation, IPv4 to IPv6 translation has a higher overhead compared to IPv6 to IPv4 translation. The results for Internet traffic are very similar to those for LAN traffic (Figs. 5 and 6), as would be expected.

We also conducted experiments over the Internet to evaluate IVI performance when the IPv6 LC client (in Salisbury, North Carolina) makes HTTP requests for files of sizes 104, 318, and 531 KB respectively from an IPv4 Apache Web server running at Towson University (TU) in Towson, Maryland.

Figure 15. Bare PC connection time for three Web sites

Figure 16. Linux Translation Time

Figure 17. Bare Translation Time

Figs. 18-20 respectively compare the results for connection time, delay, and throughput using the Linux and bare PC IVI implementations. The connection time is defined as for Figs. 14 and 15, and is therefore essentially the same for all three file sizes. Delay is defined as the time between the TCP SYN request and the last ACK, and throughput is the data rate for the duration of the TCP connection. As expected, the larger files have the higher delays; they also have higher throughput. The throughput, which was obtained using a Wireshark packet analyzer [25] connected to the local LC switch, is the total throughput; it includes all the bytes (headers, data, and acks) seen by Wireshark. The measured throughput is therefore close to (but not identical to) the calculated throughput in Fig. 20, which is the file size divided by the delay. The performance difference between the bare PC and Linux IVI implementations for these Internet experiments is seen to be relatively small (as seen above for the LAN experiments).

Figure 18. Connection time (LC to TU)

Figure 19. Delay for various file sizes (LC to TU)

6. CONCLUSION
IVI translation is a technique recently proposed by the IETF to enable communication between hosts on IPv4 and IPv6 networks. We determined the overhead due to IVI translation by measuring the internal timings for both IPv6 to IPv4 and IPv4 to IPv6 translation on an ordinary desktop running Fedora Linux. We also implemented an IVI translator as a bare PC application on the same machine with no operating system or kernel, and compared the corresponding timings.

Several experiments were conducted in LAN and Internet environments to evaluate IVI translation performance. The results show that in general it is more expensive to translate packets from IPv4 to IPv6 than from IPv6 to IPv4, with a 4.1 µs difference on Linux and a 3.8 µs difference on a bare PC. It was found that address mapping is the most expensive operation in IVI translation regardless of the system on which the translator is implemented. IVI translation has little overhead for both the Linux and bare PC implementations regardless of the higher layer protocol carried in the payload. However, eliminating the operating system overhead would enable more efficient translation. The leftover CPU cycles could be used to process more packets, or for enhanced security-related processing during packet translation.

Figure 20. Throughput for various file sizes (LC to TU)

7. ACKNOWLEDGMENTS
We thank Bob Huskey for providing us with the code for his bare PC IPv4-IPv6 gateway implementation.

8. REFERENCES
[1] Deering, S., and Hinden, R. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460, Dec. 1998.
[2] Gilligan, R., and Nordmark, E. Transition Mechanisms for IPv6 Hosts and Routers. RFC 2893, Aug. 2000.
[3] Nordmark, E., and Gilligan, R. Basic Transition Mechanisms for IPv6 Hosts and Routers. RFC 4213, Oct. 2005.
[4] Li, X., et al. The China Education and Research Network (CERNET) IVI Translation Design and Deployment for the IPv4/IPv6 Coexistence and Transition. RFC 6219, May 2011.
[5] Tsirtsis, G., and Srisuresh, P. Network Address Translation – Protocol Translation (NAT-PT). RFC 2766, Feb. 2000.
[6] Aoun, C., and Davies, E. Reasons to Move the Network Address Translator - Protocol Translator (NAT-PT) to Historic Status. RFC 4966, Jul. 2007.
[7] Hagino, J., and Yamamoto, K. An IPv6-to-IPv4 Transport Relay Translator. RFC 3142, Jun. 2001.
[8] Bagnulo, M., Matthews, P., and van Beijnum, I. Stateful NAT64: Network Address and Protocol Translation from IPv6 Clients to IPv4 Servers. RFC 6146, Apr. 2011.
[9] Li, X., Bao, C., and Baker, F. IP/ICMP Translation Algorithm. RFC 6145, Mar. 2011.
[10] Bao, C., Huitema, C., Bagnulo, M., Boucadair, M., and Li, X. IPv6 Addressing of IPv4/IPv6 Translators. RFC 6052, Oct. 2010.
[11] Bagnulo, M., Sullivan, A., Matthews, P., and van Beijnum, I. DNS64: DNS Extensions for Network Address Translation from IPv6 Clients to IPv4 Servers. RFC 6147, Apr. 2011.
[12] Baker, F., Li, X., Bao, C., and Yin, K. Framework for IPv4/IPv6 Translation. RFC 6144, Apr. 2011.
[13] Zhai, Y., Bao, C., and Li, X. Transition from IPv4 to IPv6: A Translation Approach. Proc. Sixth IEEE International Conference on Networking, Architecture and Storage, 30-39.
[14] Source code of the IVI implementation for Linux: http://linux.ivi2.org/impl/, accessed 05/10/2012.
[15] Karne, R. K., Jaganathan, K. V., Rosa, N., and Ahmed, T. DOSC: dispersed operating system computing. Companion to Proc. 20th annual ACM SIGPLAN Object Oriented Programming Systems and Applications Conference (OOPSLA '05), 2005, 55-62.
[16] He, L., Karne, R. K., and Wijesinha, A. L. The design and performance of a bare PC Web server. International Journal of Computers and their Applications, 15(2):100–112, 2008.
[17] Ford, G. H., Karne, R. K., Wijesinha, A. L., and Appiah-Kubi, P. The design and implementation of a bare PC email server. Proc. 33rd annual IEEE International Computer Software and Applications Conference (COMPSAC '09), 2009, 480-485.
[18] Alexander, A., Yasinovskyy, R., Wijesinha, A., and Karne, R. SIP server implementation and performance on a bare PC. International Journal on Advances in Telecommunications, 4(1 and 2):82–92, 2011.
[19] Rawal, B. S., Karne, R. K., and Wijesinha, A. L. Splitting HTTP requests on two servers. Proc. Third International Conference on Communication Systems and Networks (COMSNETS), 2011, 1-8.
[20] Yasinovskyy, R., Wijesinha, A. L., Karne, R. K., and Khaksari, G. A comparison of VoIP performance on IPv6 and IPv4 networks. Proc. IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), 2009, 603-609.
[21] Khaksari, G. H., Wijesinha, A. L., Karne, R. K., He, L., and Girumala, S. A peer-to-peer bare PC VoIP application. Proc. Fourth IEEE Consumer Communications and Networking Conference (CCNC), 2007, 803-807.
[22] Tsetse, A. K., Wijesinha, A. L., Karne, R. K., and Loukili, A. L. A 6to4 gateway with co-located NAT. Proc. IEEE International Conference on Electro/Information Technology, 2012, 1-6.
[23] Tsetse, A. K., Wijesinha, A. L., Karne, R. K., and Loukili, A. L. A bare PC NAT box. Proc. International Conference on Communications and Information Technology, 2012, 281-285.
[24] Mgen, http://cs.itd.nrl.navy.mil, accessed 05/10/2012.
[25] Wireshark, http://www.wireshark.org, accessed 02/18/2013.

ABOUT THE AUTHORS:

Anthony K. Tsetse graduated from Towson University in 2012 with a doctorate in Information Technology. He received a B.S. in Computer Science from the Kwame Nkrumah University of Science and Technology, Kumasi, Ghana; an M.S. in Communication and Media Engineering from the Offenburg University of Applied Sciences, Germany; and an M.S. in Information Technology from the IT University of Copenhagen, Denmark. He is an Assistant Professor in the Department of Computer Information Systems at Livingstone College. His current research interests include bare machine computing, network performance analysis, cloud computing, wireless networks, and network security.

Alexander L. Wijesinha is a Professor in the Department of Computer and Information Sciences at Towson University. He holds a Ph.D. in Computer Science from the University of Maryland Baltimore County, and both an M.S. in Computer Science and a Ph.D. in Mathematics from the University of Florida. He received a B.S. in Mathematics from the University of Colombo, Sri Lanka. His research interests are in computer networks including wireless networks, VoIP, network protocol adaptation for bare machines, network performance, and network security.

Ramesh K. Karne is a Professor in the Department of Computer and Information Sciences at Towson University. He obtained his Ph.D. in Computer Science from George Mason University. Prior to that, he worked with IBM at many locations in hardware, software, and architecture development for mainframes. He also worked as a research scientist at the Institute for Systems Research at the University of Maryland, College Park. His research interests are in bare machine computing, networking, computer architecture, and operating systems.

Alae Loukili holds a doctorate in Information Technology and a Master's degree in Computer Science from Towson University. He also holds a Master's degree in Electrical Engineering from l'Ecole Nationale Supérieure d'Electricité et de Mécanique de Casablanca. He is currently a faculty member in the Department of Computer and Information Sciences at Towson University. His research interests are in computer networks and data communications, wireless communications, and bare machine computing.

Patrick Appiah-Kubi earned a doctorate in Information Technology from Towson University, an MS in Electronics and Computer Technology from Indiana State University, and a BS in Computer Science from Kwame Nkrumah University of Science and Technology, Ghana. He is a lecturer in the Computer and Information Sciences Department at Towson University. He previously worked as a system engineer and IT consultant at MVS Consulting in Washington DC. His research interests are in bare machine computing systems, networks, and security.

Fairness Scheduler for Virtual Machines on Heterogeneous Multi-Core Platforms

Chi-Sheng Shih, Jie-Wen Wei, Shih-Hao Hung, Norman Chang and Joen Chen
Institute of Information Industry, Taipei, Taiwan
National Taiwan University, Taipei, Taiwan
[email protected]

ABSTRACT
Heterogeneous multi-core processors are now widely deployed to meet the computation requirements of multimedia applications on embedded mobile devices. However, due to the differences in computation capability among heterogeneous cores, it is challenging to share the load among the cores and to better utilize the cores on the processor. In this work, we develop a fairness scheduler to share the load among cores to meet this challenge. The developed framework ensures that each virtual machine receives its proportional share of computation time over different processing elements. To fairly schedule the tasks on the platform, the VM-aware fair scheduler (VMAFS) takes into account the preemptibility of processes on co-processors. A family of algorithms is designed to schedule tasks on preemptive and non-preemptive processing elements. We define fairness on such architectures and use this metric to efficiently drive the VMAFS algorithm so as to fairly manage non-preemptive computing resources. Performance evaluation results show that when the VMAFS algorithm is used, fairness is not sensitive to the amount of computation time of non-preemptive tasks. In addition, the VMAFS algorithm greatly outperforms the credit-based scheduling algorithm deployed in existing virtualization environments.

Categories and Subject Descriptors
C.3 [Information Systems Applications]: Special-Purpose and Application-Based Systems – Real-time and embedded systems, Microprocessor/microcomputer applications; D.4.1 [Software]: Operating Systems – Process Management (Scheduling, Synchronization)

General Terms
Performance and Theory.

Keywords
Fairness Scheduling.

1. INTRODUCTION
Multi-core DSP SoCs have the capability of processing application-specific data for embedded real-time systems, and have been adopted in many mission-critical systems and consumer electronic systems. Such platforms are available from commercial vendors and research institutes. The OMAP platform from Texas Instruments [24], the NaviEngine platform from NEC Electronics [22], Tesla processors from nVidia [18], the C6000 High Performance Multi-core DSP from Texas Instruments Inc., the MSC8156 from FreeScale, and the PAC DUO from ITRI [16] are a few examples. With the increasing computation capacity of such platforms, many embedded system developers are now merging/migrating several stand-alone embedded real-time applications onto one shared platform to take advantage of the high computation capacity, low cost, and low maintenance overhead of heterogeneous multi-core platforms. Redesigning and re-implementing existing applications for new platforms has its own merits: better-optimized resource management and global schedulability analysis. However, this approach suffers from long development cycles, intractable schedulability analysis, and skyrocketing development cost.

Another candidate solution is to migrate existing embedded real-time applications to virtual machines. Figure 1 shows the system architecture of such platforms. In this architecture, the host platform is a heterogeneous multi-core SoC. It consists of multi-core general-purpose processors such as the processors from ARM and Intel, and special-purpose processors such as MIPS cores, DSP cores, GPU cores, hardware accelerators, network adaptors, storage devices, etc. On the host platform, a thin operating system such as a micro-kernel is deployed to abstract the details of hardware components and manage concurrent access to the physical devices. On the virtualization layer, the hypervisor, also called the virtual machine manager (VMM), provides virtualized hardware platforms to guest operating systems, and maps/manages the requests from virtualized hardware components to physical hardware components on the host platform. Not all the virtual machines are identical: each virtual machine consists of virtual hardware components which can be mapped onto hardware components on the host platform. The resource manager in the hypervisor assures that the resources are correctly shared among the virtual machines. On the virtual machines, one can deploy different guest operating systems, such as µC/OS-II to host embedded applications, Windows from Microsoft to host user interface applications, or real-time operating systems to assure time-sensitive communication. The advantages of this solution include (1) resource isolation, (2) low migration overhead, and (3) low hardware cost. Virtualization technologies serve as alternatives to manage computing resources in cloud computing and desktop computing; examples are the virtualization software from VMware and Xen [25]. Furthermore, several commercial and research projects demonstrate that the solution is feasible for real-world applications. Examples are LynuxSecure from LynuxWorks Inc. [3], the OKL4 Microvisor from Open Kernel Lab [19], and the distributed resource kernel developed at CMU [14].

1 Copyright is held by the authors. This work is based on an earlier work: RACS'12 Proceedings of the 2012 ACM Research in Applied Computation Symposium, Copyright 2012 ACM 978-1-4503-1492-3/12/10. http://doi.acm.org/10.1145/2401603.2401691

Figure 1. System architecture for heterogeneous multi-core DSP SoCs

Although virtualization technologies are widely adopted in many applications, existing technologies focus on general-purpose processors and do not support co-processor virtualization. For instance, GPU and CPU virtualization have been studied for years. However, for embedded real-time systems, much of the computation is carried out by DSP cores and ASICs on such platforms. These two types of co-processors do not benefit from existing virtualization frameworks, for the following reasons. First of all, most DSPs and ASICs use specially designed instruction sets, so the virtualization layer has difficulty providing hardware abstraction models specific to DSPs and ASICs. Second, translating binary instructions into target instruction sets incurs a great amount of computation overhead and is not practical for embedded real-time applications.

We propose to take advantage of both OpenCL and virtualization technology. In our approach, heterogeneous processors are shared and accessed from guest virtual machines through the hypervisor. To take advantage of the computation capacity of heterogeneous multi-core SoC platforms, we propose to integrate virtualization technology with the OpenCL framework [2] to share the computation capacity on the platforms. OpenCL is an open standard that supports run-time task dispatch on heterogeneous multi-core platforms. Tasks programmed in OpenCL can be compiled into different instruction sets, depending on the processing core allocated to execute the task. In the framework, we design and implement a VM-aware fairness scheduler, called VMAFS for short, for co-processors such as DSPs and hardware accelerators. In the OpenCL framework, one computing device receives requests from different applications. When the applications are executed on the virtual machines of the heterogeneous multi-core SoC platform shown in Figure 1, the virtual machine manager allocates a fair amount of computation resources to the virtual machines and, thus, to the applications. In addition, our VMAFS offers fairness properties on non-preemptive architectures, which are adopted in many specialized co-processors and accelerators. Although much research has been done on achieving task-level fairness, little of it considers achieving VM-level fairness, in which computation resources are allocated to VMs in proportion to their predefined weights. Meanwhile, to the best of our knowledge, proportional scheduling fairness across heterogeneous processors has not been studied, and it is of increasing importance given the growing popularity of the OpenCL standard.

The remainder of this paper is organized as follows. Section 2 presents related work and the background for the OpenCL framework and virtualization technology. Section 2 also describes the system architecture, including the assumptions on workload and hardware architecture, and the system model; the problem definitions are also given there. Section 3 presents VM-aware fair schedulers for preemptive and non-preemptive tasks. Section 4 presents the performance evaluation results for the VM-aware fair scheduler and other fairness schedulers.

2. BACKGROUND AND PROBLEM DEFINITION
2.1 Background
The Open Computing Language (OpenCL) [2] is a C99-based [1] open standard for parallel programming across heterogeneous processors. It is designed to abstract the heterogeneity of the underlying hardware such that applications are not bound to a specific hardware environment. Figure 2 shows the architecture of the OpenCL framework.

Figure 2. OpenCL Framework

An OpenCL application is a computation task that consists of software components implemented in OpenCL. A host represents the run-time environment, including hardware and software, that executes OpenCL applications. In the framework, a context consists of a set of processing elements, a memory region, and a communication interface, which together define the hardware environment for executing computation tasks. Hence, a context represents a subset of the hardware components on the platform that hosts the applications.
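As a minimal illustration of these concepts, the sketch below enumerates the platform's devices and builds a context and a per-device command queue. It uses the third-party pyopencl bindings, which are our choice for illustration; the paper does not prescribe a host-side language.

    # A minimal sketch of the host-side OpenCL objects described above,
    # written with the third-party pyopencl bindings (our choice of binding).
    import pyopencl as cl

    platform = cl.get_platforms()[0]          # the installed OpenCL runtime
    devices = platform.get_devices()          # CPUs, GPUs, DSPs, accelerators, ...
    ctx = cl.Context(devices)                 # a context: processing elements plus
                                              # the memory region they share
    queue = cl.CommandQueue(ctx, devices[0])  # requests targeted at one device
    prog = cl.Program(ctx, "__kernel void noop() {}").build()  # compiled per device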

The Generalized Processor Sharing (GPS) scheduler was first defined in [20]; it is an idealized scheduling model that achieves perfect fairness, and many other schedulers use GPS as a base to schedule tasks with different characteristics. Consider a system with P CPUs and N threads, where each thread pi has a weight wi and 1 ≤ i ≤ N. Let Si(t1,t2) be the CPU time that thread pi is allocated in the time interval [t1,t2]. When any thread pi remains executable in [t1,t2], Definition 2.1 holds under the GPS algorithm.

Definition 2.1: A GPS scheduler satisfies Si(t1,t2)/Sj(t1,t2) ≥ wi/wj, for i,j = 1, 2,…, N.

The equality holds if thread pj, for 1 ≤ j ≤ N, remains executable within the time interval [t1,t2]. The definition leads to the following two properties of a GPS scheduler:

• If both threads pi and pj remain ready to execute with fixed weights in [t1,t2], then GPS satisfies

    Si(t1,t2)/Sj(t1,t2) = wi/wj.    (1)

• If the set of executable threads, Φ, and their weights remain unchanged throughout the interval [t1,t2], then for any thread pi ∈ Φ, GPS satisfies

    Si(t1,t2) = (wi / Σpj∈Φ wj) · (t2 − t1) · P.    (2)

Many prior works such as WFQ [10] and BVT [11] apply to uni-processor scheduling with the proportional fairness property. For multiprocessors, lag [4] is used as a performance metric to analyze the fairness bound of a system; the smaller the lag values, the fairer the system. Suppose that Algorithm A is a scheduler (not a GPS scheduler) deployed on a multi-processor platform.

Definition 2.2: For any time interval [t1,t2], the lag of thread pi scheduled by Algorithm A is defined by lagi(t) = Si,GPS(t1,t) − Si,A(t1,t) for t ∈ [t1,t2], where Si,GPS and Si,A represent the computation time allocated to thread pi by the GPS algorithm and by Algorithm A, respectively.
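To make Definitions 2.1 and 2.2 concrete, the following sketch (our illustration, not code from the paper) computes the GPS-ideal allocation over an interval and the lag of an arbitrary scheduler's actual allocation; following property (2), the ideal share of a continuously runnable thread is its weight fraction of the total capacity P·(t2 − t1):

    def gps_share(weights, active, t1, t2, num_cpus):
        """CPU time each continuously runnable thread receives under ideal GPS."""
        total_w = sum(weights[i] for i in active)
        return {i: weights[i] / total_w * (t2 - t1) * num_cpus for i in active}

    def lag(i, actual, weights, active, t1, t2, num_cpus):
        """lag_i(t2) = S_i,GPS(t1,t2) - S_i,A(t1,t2); smaller magnitude means fairer."""
        return gps_share(weights, active, t1, t2, num_cpus)[i] - actual[i]

    # Example: 2 CPUs, three threads with weights 1:2:1, all runnable over [0, 6).
    w = {1: 1.0, 2: 2.0, 3: 1.0}
    ideal = gps_share(w, active={1, 2, 3}, t1=0.0, t2=6.0, num_cpus=2)
    # ideal == {1: 3.0, 2: 6.0, 3: 3.0}, so S_i/S_j == w_i/w_j as in property (1)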
Multiprocessor scheduling [15] has been well studied in past decades, and previous works mostly target homogeneous multiprocessor platforms. With the emergence of GPGPUs and specialized co-processors [17][13], programming and exploiting heterogeneous multiprocessors have received increasing attention lately [8][12][23]. We summarize three key concepts of the related works in the following. First, only one VM, called the Host OS/VM or Dom0, is granted direct access to the GPUs; the other VMs, called guest VMs, pass their GPU requests to the Host OS through the virtual machine manager (VMM). Second, the driver model is composed of a frontend on the Guest OS and a backend on the Host OS. The frontend driver is a light-weight layer responsible for connecting and redirecting API calls to the Host OS; the backend driver receives and interprets the remote requests, creates a corresponding execution context for the API calls, and then returns the results to the Guest OS. Third, they provide a scheduling policy (yet a very simple one, such as FCFS) to manage requests from the Guest OSes. Our work shares several features in part with [12] and [23]: we not only virtualize GPUs but also consider fairness properties on heterogeneous multiprocessor platforms in general. This enables us to better exploit the computation capacity of heterogeneous platforms through a unified interface. For works targeting the porting of OpenCL to virtualized environments, we refer the reader to [12].

In task scheduling, one important issue is to assure fairness, such that resources are allocated to tasks according to their importance. Previous works can be classified into two categories: virtual-time based algorithms and round-robin based algorithms. Virtual-time based algorithms such as WFQ [10], Parekh [21], Bennett [5], and Blanquer [6] can achieve the strongest fairness with a constant lag bound; however, they have limited scalability because they use a centralized queue to achieve fairness. Round-robin algorithms have O(1) time complexity but an O(N) lag bound. However, if the number of tasks and the weights are bounded by a constant, round-robin based algorithms can also achieve a lag bound similar to that of virtual-time based algorithms, and they have better scalability. Caprita et al. extend the original GR3 scheduler on uni-processors [9] to multi-processors [7] by grouping tasks with different ranges of weight. These algorithms rely on timer interrupts, which are usually unavailable on ASIC processors; as a result, they cannot be applied to these types of processing elements.

2.2 System Architecture
In this work, we take advantage of a virtualization environment to manage the resources. Figure 3 shows the system architecture of VM-aware resource management for the OpenCL-based heterogeneous multi-core execution model.

Figure 3. System Architecture for OpenCL-based Model

There are two types of virtual machines (VMs) in our system: general-purpose VMs, or Gen-VMs, and the OpenCL-specific VM, or OpenCL-VM. One and only one OpenCL-VM is deployed in the system; it has a role similar to the controlling domain 0 in Xen [25] and is responsible for managing and scheduling the heterogeneous multi-processor resources. User applications are executed on Gen-VMs and, whenever they enter their parallelized code sections and need to invoke OpenCL for parallel execution, they send a command to the OpenCL-VM. We present the Gen-VM components in charge of passing OpenCL commands shortly. When the OpenCL-VM receives the commands, it invokes the VM-Aware Fair Scheduler (VMAFS) to schedule and dispatch the OpenCL kernels to the corresponding computation devices. We should point out that our work focuses on scheduling policies that improve proportional fairness among Gen-VMs, which is absent in the previous works mentioned above. In our proposed architecture, the "virtualized" OpenCL framework is realized in three components:

a.) OpenCL runtime and platform libraries reside in the Gen-VMs, supporting passing commands to the OpenCL-VM; b.) the Request Manager and Runtime Manager reside in the OpenCL-VM, providing basic runtime support, such as access to the underlying hardware information, to OpenCL tasks; and c.) the VM-Aware Fair Scheduler. We note that the current OpenCL run-time environment adopts a FCFS (first-come-first-serve) scheduling algorithm using an in-order queue for each individual processor or device (e.g., a GPU), so that the kernels are scheduled sequentially. However, Gen-VMs usually host different user applications and thus have different importance. The native sequential scheduling not only fails to reflect the priorities of the Gen-VMs, but also fails to deliver time-sensitive performance. To improve this, we propose a scheduler, referred to as the VM-Aware Fair Scheduler (VMAFS), that achieves proportional fairness among the kernels of different Gen-VMs. We discuss the details in later sections.

We define VM-aware fairness by extending the definition given for GPS [20] to heterogeneous multi-core processors. In particular, we ignore threads' properties and take into account only VMs' properties. Consider a system with Nm units of processor Pm for 1 ≤ m ≤ M, where Nm ≥ 1, and N virtual machines, where each virtual machine VMi has a weight wi for 1 ≤ i ≤ N. Let Si(t1,t2) be the computation time allocated to VMi in the time interval [t1,t2] on all processors. The VM-level ideal fairness satisfies the following two conditions:

If both virtual machines VMi and VMj have backlogged task queues in the time interval [t1,t2] and their weights are wi and wj, the following equation holds:

    Si(t1,t2)/Sj(t1,t2) = wi/wj.    (3)

If the virtual machines in the set Φ remain non-idle throughout the interval [t1,t2] and their weights remain unchanged, then for any VMi ∈ Φ the following equation holds:

    Si(t1,t2) = (wi / ΣVMj∈Φ wj) · (t2 − t1) · Σm Nm.    (4)

The goal of this paper is to design a scheduler for OpenCL kernels that assures that the VM-level fairness defined in Equations (3) and (4) is met. In other words, the designed scheduler should minimize the difference between the allocated execution time and its ideal fair share.

3. VM-AWARE FAIR SCHEDULER
The VM-aware fair schedulers (VMAFS) are divided into two subroutines, VMAFS-P and VMAFS-N, depending on whether the underlying execution environment allows a task to be preempted with reasonably low overhead. Tasks running on GPPs can in general be preempted and context-switched to other tasks at low cost, typically incurring some cache misses. Due to proprietary software stacks, binary device drivers, and hardware limitations, tasks running on SPPs, however, are less likely to be preemptible, or preemption may introduce non-negligible overhead such as massive data movement across buses. Rather, the typical execution model requires the current task to finish or voluntarily relinquish the processor before the next task can run. We design schedulers for both preemptible and non-preemptible processors to achieve VM-level fairness; they are described in Sections 3.1 and 3.2 respectively. Note that the decision of whether a kernel should run on GPPs or SPPs is made by the programmers and is not addressed in this paper.

3.1 VMAFS-P Scheduler
3.1.1 Overview
There are two queues on each processor: the active queue and the budget-exhausted queue. All kernels in the active and budget-exhausted queues are called "schedulable kernels". Initially, all schedulable kernels are placed in the active queue. In general, when a kernel on a processor has been selected for execution and has exhausted its budget, it is placed into the budget-exhausted queue. We adopt the concept of the round-robin method to schedule kernels. That is, after all kernels have been placed into the budget-exhausted queue, VMAFS-P switches the two queues and the processor advances to its next round. We define a Switch event as the event when the active queue becomes empty and the processor is permitted to switch it with the budget-exhausted queue, and we define the time interval between two consecutive occurrences of a Switch event as a round. We describe in Section 3.1.3 how VMAFS-P maintains the runtime status of each processor and kernel task, along with how it distributes proportional processor shares to individual VMs.

The rationale of the VMAFS-P algorithm is to assure that the number of schedulable kernels of each VM is the same. This can be done by setting the number of schedulable kernels to be a fraction of the total number of kernels of the VM which has the least number of kernels. VMs having more kernels are forced to temporarily remove some of their kernels from the queues, or not enter them into the processors' active/budget-exhausted queues. This does not affect the fairness property of the scheduler, and the VMs have the flexibility to decide which of their kernels run first according to their own scheduling criteria.

3.1.2 Terminology of the VMAFS Algorithm
In this subsection, we define the terminology used in the VMAFS algorithm. Suppose that there are P preemptible processors and N virtual machines in the system. Denote by VMi the i-th virtual machine of the N VMs. wi represents the weight of VMi for 1 ≤ i ≤ N, and wmax records the largest value among all wi. When the VMAFS-P scheduler selects a new kernel of VMi for execution, the kernel executes for a time interval of length Bi, the execution-time budget of each kernel from VMi. Let S be the (minimum) basic scheduling time quantum in the system; then Bi = S · wi. For each processor p ∈ Ψ, roundp is the number of rounds processor p has gone through, that is, the number of consecutive Switch events p has experienced. At any time t, roundmax is the largest roundp. Denote by the accumulated received round σi the number of times that a kernel of VMi has been selected for execution by some processor p. When the system starts up, the VMAFS-P scheduler initializes σi to 0 and increases it by one each time a kernel of VMi is selected by some processor. σmax and σmin represent the largest and smallest of all σi, respectively. During run time, the VMAFS-P algorithm maintains the invariant that the difference between σmax and σmin is bounded by U, a value defined empirically by the system administrator. As mentioned earlier, VMAFS-P controls the number of schedulable kernels that enter the processor queues for each VM; in this paper, this number is denoted by Ctotal and can be set empirically by the system administrator or programmers. Table 1 summarizes the notation used in this paper.

APPLIED COMPUTING REVIEW MAR. 2013, VOL. 13, NO. 1 31 current σmax and σmin. Otherwise it goes to Step 2. We introduce Table 1. Notation summary the details of Step 2 later. After finding the σmax and σmin, Items Description VMAFS-P scheduler checks if (σmax - σmin) is greater than or Ψ The set of all processors equal to U. If true, go to "Lag Boosting Procedure", otherwise go to "Load Sharing Procedure". Step 2 (roundp is not equal to VM i-th virtual machine (also called 'VM' i) i roundmax): If active queue is not empty, choose a kernel k at the N The number of virtual machines. head of the queue for execution. Otherwise increased roundp by w The weight of virtual machine VM i i Algorithm 1 VMAFS-P: Lag Boosting Procedure wmax The largest value of all wi S Constant basic scheduling quantum. VMi = Find the VM with σmin P be the processor executing Lag Boosting Procedure Bi Budget of VMi, Bi = wi*S. if active queue does not hold any kernel of VMi then round The current round number of processor P. p Search for kernels of VMi in other processors’ active

roundmax maxp∈Ψ {roundp$} queues and budget-expired queues Move kernel k the head of p’s active queue so that it can be σi Accumulated received round, the number of executed immediately time instant that VM 's kernel being selected i else by any processor. Move kernel k∈ VMi to the head of p’s active queue σmax The largest value of all σi.

σmin The smallest value of all σi. U The upper bound of the difference between σmax and σmin.

Ctotal The number of schedulable kernels on each VM for all time. P The number of processors each VM. In this paper, this variable is denoted by Ctotal and can be set empirically by system administrator or programmers. Table 1 summarizes the notation used in this paper. Figure 4. Flowchart of Kernel Selection Procedure

one and then invoke the "Kernel Select Procedure" again.

3.1.3 VMAFS-P Algorithm There are four procedures in VMAFS-P algorithm. First, "Kernel Select Procedure" is the entry point of the VMAFS-P scheduler and is invoked by each processor p when the budget of the current kernel on p has been used up and p is about to select the next Figure 5. Flowchart of Lag Boosting Procedure kernel for execution. Second, in Kernel Select Procedure, we We now introduce the details of Lag Boosting Procedure, design a "Load Sharing Procedure", whose goal is to have light- illustrated in Figure 5, and Load Sharing Procedure, illustrated in loaded processor to share the work from heavy-loaded processor Figure 6. The key idea of Lag Boosting is to identify the VM with during each round. The main idea is to balance the load on each minimum accumulating received round (σ value), σmin, and then processor so that the kernels on light loaded processors do not to select the kernel from this VM to compensate for the receive more than enough execution time. Third, "Lag Boosting insufficient execution time it has received. Note that this Procedure", also being invoked in Kernel Select Procedure, is subroutine is always invoked when violating the invariant responsible for keeping the difference between σmax and σmin inequality $σmax - σmin <= U$ becomes possible, thus, intuitively, within U. The rationale behind is to ensure that each VM would VMAFS-P maintains the invariant that all virtual machines receive its proportional sharing of the P processors during any receive similar number of times of kernel execution. Setting up time interval. Finally The "Entrance Control Procedure" is the value of U is a trade-off between fairness guarantee and responsible for maintaining the total number of schedulable migration overhead. With a large U, VMAFS-P is highly kernels in system. The details of the procedures are described as decentralized and could better utilize underlying multiprocessors follows. to achieve better scalability, while still achieves certain degree of Figure 4 illustrates the flowchart of Kernel Select Procedure, the VM-level fairness. On the other hand, VMAFS-P achieves more main routine of VMAFS-P, along with its two important accurate VM-level fairness at the cost of higher migration subroutines - Load Sharing Procedure and Lag Boosting overhead when U is set small. Algorithm 1 lists the detail flow of Procedure. Upon the time when processor p is about to select a Lag Boosting Procedure. new kernel for execution. Step 1: the scheduler checks if roundp is equal to roundmax. If the result is true, the scheduler compute the
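The following sketch (ours; the paper specifies these procedures only as flowcharts and pseudocode, so the data structures and helper names here are assumptions) outlines the Kernel Select and Lag Boosting procedures in Python:

    from collections import deque

    class Processor:
        """Per-processor state assumed by this sketch."""
        def __init__(self):
            self.active = deque()       # kernels runnable in the current round
            self.exhausted = deque()    # kernels that used up their budget
            self.round = 0

    def kernel_select(p, procs, sigma, U):
        """Sketch of Steps 1-2, with the queue switching at a Switch event."""
        round_max = max(q.round for q in procs)
        if p.round == round_max:                      # Step 1
            if max(sigma.values()) - min(sigma.values()) >= U:
                lag_boosting(p, procs, sigma)         # Algorithm 1
            # else: Load Sharing Procedure (see the Algorithm 2 sketch below)
        if p.active:                                  # Step 2: run the queue head
            k = p.active.popleft()
            sigma[k.vm] += 1                          # accumulated received round
            return k
        if p.exhausted:                               # Switch event: next round
            p.active, p.exhausted = p.exhausted, p.active
            p.round += 1
            return kernel_select(p, procs, sigma, U)
        return None                                   # no schedulable kernel on p

    def lag_boosting(p, procs, sigma):
        """Algorithm 1: pull a kernel of the most-lagging VM to the head of p's queue."""
        vm_min = min(sigma, key=sigma.get)            # the VM with sigma_min
        for q in procs:
            for queue in (q.active, q.exhausted):
                for k in list(queue):
                    if k.vm == vm_min:
                        queue.remove(k)
                        p.active.appendleft(k)        # execute it immediately
                        return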

The round number of each processor p keeps track of its current round. Since VMAFS-P is designed as a decentralized algorithm, the workload dispatched to each processor can be uneven; this can create performance bottlenecks and the possibility of unfairness. To avoid such issues, the Load Sharing Procedure always requires lightly loaded processors (roundp = roundmax) to migrate workload from heavily loaded processors (roundp < roundmax) before they can advance to the next round. Algorithm 2 presents this idea.

Algorithm 2 VMAFS-P: Load Sharing Procedure
    if the active queue is empty then
        find kernels from the active queues of processors whose round number equals roundmax, or from the budget-exhausted queues of processors whose round number is less than roundmax
        if there are kernels in these queues then
            move some kernels to p's own active queue
        else
            increase roundp by one and return to the Kernel Select Procedure
    else
        switch the queues (active, budget-exhausted)
        increase roundp by one and return

Figure 6. Flowchart of Load Sharing Procedure

Finally, the Entrance Control Procedure checks the kernel's budget. That is, when a kernel completes, VMAFS-P checks whether its execution time is equal to its budget. If not, the scheduler adds a new kernel from the same VM to the active queue, which continues to execute for the remaining budget. Otherwise, the kernel has executed its full budget, and the scheduler puts it into the budget-exhausted queue as usual. We list its pseudocode in Algorithm 3 and its flowchart in Figure 7.

Algorithm 3 VMAFS-P: Entrance Control Procedure
    VMi = the VM to which the current kernel k belongs
    ExecutedTime = how long k has been executed in this round
    if k is not finished then
        move k to the budget-exhausted queue
    else
        move a new kernel k' from VMi to the active queue and continue to execute k' for the remaining time (Bi − ExecutedTime)

Figure 7. Flowchart of Entrance Control Procedure
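In the same spirit, here is a compact sketch of Algorithms 2 and 3 (again our rendering; `backlog` and the kernel attributes are assumed structures, not names from the paper):

    def load_sharing(p, procs, round_max):
        """Algorithm 2 sketch: before advancing, an idle processor at round_max
        pulls work from the active queues of peers at round_max, or from the
        budget-exhausted queues of peers still behind (round < round_max)."""
        if p.active:
            return
        for q in procs:
            if q is p:
                continue
            if q.round == round_max and q.active:
                p.active.append(q.active.pop())        # migrate a kernel to p
                return
            if q.round < round_max and q.exhausted:
                p.active.append(q.exhausted.pop())
                return
        p.active, p.exhausted = p.exhausted, p.active  # nothing to share:
        p.round += 1                                   # advance to the next round

    def entrance_control(p, k, executed_time, backlog, S):
        """Algorithm 3 sketch: invoked when kernel k stops running on processor p.
        backlog[vm] is a deque of that VM's kernels not yet admitted to any queue."""
        budget = k.vm.weight * S                       # B_i = w_i * S
        if executed_time >= budget:                    # budget used up: wait for
            p.exhausted.append(k)                      # the next round
        elif backlog[k.vm]:                            # finished early: admit a new
            k_next = backlog[k.vm].popleft()           # kernel of the same VM to
            k_next.remaining = budget - executed_time  # spend B_i - ExecutedTime
            p.active.appendleft(k_next)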

3.2 VMAFS-N Scheduler
In VMAFS-N, the VM-level fairness is defined by N-dimensional vectors. Within a time interval [0, tk], the amount of computation time allocated to virtual machine VMi is represented by ri(tk). Each virtual machine VMi has a predefined weight wi representing the importance or criticality of the virtual machine. In the algorithm, we use vectors to represent the allocated computation time and the weights of the virtual machines. According to the definition of fairness on heterogeneous multi-core processors given in Section 2, the ideal computation time allocation for any two virtual machines VMi and VMj, both of which have backlogged ready queues in the time interval [tm, tn] where tm < tn, should meet the following constraint: ri(tn) − ri(tm) : rj(tn) − rj(tm) = wi : wj. Hence, we use an N-dimensional vector to represent the allocated computation time for all virtual machines in the system over a given time interval. The vector is called the status vector, denoted Vm,nN = <r1, r2,…, rN> for the time interval [tm, tn], where ri = ri(tn) − ri(tm) for 1 ≤ i ≤ N. The ideal fairness is represented by the fairness vector IN, defined as IN = <w1, w2,…, wN>, where VMi is not idle for 1 ≤ i ≤ N and Ii : Ij = wi : wj. The status vector represents ideal fairness if it points in the direction of IN. As shown in Figure 8, there are two VMs in the system; a vector's projection onto the VM1 (VM2) axis represents the accumulated execution time of VM1 (VM2). I2 is the vector <2, 1> when w1:w2 = 2:1, and V2(tk) = <v1, v2> is the status vector of the VMs at time tk, where v1 and v2 represent the accumulated execution time received by VM1 and VM2 at time tk, respectively.

Figure 8. An Example of Geometric Representation of Fairness

The fairness of a computation time allocation can be represented by the angle between the status vector and the fairness vector. When the angle is zero degrees, the two vectors point in the same direction and the allocated computation time meets the fairness constraint. Because the allocated computation time is never negative, the maximum angle between the two vectors is 90 degrees. Hence, we define the fairness factor at time tk, denoted Ftk, as the cosine of the angle, given by the following formula:

    Ftk = cos θ = (VN · IN) / (|VN| · |IN|).    (5)

When the angle is zero degrees, the fairness constraint is met and the fairness factor is 1; when the angle is 90 degrees, the fairness factor is 0.

Figure 9 illustrates the terminology defined above. At time tk, three virtual machines are not idle and the fairness vector is I3. When the allocated computation time is V3(tk), the fairness factor is cos(θ1). At time tk+1, two virtual machines are not idle and the fairness vector is I2. When the allocated computation time is V2(tk+1), the fairness factor is cos(θ2).

Figure 9. An Example of Adding Virtual Machines

The main idea of the non-preemptive scheduler is to use the vector interpretation of fairness described above to identify the most suitable kernel to execute. The geometric interpretation enables us to measure exactly how far the current status is from ideal fairness. The VMAFS-N scheduler adopts a greedy approach: whenever invoked, it searches for a ready kernel such that the angle between the ideal vector IN and the status vector VN (after the execution of the candidate kernel) is the smallest. Since the cosine is a strictly decreasing function for 0 ≤ θ ≤ π/2, the VMAFS-N scheduler maximizes Equation (5) at each scheduling instant. In this way, VMAFS-N tries to keep the status vector pointing in the direction closest to the ideal fairness vector. To achieve fairness, VMAFS-N selects the kernel that minimizes the angle between VN(tk+1) and IN, where VN(tk+1) is the status vector if the computation time is allocated to the selected kernel and the kernel finishes at time tk+1. The pseudocode of the VMAFS-N algorithm is listed in Algorithm 4. Note that the VMAFS-N algorithm is applied individually to each type of processing element, and there can be several units of each type.

Algorithm 4 VMAFS-N: Non-preemptive Scheduler
    Input: the kernels on the request queues of the N virtual machines
    update the status vector v(t) and the ideal vector I
    for each ready kernel do
        compute v', the status vector after the kernel's execution
        if cos θ(v', I) is the maximum so far then
            record the kernel as the candidate
    dispatch the candidate kernel

Figure 10 shows an example that illustrates how the VMAFS-N algorithm works. There are two VMs in the system. The ideal vector I2, represented by the dashed line, and the current status vector V2(tk), represented by the solid line, are shown in the figure. When the current kernel completes at scheduling instant tk, the scheduler calculates the cosine value for all ready kernels according to Equation 5. In this example, there are three ready kernels in the queues, and the gray dashed lines show the three status vectors that would result if each kernel were selected at scheduling instant tk and finished at tk+1. As shown in the figure, the execution times of the kernels are k1, k2, and k3; among them, one is hosted by virtual machine VM1 and the other two are hosted by virtual machine VM2. Because θ2 < θ1 < θ3 in this example, θ2 is the smallest angle against the ideal fairness vector, so the VMAFS-N algorithm chooses the corresponding kernel on virtual machine VM2 as the next task.

When the ready queue of any virtual machine VMj becomes empty, the algorithm removes the allocated computation time of VMj from the status vector, reducing VN to VN−1. When a virtual machine VMj becomes available again at scheduling instant tk, the algorithm adds an allocated computation time vj for VMj to the status vector. The value vj is the maximum of the computation time allocated to VMj thus far and the allocated computation times of the other virtual machines normalized by weight, vi × (wj / wi). Figure 10 shows an example that illustrates the change. Before scheduling instant tk, only two virtual machines have backlogged ready queues. At scheduling instant tk, virtual machine VM3 becomes ready to execute, i.e., its ready queue becomes backlogged. The algorithm replaces V2(tk) by V3(tk), where r3 is the maximum among r1, r2, and r3. Given the updated status vector, VMAFS-N is invoked to select the next kernel. In this example, each virtual machine has only one ready kernel: k1, k2, and k3. Among the kernels, the kernel on virtual machine VM3 leads to the maximum fairness factor and is selected as the next kernel to execute.

4. PERFORMANCE EVALUATION
In this section, we present performance evaluation results for the VMAFS-N algorithm to demonstrate its effectiveness. Extensive simulations are conducted to evaluate the VM-aware Fair Scheduler; comparisons with other algorithms are also included.
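A short sketch of the greedy step (ours; the paper gives Algorithm 4 only in truncated form, so the data layout here is an assumption) makes the selection rule concrete:

    import math

    def fairness_factor(status, ideal):
        """Equation (5): cos(theta) between status vector V and fairness vector I."""
        dot = sum(v * i for v, i in zip(status, ideal))
        norm = math.hypot(*status) * math.hypot(*ideal)
        return dot / norm if norm else 1.0

    def vmafs_n_select(ready, status, weights):
        """Greedy VMAFS-N step: pick the ready kernel whose completion keeps the
        status vector closest (maximum cosine) to the ideal direction <w1,...,wN>."""
        best, best_cos = None, -1.0
        for vm, exec_time in ready:               # a ready kernel: (vm index, time)
            trial = list(status)
            trial[vm] += exec_time                # status if this kernel completes
            c = fairness_factor(trial, weights)
            if c > best_cos:
                best, best_cos = (vm, exec_time), c
        return best

    # Two VMs with weights 2:1 (cf. Figure 8) and three ready kernels.
    status = [40.0, 30.0]                         # execution time received so far
    ready = [(0, 10.0), (1, 10.0), (1, 25.0)]
    print(vmafs_n_select(ready, status, weights=[2.0, 1.0]))  # -> (0, 10.0)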


Figure 10. Enlarged VM Set and updated status vector

We compare the performance of the VMAFS-N scheduler with that of a first-come-first-serve (FCFS) scheduler and a credit-based scheduler. The first-come-first-serve policy has been adopted in the OpenCL framework. The credit-based scheduler has long been used in the Xen hypervisor for scheduling VM tasks; we keep all of its VCPU concepts but rewrite it to support non-preemptive scheduling.

Two performance metrics are used to evaluate the fairness properties and the overhead of the scheduling algorithms. The allocated computation time of each VM directly measures how processor resources are assigned to the VMs. The fairness factor, i.e., the cos(θ) value between the status vector and the ideal unit vector, represents how close the actual computation allocation is to the ideal VM-level fair allocation.

Table 2 shows the workload parameters for the experiments. These parameters are selected to have extensive coverage of modern and future embedded systems. The latest mobile devices (in year 2012) have four cores and host two virtual machines. In our experiments, the number of processing cores ranges from 2 to 10; the number of virtual machines is randomly selected between 2 and 15. The number of ready kernels on each VM, denoted by Ctotal, ranges from 1 to 500. In other words, the total number of ready kernels in the system, i.e., N·Ctotal, ranges from 2 to 7,500 at any time. In the VMAFS-N algorithm, we assume that all kernels have a bounded worst-case execution time, so that none of them can occupy a processor for an unlimited execution time. The execution time and its upper bound are set by observing realistic workloads running on an ATI Radeon™ HD 5870: a set of micro-benchmarks yields execution times varying between 20 μs (for light workloads) and 200 μs (for heavy workloads). Therefore, we set the range of kernel execution times to 20-200 μs. Finally, we set the ratio of VM weights between 1 and 10 and observe its impact on the scheduling results.

Table 2. Workload Parameters

    Parameter                                    Range
    Number of processing cores (P)               2-10
    Number of virtual machines (N)               2-15
    Number of ready kernels of each VM (Ctotal)  1-500
    Total schedulable kernels (N × Ctotal)       2-7,500
    Kernel execution time                        20-200 μs
    Ratio of VM weights                          1-10
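To show how such a configuration could be drawn in a simulation, the snippet below samples one experiment from the ranges in Table 2 (a sketch of ours; the paper does not state the sampling distributions, so uniform sampling is an assumption):

    import random

    def random_workload(rng=random):
        """Draw one simulated configuration from the ranges in Table 2."""
        P = rng.randint(2, 10)                  # processing cores
        N = rng.randint(2, 15)                  # virtual machines
        c_total = rng.randint(1, 500)           # ready kernels per VM
        weights = [rng.randint(1, 10) for _ in range(N)]
        kernels = [[rng.uniform(20e-6, 200e-6)  # kernel execution time: 20-200 us
                    for _ in range(c_total)] for _ in range(N)]
        return P, N, weights, kernels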

APPLIED COMPUTING REVIEW MAR. 2013, VOL. 13, NO. 1 35 VM when the simulation ends and the cos(θ) value sampled at various time points. Figure 12 and Figure 13 show the achieved degree of fairness for different algorithms in terms of allocated computation time for different configurations. In Figure 12, we set N=4; P=4; Ctotal =50; U=5; S=1(ms), and the execution time varies from 0 to 2000(ms). The major difference of these two cases is weight ratio. w1: w2: w3: w4 = 1:2:4:8 in Figure 12 and w1: w2: w3: w4: w5 = 1:2:3:4:5 in Figure 13. As we can see, the GR3 and VMAFS-P(with U) both achieve near ideal proportional fair share for each VM. On the other hand, the VMAFS-P (without U) performs worse in terms of fairness guarantee. We should note that there could still be limitations in VM-level GR3 3 scheduler. The GR scheduler achieves fairness by running the Figure 14. Degree of Fairness with Weight Constraint intra-group round robin scheduler on the top of inter-group scheduler. However, if a large number of VMs fall into the same group (this might happen when their weights are all between 2M to 2M+1 for some M), it becomes a round-robin scheduling algorithm and cannot maintain fairness.

Figure 12. Fairness Comparison with Weight Ratio Constraint

Figure 13. Fairness Comparison without Weight Ratio Constraint

We further measure the fairness, i.e., the cos(θ) value between the ideal fairness vector and the allocated computation time vector, for the different algorithms during the simulation, and report the maximum cos(θ) value for each algorithm; the greater the cos(θ) value, the higher the degree of fairness achieved. As we can see in Figures 14 and 15, VMAFS-P always achieves near-ideal fairness in terms of cos(θ) (with a cos value > 0.95 in all cases). GR3 achieves fairness closer to ideal in Figure 15 than in Figure 14 because there all VMs are assigned to different groups by the inter-group scheduler of GR3. VMAFS-P, however, achieves almost ideal fairness for any weight ratio.

Figure 14. Degree of Fairness with Weight Constraint

Figure 15. Degree of Fairness without Weight Constraint

4.1.3 Comparison of Context Switch Overhead between VMAFS-P and GR3
Although GR3 can achieve good fairness when VMs are assigned evenly to different groups (no skew), it achieves fairness by fine-grained scheduling that may require extensive context switches at every time quantum. On the other hand, VMAFS-P adopts a weighted round-robin (WRR) method to reduce the context switch overhead and to better utilize cache affinity. Figure 16 shows that the number of context switches is reduced as N increases; in Figure 16, the number of schedulable kernels is fixed while N varies. The VMAFS-P scheduler performs a context switch only when the current kernel exhausts its budget, whereas the GR3 scheduler switches to the next kernel at each time quantum; that is, VMAFS-P switches to the next kernel only after Bk (= wk·S) time units. As a result, VMAFS-P can greatly reduce the context switch overhead at run time.

Figure 16. Context Switch Overhead for VMAFS-P and GR3

4.1.4 Migration Overhead of VMAFS-P with/without U
To achieve better fairness, the scheduler may need to migrate the tasks of VMs from one physical processor to another when certain processors are occupied by a VM. To understand the relationship between fairness and migration overhead, we measure the number of migrations under different upper bounds on fairness and compare it against the number of migrations when there is no upper bound. Figure 17 shows the results when the upper bound on fairness, U, varies from 5 to 30. When there is no upper bound on fairness, the number of migrations does not change. When there are upper bounds, the number of migrations falls as the upper bound U increases, because the scheduler can delay the execution of kernels when the upper bound on fairness increases. In particular, the number of migrations increases by 40% when the upper bound changes from 15 to 10, because the Lag Boosting Procedure is triggered more often when the upper bound on fairness is reduced.

Figure 17. Migration Overhead on VMAFS-P with/without U

4.2 VMAFS-N Scheduler
In this section, we present the fairness evaluation of the VMAFS-N scheduler when the tasks are executed on non-preemptive processors. Figure 18 shows how sensitive the fairness factor is to the execution times. The horizontal axis denotes the range of execution times, and the vertical axis denotes the average fairness factor. The results show that the fairness factor decreases as the range of execution times increases, due to the non-preemptivity of kernel execution. Fortunately, the impact on the fairness factor is negligible: when the execution times increase by 100%, the fairness factor decreases by less than 0.5%.

Figure 18. The Effect of Worst Case Execution Time on Fairness

Figure 19 shows how sensitive the fairness factor is to the weights of the VMs. The horizontal axis denotes the weight ratio of the VMs; the vertical axis denotes the cos(θ) value, i.e., the fairness factor. The weight ratio refers to the ratio of the weight of VMi+1 to that of VMi, for 1 ≤ i < N, with the weight of VM1 being 1. The execution time in Case 1 ranges from 10 μs to 60 μs; the execution time in Case 2 ranges from 30 μs to 180 μs. As we can see in the figure, the fairness factor decreases as the weight ratio increases. This is because the VMAFS-N scheduler selects the kernel that can redirect the status vector toward the ideal fairness vector. The comparison between the two cases in Figure 19 illustrates the rationale that the shorter the average execution time of the kernels, the finer the control over the processors the VMAFS-N scheduler can achieve; as a result, a higher degree of fairness can be reached. However, we also note that the fairness is not sensitive to the average execution time of the kernels: when the average execution time increases by 200%, fairness decreases by only 0.5%.

Figure 19. Fairness versus Weight Ratio

4.2.1 Comparison of Fairness with the Non-Preemptive Property
Figure 20 shows the performance of the VMAFS-N scheduler, FCFS, and the credit-based scheduler in terms of fairness. The horizontal axis denotes the different Gen-VMs, and the vertical axis denotes the allocated computing time of the VMs. There are six VMs in the system, and their weights range from 1 to 10. The VMAFS-N scheduler always achieves the best fairness level and is closest to ideal fairness. Although the credit-based scheduler approaches ideal fairness to a certain degree, VMAFS-N outperforms it in all cases.

Figure 20. Comparison of Fairness with Non-Preemptive Property

5. CONCLUSION AND FUTURE WORK
In this paper, we are concerned with scheduling OpenCL kernels under the concept of VM-aware fairness on virtualized heterogeneous multi-core platforms. We propose the VM-level fair scheduler (VMAFS), an on-line solution for scheduling OpenCL tasks onto heterogeneous multi-cores. Our scheduler provides a fairness guarantee to ensure that VMs receive computing time proportional to their importance. Besides, providing a fairness guarantee also avoids resource contention and prevents lower-priority Gen-VMs from receiving more computing resources than higher-priority Gen-VMs. In VMAFS-N, the VM-level fairness is interpreted geometrically using vector concepts. Performance evaluation results show that when the VMAFS-N algorithm is used, scheduling fairness is not sensitive to the amount of computation time of non-preemptive tasks. In addition, the VMAFS-N algorithm greatly outperforms the credit-based scheduling algorithm deployed in virtualization environments.

6. REFERENCES
[1] C programming language — C99. http://en.wikipedia.org/wiki/C99.
[2] The OpenCL standard website. http://www.khronos.org/opencl/.
[3] LynxSecure. http://www.lynuxworks.com/corporate/press/2011/virtualization-5.0.php, March 2012.
[4] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica, 15:600-625, 1996.
[5] J. Bennett and H. Zhang. WF2Q: worst-case fair weighted fair queueing. In INFOCOM '96, volume 1, pages 120-128, March 1996.
[6] J. M. Blanquer and B. Özden. Fair queuing for aggregated multiple links. SIGCOMM Comput. Commun. Rev., 31:189-197, August 2001.
[7] B. Caprita, W. C. Chan, J. Nieh, C. Stein, and H. Zheng. Group ratio round-robin: O(1) proportional share scheduling for uniprocessor and multiprocessor systems. In Proceedings of the USENIX Annual Technical Conference, ATEC '05, pages 36-36, Berkeley, CA, USA, 2005. USENIX Association.
[8] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In H. Sips, D. Epema, and H.-X. Lin, editors, Lecture Notes in Computer Science, volume 5704, pages 863-874. Springer Berlin / Heidelberg, 2009.
[9] W. C. Chan. Group ratio round-robin: An O(1) proportional share scheduler. Master's thesis, Columbia University, June 2004.
[10] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. SIGCOMM Comput. Commun. Rev., 19:1-12, August 1989.
[11] K. J. Duda and D. R. Cheriton. Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. SIGOPS Oper. Syst. Rev., 33:261-276, December 1999.
[12] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. GViM: GPU-accelerated virtual machines. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt '09, pages 17-24, New York, NY, USA, 2009. ACM.
[13] IBM. The Cell project at IBM Research. http://www.research.ibm.com/cell/. Last accessed June 2011.
[14] K. Lakshmanan and R. Rajkumar. Distributed resource kernels: OS support for end-to-end resource isolation. In 2008 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '08), pages 195-204, April 2008.
[15] S. T. Leutenegger and M. K. Vernon. The performance of multiprogrammed multiprocessor scheduling algorithms. SIGMETRICS Perform. Eval. Rev., 18:226-236, April 1990.
[16] T.-J. Lin, C.-N. Liu, S.-Y. Tseng, Y.-H. Chu, and A.-Y. Wu. Overview of ITRI PAC project — from VLIW DSP processor to multicore computing platform. In VLSI Design, Automation and Test, 2008 (VLSI-DAT 2008), pages 188-191, April 2008.
[17] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6:40-53, March 2008.
[18] nVidia Corporation. Tesla computing solutions. http://www.nvidia.com/object/tesla_computing_solutions.html. Last accessed June 2011.
[19] Open Kernel Labs. OKL4 Microvisor. http://www.ok-labs.com/products/okl4-microvisor, April 2011.
[20] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Trans. Netw., 1:344-357, June 1993.
[21] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Trans. Netw., 1:344-357, June 1993.
[22] Renesas Electronics Corporation. NaviEngine. http://www2.renesas.com/automotive/en/assp/naviengine.html. Last accessed June 2011.
[23] L. Shi, H. Chen, and J. Sun. vCUDA: GPU accelerated high performance computing in virtual machines. In Parallel & Distributed Processing, 2009 (IPDPS 2009), pages 1-11, May 2009.
[24] Texas Instruments. OMAP applications processors. http://focus.ti.com/dsp/docs/dsphome.tsp?sectionId=46. Last accessed June 2011.
[25] Xen Hypervisor. The Xen virtual machine monitor. http://www.xen.org/. Last accessed June 2011.


ABOUT THE AUTHORS:

Dr. Chi-Sheng Shih joined the Department of Computer Science and Information Engineering at National Taiwan University in 2004, where he currently serves as an Associate Professor. He received his B.S. in Engineering Science and M.S. in Computer Science from National Cheng Kung University, and in 2003 he received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. His main research interests include embedded systems, embedded software, hardware/software co-design for systems-on-a-chip, real-time systems, and database systems. He won best paper awards at RACS 2011, IEEE RTCSA 2005, and others.

Jie-Wen Wei was born in 1987 in Taiwan. She received her M.S. degree from National Taiwan University in 2011 and her B.S. degree from National Chung-Cheng University in 2009. In 2009, she joined the Embedded Systems and Wireless Networking Laboratory in the Department of Computer Science and Information Engineering, National Taiwan University, under the supervision of Prof. Chi-Sheng Shih. Her research interests include embedded system design, real-time scheduling, and multi-processor optimization. She is the project manager of the System Software Designs for Heterogeneous Embedded Multi-core Processors project.

Shih-Hao Hung is currently an associate professor in the Department of Computer Science and Information Engineering at National Taiwan University. His research interests include mobile-cloud computing, parallel processing, computer system design, and information security. He worked for Sun Microsystems Inc. in Menlo Park, California (2000-2005) after completing his postdoctoral work (1998-2000), Ph.D. training (1994-1998), and M.S. program (1992-1994) at the University of Michigan, Ann Arbor. He graduated from National Taiwan University with a B.S. degree in electrical engineering in 1989.

Joen Chen received the B.Sc. degree in Computer Science from the National Changhua University of Education, Taiwan, in 2010. He is working towards the M.Sc. degree at National Taiwan University, Taipei, Taiwan. Since July 2012, he has been an intern at the Institute for Information Industry, Taiwan. His research interest is OpenCL for collaborative computation.

Norman Chang is a Technical Manager at SNSI of the Institute for Information Industry in Taiwan. He currently leads an engineering team developing a high-speed PCI-e switch for distributed computing, in collaboration with a cloud gaming service.

Robust and Flexible Tunnel Management for Secure Private Cloud

Yung-Feng Lu
Department of Computer Science and Information Engineering
National Taichung University of Science and Technology
yfl[email protected]

Chin-Fu Kuo
Department of Computer Science and Information Engineering
National University of Kaohsiung
[email protected]

ABSTRACT
A private cloud is cloud infrastructure operated solely for a single organization, whether managed internally or by a third party, and hosted internally or externally. It provides a flexible way to extend the working environment. Since the business processes running on it can be critical, it is important to provide a secure environment in which organizations can execute those processes. While user mobility has become an important feature for many systems, technologies that provide users a lower-cost and flexible way of joining a secure private cloud are in strong demand. This paper exploits key management mechanisms to establish secured tunnels with a private cloud for users who may move around dynamically without carrying the same machine. A strong authentication with a key agreement scheme is proposed to establish the secure tunnel. Furthermore, the proposed framework also provides mutual authentication and session key renewal between the users and the cloud server. Several related security properties of the proposed mechanism are also presented.¹

Categories and Subject Descriptors
C.2.0 [Computer Systems Organization]: Computer Communication Networks—General—Security and protection

General Terms
Security

Keywords
Cloud Computing, Strong Authentication, Key Management, User Mobility, IPSec

1. INTRODUCTION
Cloud computing is the use of computing resources that are delivered as a service over a network. It is an emerging new computing model that enables convenient and on-demand network access to a shared pool of computing resources. Based on the deployment model, cloud computing can be classified into four categories: public cloud, community cloud, hybrid cloud, and private cloud. These provide Internet-based services, computing, and storage for users in all markets, including finance, healthcare, and government. However, cloud computing is also facing many challenges. Data security, as in many other applications, is among the challenges that raise great concerns from users when they store sensitive information on cloud servers. A survey regarding the use of cloud services made by IDC highlights that security is the greatest challenge for the adoption of the cloud [9].

Hence, security is a huge concern for cloud users, and providing a secure way to use the cloud has become an important issue. To improve the mutual trust between users and cloud providers, the Cloud Security Alliance (CSA) [1, 26] has developed a security guide that identifies many areas of concern in cloud computing. As pointed out in [21], there are many important security-relevant cloud components, such as cloud provisioning services, cloud data storage services, cloud processing infrastructure, cloud support services, and cloud network and perimeter security. To ease the security concerns of cloud applications, much excellent research [11, 13, 22, 31, 36] has focused on secure cloud computing to protect cloud access and data.

While a private cloud is cloud infrastructure operated solely for a single organization, there is a strong demand for tunneling technologies for accessing the cloud securely. In related areas, work has been done to provide more robust and secure communication for remote access, e.g., IPSec VPNs [4, 10, 19], secure routing [12, 15, 17], and secure SIP implementations [2, 7]. Although IPSec introduces good flexibility into the security enhancement of higher-layer protocols, the setup of IPSec-based secured tunnels tends to be static in configuration and restricted to a machine-to-machine fashion. While user mobility has become an important feature for many systems, technologies that provide users a lower-cost and flexible way of joining a VPN are in strong demand. An example approach in this direction is the adoption of the Ensemble system for group communication, which uses a single shared encryption key for the entire VPN [25]. The major problem with this approach is the vulnerability of the secured tunnels when some malicious user obtains the shared key.

¹ Copyright is held by the authors. This work is based on an earlier work: RACS'12 Proceedings of the 2012 ACM Research in Applied Computation Symposium, Copyright 2012 ACM 978-1-4503-1492-3/12/10. http://doi.acm.org/10.1145/2401603.2401670

In [18, 32], researchers have tried to deploy the network infrastructure with VPNs. However, these mechanisms are still restricted to a machine-to-machine fashion.

Another issue with private clouds is the need for authentication. User authentication is often the primary basis for access control, keeping the bad guys out while allowing authorized users in with a minimum of worry. Within a private cloud, authorized users are allowed to access various cloud services and resources. When several clouds work together, identity management is a necessary service [24]. Single sign-on for interactions on the Intercloud can be a good solution, and many technologies can be employed, such as Shibboleth [27, 33] and OpenID [30, 34]. An identity management infrastructure able to support authentication among federated clouds, based on SAML assertions, is proposed in [6]. To provide good access control, many excellent authentication mechanisms have been developed, e.g., password authentication [5, 23, 28] and ID-based authentication [14, 35]. However, given the limitations of these mechanisms, we still need a novel mechanism. Besides, it is well known that multi-factor authentication is preferable for defending against social engineering attacks, and there are excellent works [3, 20, 29] in this area. However, these mechanisms usually require additional IC cards or biometric features.

Thus, an approach is needed that provides strong authentication and secure communication for cloud computing. We are interested in extending IPSec implementations to establish secured tunnels for users who may move around dynamically without carrying the same machine. To provide high flexibility, non-IPSec schemes shall also be provided. With the proposed algorithm, strong authentication, secure tunneling, and user mobility can be achieved. An AAA-based negotiation procedure will be proposed for the implementations. The implementation of the proposed algorithm is simple and highly portable.

The rest of this paper is organized as follows: Section 2 explores the needed security services of the cloud and the system architecture of this paper. Section 3 presents a strong authentication and dynamic configuration algorithm that extends IPSec for the access of a secure private cloud; a session key renewal protocol is also presented in that section. Section 4 analyzes the security of the proposed mechanism. Section 5 is the conclusion.

2.1 Critical Services for Private Cloud 3. PORTABLE TUNNEL ESTABLISHMENT As point out in the Cloud Security Alliance (CSA) [1], cloud FOR PRIVATE CLOUD environment is a new model which cannot be well protected The goal of this scheme is to extend IPSec to provide a se- by traditional perimeter security solutions. Basically, we can cure, l ower-cost, and flexible way in joining a private cloud. identify there at least three specific services of the private Instead of adopting a full-scaled Public Key Infrastructure, cloud computing environment need the security improve- the purpose of this section is to propose an IPSec-based ment[21] with information technologies. These specific ser- dynamic configuration algorithm for the better support on vices are: user mobility. Our scheme consists of two phases: ticket

Before we describe the proposed protocol, we summarize the notation; the notations and abbreviations used in the following sections are listed in Table 1. $h_0$ and $h_1$ denote two independent secure cryptographic hash functions, $[\cdot]_{MK}$ denotes the computation of a MAC digest using a given key $MK$, $\{\cdot\}_{K_{P_iH}}$ denotes the encryption of a message using the encryption key $K_{P_iH}$ shared by the party $P_i$ and the Host $H$, and $\|$ denotes the concatenation of messages. We assume that $g$, $h_0$, and $h_1$ are fixed in advance and known to the tunnel agents, and that each party $P_i$ has a symmetric key $K_{P_iH}$ shared with the Host $H$.

3.1 Offline Phase: Ticket Generation

In the protocol, we use a key-based mechanism to achieve the objective of user mobility. Let Client and Host both hold the same set of pre-shared materials. We assume that when Host gets a request from Home, it authenticates the user and then generates a set of key tuples. Each key tuple includes a key, a corresponding password, a lifetime, and a parameter $g$. Note that the key tuples stored in the Key Database and in the moving subject differ: the Key Database keeps all information of the key tuples, while, to avoid possible threats, only a key, a lifetime, and the parameter $g' = \{g\}_{passwd}$ encrypted by the password are stored in the moving subject. The password can be generated by the user and kept in the user's mind.

We assume that when a moving subject leaves Home, he or she takes along several key tuples downloaded from Host. Note that a pre-shared key can be used to create a secured tunnel only once; we call these one-time keys. Whenever a key is used, the state of its tuple changes from valid to invalid. Besides, the key parameter $g$ remains secure since it is encrypted by the password, so an adversary cannot use it even if he has stolen the key tuples from the user's storage.
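The offline phase is straightforward to prototype. The following is a minimal, illustrative sketch of how a Host might materialize one key tuple; the helper names, the PBKDF2-derived XOR keystream standing in for the paper's unspecified cipher, and the record layout are all our own assumptions, not the authors' implementation.

```python
import os
import time
import hashlib

def protect_g(g_param: bytes, passwd: str) -> bytes:
    """Encrypt the key parameter g under the user's password.

    Illustrative only: a PBKDF2-derived keystream is XORed over g.
    A real deployment would use an authenticated cipher instead.
    """
    stream = hashlib.pbkdf2_hmac("sha256", passwd.encode(), b"g-param",
                                 100_000, dklen=len(g_param))
    return bytes(a ^ b for a, b in zip(g_param, stream))

def make_key_tuple(key_id: int, g_param: bytes, passwd: str,
                   lifetime_s: int = 86_400):
    """Build one one-time key tuple.

    The Key Database keeps the full tuple (key, passwd, lifetime, g);
    the moving subject only carries (key, lifetime, g' = {g}_passwd).
    """
    key = os.urandom(32)                       # the one-time key K_CH^keyID
    expires = int(time.time()) + lifetime_s
    db_record = {"keyID": key_id, "key": key, "passwd": passwd,
                 "expires": expires, "g": g_param, "state": "valid"}
    token_record = {"keyID": key_id, "key": key, "expires": expires,
                    "g_enc": protect_g(g_param, passwd)}  # g only encrypted
    return db_record, token_record
```

Note how the design goal from the text shows up directly: the moving subject's token never stores $g$ in the clear, so a stolen token is useless without the memorized password.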

3.2 Online Phase: Mutual Authentication with Key Agreement

When a communication is sensitive, the user might need a secure tunnel to protect the data transmitted between the Client and the Target. Since the user could be in any possible environment, he or she might use equipment with different levels of resources, so the constructed secure tunnel shall take the resources of the equipment into consideration. Figure 2 shows the sequence of events for the construction of the secured tunnel, where the numbers correspond to the logical order of the events. We assume that the communications between Host and Target go through a secured connection. The establishment of a secured tunnel between Client and Target proceeds as follows:

• Step 1: Initial Setup Request: At first, Client asks the user for the password. The Tunnel Agent of Client then uses the password as a key to decrypt $g'$ and recover the key parameter $g$. At that time, Client uses her Tunnel Agent to submit a request to Host for a secured connection. Client randomly generates a $k$-bit nonce $N_C \in_R \{0,1\}^k$ and derives the message digest
$$[C, T, g^{N_C}, User, passwd, IRQ]_{K_{CH}^{keyID}},$$
where $IRQ$ contains the security objective of IPSec and some related information about Client. After that, Client transmits $keyID$, the message digest, and the request information
$$\{C, T, g^{N_C}, User, passwd, IRQ\}_{K_{CH}^{keyID}}$$
to the Host under the protection of the one-time key $K_{CH}^{keyID}$.

• Step 2: Request Validation: The Tunnel Agent of Host uses $K_{CH}^{keyID}$ to decrypt the request and then uses $K_{CH}^{keyID}$ to verify the message digest of the request.

• Step 3: Transmission of the User Information: If the state of the one-time key tuple and the message digest are valid and the password is correct, the state of the key pair in the Key Database is updated from valid to invalid. The Tunnel Agent of Host then passes the keying information (including the ID of the pre-shared parameter $g$, i.e., $gID$) and the user information to Target. Note that we assume Client and Target share a set of key parameters $g$, so Target can use $gID$ to retrieve the key parameter $g$.

• Step 4: User Authentication: The Tunnel Agent of Target authorizes the user to use the services provided by Target.

• Step 5: Generation of Keying Information for the Session-authentication Key: The Tunnel Agent of Target generates the following keying information of the session-authentication key for a new tunnel: a nonce $N_T \in_R \{0,1\}^k$, a secure ID
$$SID_T = g^{N_C} \| g^{N_T},$$
a key for the message digest
$$MK_{CT} = h_1(C \| T \| SID_T \| (g^{N_C})^{N_T}),$$
and a session-authentication key
$$K_{CT} = h_0(C \| T \| SID_T \| (g^{N_C})^{N_T}).$$
Then, the Tunnel Agent of Target performs the configuration for the tunnel. After that, the Tunnel Agent of Target deletes the nonce of Target, $N_T$.

• Step 6: Passing Back of Tunneling Information and the Keying Information for the Session-authentication Key to Client: The keying information of the session-authentication key, $g^{N_T}$, the message digest
$$["1", T, C, SID_T]_{MK_{CT}},$$
and the tunneling information are passed back to the Tunnel Agent of Client.
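Step 5 is essentially a Diffie-Hellman half-exchange followed by two independent hashes. The sketch below illustrates it with SHA-256 domain separation standing in for $h_0$ and $h_1$, and a deliberately tiny demonstration group standing in for the pre-shared parameter $g$; both substitutions are assumptions for illustration only.

```python
import hashlib
import secrets

# Demo group: 2**127 - 1 is prime but far too small for real security;
# a deployment would use a standardized Diffie-Hellman group for g.
P = 2**127 - 1
G = 3
NBYTES = (P.bit_length() + 7) // 8

def h(tag: bytes, *parts: bytes) -> bytes:
    """h_0 / h_1 realized as SHA-256 with domain-separating tags."""
    d = hashlib.sha256(tag)
    for p in parts:
        d.update(p)
    return d.digest()

def target_step5(c: bytes, t: bytes, g_nc: int):
    """Target side of Step 5: derive SID_T, MK_CT, and K_CT, then
    forget the nonce N_T as the protocol requires."""
    n_t = secrets.randbits(126)                          # nonce N_T
    g_nt = pow(G, n_t, P)
    sid_t = g_nc.to_bytes(NBYTES, "big") + g_nt.to_bytes(NBYTES, "big")
    shared = pow(g_nc, n_t, P).to_bytes(NBYTES, "big")   # (g^N_C)^N_T
    mk_ct = h(b"h1", c, t, sid_t, shared)                # MAC-digest key
    k_ct = h(b"h0", c, t, sid_t, shared)                 # session-auth key
    del n_t                                              # Step 5 deletes N_T
    return g_nt, sid_t, mk_ct, k_ct
```

On the Client side, Step 7 mirrors this computation with $(g^{N_T})^{N_C}$, which yields the same `shared` value and therefore the same $MK_{CT}$ and $K_{CT}$.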

Table 1. Notations and abbreviations

$h_0$: A secure cryptographic hash function for generating the session-authentication key.
$h_1$: A secure cryptographic hash function for generating the MAC-digest key.
MAC: A message authentication code allows two parties who have shared a secret key in advance to authenticate their subsequent communication. It is used to generate a tag with which the party sharing the key can efficiently verify a particular message.
$[\cdot]_{MK}$: The computation of a MAC digest using a given key $MK$.
$MK$: A key for MAC digest computation.
$\{\cdot\}_{K_{P_iH}}$: The encryption of a message using the encryption key $K_{P_iH}$ shared by the party $P_i$ and the Host $H$.
$K_{P_iH}$: The shared key between the party $P_i$ and the Host $H$.
$g$: A pre-shared parameter for key establishment.
$gID$: The ID of a pre-shared parameter $g$.
$\{g\}_{passwd}$: The pre-shared parameter $g$ encrypted by the $passwd$.
$C$: The client (user); it could be the IP address of the client.
$T$: The target (private cloud); it could be the IP address of the gateway of the private cloud.
$N_C$: A nonce generated by the client (user).
$N_T$: A nonce generated by the target (private cloud).
$g^{N_C}$: An exponentiation of the parameter $g$ by the nonce $N_C$.
$g^{N_T}$: An exponentiation of the parameter $g$ by the nonce $N_T$.
$User$: The user, e.g., a user name.
$passwd$: The secret password shared between the user and the Host; it can be generated by the user and kept in the user's mind.
$TI$: The tunnel information.
$IRQ$: Request information for tunnel establishment, e.g., the security level and the length of the secret key.
$keyID$: The ID of a key in the key pool.
$K_{CH}^{keyID}$: The one-time key shared between the client and the host.
$SID_C$: The secure ID generated by the client.
$SID_T$: The secure ID generated by the target.
$MK_{CT}$: A key for the message digest.
$K_{CT}$: The session-authentication key shared between the user and the private cloud.

• Step 7: Derivation of the Session-authentication Key and Setup of the Secure Tunnel: The Tunnel Agent of Client generates the following keying information of the session-authentication key for the new tunnel:
$$SID_C = g^{N_C} \| g^{N_T}.$$
It also generates the key for the message digest
$$MK_{CT} = h_1(C \| T \| SID_T \| (g^{N_T})^{N_C}).$$
Then, it verifies the received message digest
$$["1", T, C, SID_T]_{MK_{CT}}.$$
If the message digest is correct, it generates the session-authentication key
$$K_{CT} = h_0(C \| T \| SID_T \| (g^{N_T})^{N_C})$$
and deletes the nonce $N_C$. After that, the Tunnel Agent uses the session-authentication key to set up a secure tunnel.

• Step 8: Passing of the Message Digest: The Tunnel Agent of Client sends the message digest
$$["2", T, C, SID_T]_{MK_{CT}}$$
to Target (a sketch of this digest computation follows the list).

• Step 9: Verification and Final Tunneling Setup: The Tunnel Agent of Target verifies the received message digest
$$["2", T, C, SID_T]_{MK_{CT}}.$$
If the MAC digest is correct, it accepts the session-authentication key $K_{CT}$ and sets up the secure tunnel for secure communication.

• Step 10: Establishment of a Secured Tunnel (IKE Negotiation): Client and Target use the pre-shared model to invoke the secure tunnel. Then, they use the IKE protocol to generate the session key for communication.

• Step 11: Successful Construction of the Secured Tunnel.
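The digests exchanged in Steps 6, 8, and 9 can be treated as keyed MACs over the concatenated fields. Below is a hedged sketch using HMAC-SHA256 as a stand-in for the paper's unspecified MAC construction; the function names are illustrative.

```python
import hmac
import hashlib

def tunnel_digest(tag: bytes, t: bytes, c: bytes, sid: bytes,
                  mk_ct: bytes) -> bytes:
    """Compute ["1" or "2", T, C, SID]_{MK_CT} as an HMAC over the
    concatenated fields (HMAC-SHA256 stands in for the paper's MAC)."""
    return hmac.new(mk_ct, tag + t + c + sid, hashlib.sha256).digest()

def verify_digest(received: bytes, tag: bytes, t: bytes, c: bytes,
                  sid: bytes, mk_ct: bytes) -> bool:
    """Step 9 style check: recompute the digest and compare it in
    constant time before accepting K_CT and finishing the setup."""
    return hmac.compare_digest(received,
                               tunnel_digest(tag, t, c, sid, mk_ct))
```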

3.3 Session Key Renewal for Non-IPSec Environments

Our proposed mechanism also supports a non-IPSec scheme. To make this work, the session-authentication key can be used as the encryption key for the first session; after that, a session key agreement is needed to renew the session key. Our session key renewal process is as follows:

• Step 1: Initial Renewal Request: At first, Client randomly generates a $k$-bit nonce $N'_C \in_R \{0,1\}^k$. Then it derives the message digest
$$[C, T, g^{N'_C}]_{K_{CT}}$$
and sends it to the Target under the protection of the session key $K_{CT}$.

Figure 2. Steps for the set-up of a user-mobility VPN for private cloud

• Step 2: Generation of Keying Information for the Session Key: The Tunnel Agent of Target generates the following keying information of the new session key: a nonce $N'_T \in_R \{0,1\}^k$, a secure ID
$$SID'_T = g^{N'_C} \| g^{N'_T},$$
a key for the message digest
$$MK'_{CT} = h_1(C \| T \| SID'_T \| (g^{N'_C})^{N'_T}),$$
and a new session key
$$K'_{CT} = h_0(C \| T \| SID'_T \| (g^{N'_C})^{N'_T}).$$
Then, the Tunnel Agent of Target deletes the nonce of Target, $N'_T$.

• Step 3: Passing Back the Keying Information for the Session Key to Client: The keying information of the new session key, $g^{N'_T}$, the message digest
$$["1", T, C, SID'_T]_{MK'_{CT}},$$
and the tunneling information are passed back to the Tunnel Agent of Client.

• Step 4: Derivation of the Session Key: The Tunnel Agent of Client generates the keying information of the new session key, $SID'_C = g^{N'_C} \| g^{N'_T}$, and the key for the message digest
$$MK'_{CT} = h_1(C \| T \| SID'_T \| (g^{N'_T})^{N'_C}).$$
Then, it verifies the received message digest
$$["1", T, C, SID'_T]_{MK'_{CT}}.$$
If the message digest is correct, it generates the new session key
$$K'_{CT} = h_0(C \| T \| SID'_T \| (g^{N'_T})^{N'_C})$$
and deletes the nonce $N'_C$.

Figure 3. Session key renewal for private cloud

• Step 5: Passing of the Message Digest: The Tunnel Agent of Client sends the message digest
$$["2", T, C, SID'_T]_{MK'_{CT}}$$
to Target.

• Steps 6 & 7: Verification and Renewal of the Session Key: The Tunnel Agent of Target verifies the received message digest
$$["2", T, C, SID'_T]_{MK'_{CT}}.$$
If the MAC digest is correct, it accepts the session key $K'_{CT}$ and uses it to protect the secure tunnel.

4. SECURITY ANALYSIS

Several good properties are held by the proposed protocol. The user authentication property (Analysis 1) and the mutual authentication property (Analysis 2) are held by the proposed protocol, and forward secrecy (Analysis 5) is held as well. The proposed protocol can also resist the guessing attack (Analysis 3) and the replay attack (Analysis 4), and it provides known session key security (Analysis 6).

Analysis 1 (User authentication). The authentication property is held by the proposed protocol.

Proof. The authentication procedure follows the chains of responsibility in [16]. We can observe a chain of responsibility, as shown in Figure 4, running from the request at Client to the Target resource at the other end. Client's access request (e.g., a request by Alice) is granted because the request comes with request information and is encrypted by the public key of Client's key pair. Host certifies the request that is encrypted with Client's key for Alice@Host. Target's membership database shows that Alice@Host is a legal user, and the access control list on the Target shows that Client, i.e., Alice, has the authority to own a secured tunnel. The tunnel agent of Target then helps in the establishment of a new tunnel.

Figure 4. Chain of responsibility

Definition 1 (Mutual Authentication). Mutual authentication, also called two-way authentication, is a process in which both entities in a communications link authenticate each other. It is a strong security feature for network services.

Analysis 2 (Mutual Authentication). The mutual authentication property is held by the proposed protocol.

Proof. In the protocol, the goal of mutual authentication is to generate an agreed session key $K_{CT}$ between Client and Target for the short-term authentication of the secure tunnel construction. In Step 4 of the on-line phase, after Target receives the encrypted message
$$\{C, T, g^{N_C}, User, IRQ, gID\}_{K_{HT}}$$
and the message digest
$$[C, T, g^{N_C}, User, IRQ, gID]_{K_{HT}}$$
from Host, it checks whether the encrypted message contains the identity $User$. Since the identity $User$ is encrypted by the secret key $K_{HT}$ shared between Host and Target, Target believes that the random value $g^{N_C}$ was originally sent from Client, verified by the Host in Step 2, and then forwarded to it. Target can then compute the session key
$$h_0(C \| T \| SID_T \| (g^{N_C})^{N_T}).$$
In Step 7, Client receives the message $g^{N_T}$, the encrypted tunneling information $\{TI\}_{K_{CT}}$, and the message digest
$$["1", T, C, SID_T]_{MK_{CT}}.$$
Since the hashed message includes the shared
$$SID_C = g^{N_C} \| g^{N_T} = SID_T$$
between Client and Target, Client believes that the random value $g^{N_T}$ was originally sent from Target. Client can then compute the session key
$$K_{CT} = h_0(C \| T \| SID_C \| (g^{N_T})^{N_C}).$$
Therefore, the proposed protocol provides mutual authentication.

Analysis 3 (Resist Guessing Attack). The proposed protocol can resist guessing attacks.

Proof. The undetectable on-line password guessing attack fails because, after Step 2 of our protocol, Host can authenticate Client. The off-line password guessing attack does not work against the protocol either, since the password $passwd$ is only used for protecting the corresponding secret parameter $g$, and no verifiable information is encrypted by passwords. Also, the secret $g = Decrypt\{g\}_{passwd}$ is stored in the user's token; only the legal user at Client, who keeps the password $passwd$ in his or her mind, can decrypt it and then use the secret parameter $g$. Therefore, the proposed protocol can resist guessing attacks.

Definition 2 (Replay Attack). A replay attack is a form of network attack in which a valid data transmission is maliciously or fraudulently repeated or delayed. Session tokens are usually adopted to avoid replay attacks.

Analysis 4 (Resist Replay Attack). The proposed protocol is secure and can resist replay attacks.

Proof. In the proposed protocol, replay attacks fail because the freshness of the messages transmitted in the session key agreement phase is provided by the exponents $g^{N_C}$ and $g^{N_T}$. Except for Client, only Target, who can compute the session key $K_{CT}$, can embed the nonce $N_C$ and the session key $K_{CT}$ generated by Target in the hashed message
$$h_0(C \| T \| SID_T \| (g^{N_C})^{N_T})$$
of Step 5. Except for Target, only Client, who can compute the session key $K_{CT}$, can embed the nonce $g^{N_T}$ and the session key $K_{CT}$ generated by Client in the hashed message
$$h_0(C \| T \| SID_C \| (g^{N_T})^{N_C})$$
of Step 7. Therefore, the proposed protocol can resist replay attacks.

Definition 3 (Forward Secrecy). A protocol holds forward secrecy if compromise of the private keys of the participating entities does not affect the security of the previous session keys. In addition, there are two aspects related to forward secrecy, i.e., perfect forward secrecy (p-FS) and master key forward secrecy (m-FS). p-FS means that the compromise of both A's and B's long-term private keys does not affect the secrecy of the previously established session keys. m-FS is satisfied if the session key secrecy still holds even when the server's master key is compromised.

Analysis 5 (Forward secrecy). The proposed protocol satisfies forward secrecy.

Proof. The proposed protocol satisfies both p-FS and m-FS: a disclosed secret key $g$ cannot derive a previously used session key
$$K_{CT} = h_0(C \| T \| SID_T \| (g^{N_C})^{N_T}),$$
because without the used random exponents $N_C$ and $N_T$, nobody can compute the used session key $K_{CT}$. Even if an attacker eavesdrops on all conversations on the medium and obtains some used random values $g^{N_C}$ and $g^{N_T}$, he or she still cannot compute the used session key $K_{CT}$; this is the Diffie-Hellman problem underlying the key exchange. Therefore, the proposed protocol provides forward secrecy.

Definition 4 (Known Session Key Security). A protocol provides known session key security if an adversary, having obtained some previous session keys, cannot get the session keys of the current run of the key agreement protocol.

Analysis 6 (Known Session Key Security). The proposed protocol is secure with respect to known session key security.

Proof. In our protocol, the agreed session key relies on the one-way hash function and the session secrets. The output of the hash function is distributed uniformly in $\{0,1\}^k$; thus a session key, being the output of the hash function, has no relation to the others. Besides, the session key is generated with the session secrets, which are computed from the random ephemeral keys; thus, even if one session's secrets are revealed, the other session secrets remain safe.

The man-in-the-middle attack (MIA) is one of the major attacks in peer-to-peer systems, so it is also desirable that the proposed protocol be secure against MIA in a cloud service. In general, an authentication protocol is secure against an MIA if the MIA cannot establish any session key with either the sender or the receiver. Since the proposed protocol provides mutual authentication, it can verify that the transmitted messages were sent by each party. Besides, our protocol is a Station-to-Station-like protocol [8]. Thus, the protocol can resist MIA.
5. CONCLUSION

This paper presents a tunnel establishment scheme with strong authentication for the private cloud. We exploit an extension of the IPSec implementations to provide secured tunnels for users who might move around dynamically, and a session key renewal protocol is also presented. Instead of requiring a full-scale Public Key Infrastructure, a lower-cost and flexible way of joining a virtual private network is proposed, and an AAA-based negotiation procedure is proposed for the implementations. The implementation of the proposed algorithm is simple and highly portable. In addition, since a private cloud is a combination of computing resources, resource constraints are given less priority than providing high security to the cloud; therefore, this work does not include a performance comparison with existing schemes. The proposed scheme holds many important security properties, such as mutual authentication, forward secrecy, and known session key security, and it can resist many popular attacks, such as the replay attack, the guessing attack, and the man-in-the-middle attack. The related analytical properties are presented.

For future research, we shall explore dynamic routing methods for security enhancement. The integration of cloud-related services with the proposed protocol will also be a focus of this study.

6. ACKNOWLEDGEMENTS

This work was supported by the National Science Council under grants NSC 101-2218-E-025-001, NSC 100-2221-E-390-012, and NSC 101-2221-E-390-007.

7. REFERENCES

[1] C. S. Alliance. Security guidance for critical areas of focus in cloud computing v3, 2012.
[2] J. Askhoj, M. Nagamori, and S. Sugimoto. Archiving as a service: a model for the provision of shared archiving services using cloud computing. In Proceedings of the 2011 iConference, pages 151-158, New York, NY, USA, 2011. ACM.
[3] A. Bhargav-Spantzel, A. Squicciarini, and E. Bertino. Privacy preserving multi-factor authentication with biometrics. In Proceedings of the Second ACM Workshop on Digital Identity Management, DIM '06, pages 63-72, New York, NY, USA, 2006. ACM.
[4] M. Blaze, J. Ioannidis, and A. D. Keromytis. Trust management for IPsec. ACM Trans. Inf. Syst. Secur., 5(2):95-118, May 2002.
[5] J. Bonneau. Getting web authentication right: a best-case protocol for the remaining life of passwords. In Proceedings of the 19th International Conference on Security Protocols, SP'11, pages 98-104, Berlin, Heidelberg, 2011. Springer-Verlag.
[6] A. Celesti, F. Tusa, M. Villari, and A. Puliafito. Security and cloud computing: Intercloud identity management infrastructure. In Proceedings of WETICE '10, pages 263-265, Washington, DC, USA, 2010. IEEE Computer Society.
[7] C.-C. Chang, Y.-F. Lu, A.-C. Pang, and T.-W. Kuo. Design and implementation of SIP security. In Proceedings of ICOIN'05, pages 669-678, Berlin, Heidelberg, 2005. Springer-Verlag.
[8] W. Diffie, P. C. Van Oorschot, and M. J. Wiener. Authentication and authenticated key exchanges. Des. Codes Cryptography, 2(2):107-125, June 1992.
[9] F. Gens. New IDC IT cloud services survey: Top benefits and challenges. In IDC Predictions 2010 Webcast, December 2009.
[10] N. Hallqvist and A. D. Keromytis. Implementing Internet key exchange (IKE). In Proceedings of the USENIX Annual Technical Conference, ATEC '00, Berkeley, CA, USA, 2000. USENIX Association.
[11] F. Hao, T. V. Lakshman, S. Mukherjee, and H. Song. Secure cloud computing with a virtualized network infrastructure. In Proceedings of HotCloud'10, Berkeley, CA, USA, 2010. USENIX Association.
[12] D. Huang, X. Zhang, M. Kang, and J. Luo. MobiCloud: Building secure cloud framework for mobile computing and communication. In Proceedings of SOSE '10, pages 27-34, Washington, DC, USA, 2010. IEEE Computer Society.

[13] X. Jin, E. Keller, and J. Rexford. Virtual switching without a hypervisor for a more secure cloud. In Proceedings of Hot-ICE'12, Berkeley, CA, USA, 2012. USENIX Association.
[14] H.-S. Kim, S.-W. Lee, and K.-Y. Yoo. ID-based password authentication scheme using smart cards and fingerprints. SIGOPS Oper. Syst. Rev., 37(4):32-41, Oct. 2003.
[15] C.-F. Kuo, A.-C. Pang, and S.-K. Chan. Dynamic routing with security considerations. IEEE Transactions on Parallel and Distributed Systems, 20(1):48-58, 2009.
[16] B. Lampson. Computer security in the real world. Computer, 37(6):37-46, 2004.
[17] C.-F. Liao, Y.-F. Lu, A.-C. Pang, and T.-W. Kuo. A secure routing protocol for wireless embedded networks. In Proceedings of RTCSA '08, pages 421-426, Washington, DC, USA, 2008. IEEE Computer Society.
[18] W.-H. Liao and S.-C. Su. A dynamic VPN architecture for private cloud computing. In Proceedings of UCC '11, pages 409-414, Washington, DC, USA, 2011. IEEE Computer Society.
[19] S.-H. Liu, Y.-F. Lu, C.-F. Kuo, A.-C. Pang, and T.-W. Kuo. The performance evaluation of a dynamic configuration method over IPSEC. In Proceedings of the 24th IEEE Real-Time Systems Symposium: Works in Progress Session (RTSS WIP), 2003.
[20] Y.-F. Lu, P.-H. Lin, S.-S. Ye, R.-S. Wang, and S.-C. Chou. A strong authentication with key agreement scheme for web-based collaborative systems. In Proceedings of ISPACS 2012. IEEE Computer Society, 2012.
[21] P. Mell and T. Grance. Effectively and Securely Using the Cloud Computing Paradigm. National Institute of Standards and Technology, Information Technology Laboratory, 2009.
[22] C. Modi, D. Patel, B. Borisaniya, A. Patel, and M. Rajarajan. A survey on security issues and solutions at different layers of cloud computing. J. Supercomput., 63(2):561-592, Feb. 2013.
[23] A. P. Muniyandi, R. Ramasamy, and Indrani. Password based remote authentication scheme using ECC for smart card. In Proceedings of ICCCS '11, pages 549-554, New York, NY, USA, 2011. ACM.
[24] D. Nunez, I. Agudo, P. Drogkaris, and S. Gritzalis. Identity management challenges for intercloud applications. In Proceedings of the 1st International Workshop on Security and Trust for Applications in Virtualised Environments (STAVE 2011), Crete, Greece, June 2011.
[25] O. Rodeh, K. P. Birman, and D. Dolev. Dynamic virtual private networks. Technical report, Cornell University, March 1999.
[26] M. Pan, P. Li, Y. Song, Y. Fang, and P. Lin. Spectrum clouds: A session based spectrum trading system for multi-hop cognitive radio networks. In INFOCOM, pages 1557-1565, 2012.
[27] R. Sinnott, J. Jiang, J. Watt, and O. Ajayi. Shibboleth-based access to and usage of grid resources. In Proceedings of GRID '06, pages 136-143, Washington, DC, USA, 2006. IEEE Computer Society.
[28] R. Song. Advanced smart card based password authentication protocol. Computer Standards & Interfaces, 32(5-6):321-325, 2010.
[29] C. Steele. Paranoid penguin: two-factor authentication. Linux J., 2005(139):10, Nov. 2005.
[30] S.-T. Sun, E. Pospisil, I. Muslukhov, N. Dindar, K. Hawkey, and K. Beznosov. What makes users refuse web single sign-on? An empirical investigation of OpenID. In Proceedings of SOUPS '11, pages 4:1-4:20, New York, NY, USA, 2011. ACM.
[31] J. Szefer, E. Keller, R. B. Lee, and J. Rexford. Eliminating the hypervisor attack surface for a more secure cloud. In Proceedings of CCS '11, pages 401-412, New York, NY, USA, 2011. ACM.
[32] C.-L. Tsai, U.-C. Lin, A. Y. Chang, and C.-J. Chen. Information security issue of enterprises adopting the application of cloud computing. In Proceedings of NCM 2010, pages 645-649, 2010.
[33] M. Villari, F. Tusa, A. Celesti, and A. Puliafito. How to federate vision clouds through SAML/Shibboleth authentication. In Proceedings of ESOCC'12, pages 259-274, Berlin, Heidelberg, 2012. Springer-Verlag.
[34] M. B. Vleju. A client-centric ASM-based approach to identity management in cloud computing. In Proceedings of ER'12, pages 34-43, Berlin, Heidelberg, 2012. Springer-Verlag.
[35] J.-H. Yang and C.-C. Chang. An ID-based remote mutual authentication with key agreement scheme for mobile devices on elliptic curve cryptosystem. Computers & Security, 28(3-4):138-143, 2009.
[36] D. Zissis and D. Lekkas. Addressing cloud computing security issues. Future Gener. Comput. Syst., 28(3):583-592, Mar. 2012.

ABOUT THE AUTHORS:

Yung-Feng Lu received the B.S. and M.S. degrees from the Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C., in 1998 and 2000, respectively, and the Ph.D. degree from the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. He is currently an Assistant Professor in the Department of Computer Science and Information Engineering at National Taichung University of Science and Technology, Taichung. His research interests include networking, embedded systems, databases, storage, neural networks, cloud computing, and information security.

Chin-Fu Kuo received his BS and MS degrees from the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan, ROC, in 1998 and 2000, respectively. He received the Ph.D. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, ROC, in 2005. He joined the Department of Computer Science and Information Engineering (CSIE), National University of Kaohsiung (NUK), Kaohsiung, Taiwan, as an Assistant Professor in 2006. Currently, he is an Associate Professor in CSIE of NUK, Kaohsiung, Taiwan. His research interests include real-time process scheduling, resource management, cloud computing, and network QoS.

An On-line Hot Data Identification for Flash-based Storage using Sampling Mechanism

Dongchul Park, U of Minnesota–Twin Cities, Minneapolis, MN, USA, [email protected]
Young Jin Nam, Oracle Corp., Santa Clara, CA, USA, [email protected]
Biplob Debnath, NEC Lab., Princeton, NJ, USA, [email protected]
David H.C. Du, U of Minnesota–Twin Cities, Minneapolis, MN, USA, [email protected]
Youngkyun Kim, ETRI, Daejeon, South Korea, [email protected]
Youngchul Kim, ETRI, Daejeon, South Korea, [email protected]

ABSTRACT

Efficient hot and cold data identification in computer systems is a fundamental issue, yet it has been little investigated. In this paper, we propose a novel on-line hot data identification scheme for flash-based storage named HotDataTrap. The main idea is to maintain a working set of potential hot data items in a cache based on a sampling mechanism. This sampling-based scheme enables HotDataTrap to discard some of the cold items early, so that it can reduce runtime overheads as well as wasted memory space. Moreover, our two-level hash indexing scheme helps HotDataTrap directly look up a requested item in the cache and saves further memory space by exploiting spatial localities. Both our sampling approach and our hierarchical hash indexing scheme empower HotDataTrap to identify hot data precisely and efficiently with even less memory. Our extensive experiments with various realistic workloads demonstrate that HotDataTrap outperforms the state-of-the-art scheme by an average of 335%, and that our two-level hash indexing scheme further improves HotDataTrap performance by up to 50.8%.

Categories and Subject Descriptors

D.4.7 [Operating Systems]: Organization and Design - Real-time systems and embedded systems

General Terms

Design, Performance

Keywords

Hot Data Identification, HotDataTrap, SSD

1 Introduction

Over the years, researchers have explored various alternative storage technologies to possibly replace conventional magnetic disk drives. With the recent technological breakthroughs, NAND flash-based storage has emerged to overcome their shortcomings. These days, flash-based storage devices such as Solid State Drives (SSDs) have been increasingly adopted as main storage media, especially in mobile devices such as laptops, and even in data centers [1, 2, 3, 4, 5, 6]. Although they provide a block I/O interface like magnetic disk drives, due to the lack of mechanical head movement, not only is random read performance comparable to sequential read performance, but shock resistance, light weight, and low power consumption are also achieved. However, flash memory has two main limitations related to the write operation. First, once data are written to a flash page, an erase operation is required to update them, and an erase is almost 10 times slower than a write and 100 times slower than a read [7]. Thus, frequent erases severely degrade the overall performance of flash-based storage. Second, each flash block cell can be erased only a limited number of times (i.e., 10K-100K) [7].

To get over these limitations, flash-based storage uses an intermediate software/hardware layer named the Flash Translation Layer (FTL), which hides these limitations so that existing disk-based applications can use flash memory without any modifications. Internally, however, the FTL needs to deal with these physical limitations. To avoid an expensive erase operation on every logical page update, the FTL treats flash memory as a log device: it stores updated data to a new physical page, marks the old physical page as invalid for future garbage collection, and maintains logical-to-physical page address mapping information in order to track the latest location of each logical page [8]. Periodically, the FTL invokes garbage collection to reclaim the free space resulting from the earlier update operations. In addition, during garbage collection, the FTL employs wear leveling algorithms to ensure that all blocks wear out evenly, in order to improve the overall longevity of the flash memory.

One of the critical issues in designing flash-based storage systems and integrating them into the storage hierarchy is how to effectively identify hot data that will be frequently accessed in the near future as well as data that have been frequently accessed so far. Since future access information is not given a priori, we need an on-line algorithm that can predict future hot data based on the past behaviors of a workload. Thus, if any data have been accessed recently more than a threshold number of times, we can consider them hot data; otherwise, cold data.

Copyright is held by the authors. This work is based on an earlier work: SAC'12 Proceedings of the 2012 ACM Symposium on Applied Computing, Copyright 2012 ACM 978-1-4503-0857-1/12/03. http://doi.acm.org/10.1145/2245276.2232034
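To make the out-of-place update policy described above concrete, here is a minimal sketch of the mapping-table bookkeeping; this is our own illustration, not the authors' or any particular vendor's FTL, and the class and field names are hypothetical.

```python
class ToyFTL:
    """Minimal sketch of an out-of-place update policy: writes go to a
    fresh physical page, the old page is invalidated for later garbage
    collection, and a logical-to-physical map tracks the latest copy."""

    def __init__(self, num_pages: int):
        self.l2p = {}                       # logical page -> physical page
        self.invalid = set()                # pages awaiting garbage collection
        self.free = list(range(num_pages))  # free physical pages

    def write(self, lpn: int) -> int:
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])  # mark the old page invalid
        ppn = self.free.pop(0)               # never overwrite in place
        # (assumes free pages remain; garbage collection is not modeled)
        self.l2p[lpn] = ppn
        return ppn
```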

This definition of hot, however, is still ambiguous, and it takes only frequency (i.e., how many times the data have been accessed) into account. There is another important factor for identifying hot data: recency (i.e., when the data were accessed). In general, many access patterns in workloads exhibit high temporal locality [9]; therefore, recently accessed data are more likely to be accessed again in the near future. This is the rationale for including the recency factor in hot data identification. The definition of hot can differ for each application. We now briefly describe some applications of a hot data identification scheme in the context of flash-based storage systems.

Garbage Collection and Wear Leveling: Both have a critical impact on the access latency and lifetime of flash memory. By collecting hot data into the same block, we can considerably improve garbage collection efficiency. In addition, we can improve reliability by allocating hot data to the flash blocks with a low erase count [10, 11, 9, 12, 2].

Flash Memory as a Cache: There are some attempts [13, 14] to use flash memory as an extended cache between DRAM and HDD. In this hierarchy, they keep track of page histories to decide the placement and migration of data pages between DRAM and flash memory by adopting the hot data identification concept.

Hybrid SSD: Hybrid SSDs, such as SLC-MLC hybrids [15] and Flash-PCM hybrids [16], are another application. Hot data can be stored in SLC (Single-Level Cell) flash or PCM (Phase Change Memory), while cold data are stored in MLC (Multi-Level Cell) or NAND flash. They can consequently improve both the performance and the reliability of SSDs.

Address Mapping Scheme: Convertible-FTL (CFTL) is a novel hybrid address mapping algorithm designed for flash-based storage [17]. CFTL classifies all flash blocks into hot and cold blocks according to data access patterns. It improves address translation efficiency in that hot data are managed by a page-level mapping, while cold data are maintained by a block-level mapping.

Buffer Replacement Algorithm: The LRU-WSR algorithm is specially designed for flash memory, considering its asymmetric read and write latencies [18]. For victim selection, if a page selected by the LRU algorithm is classified as clean and cold, it is chosen as the victim; if it is classified as dirty and hot, it is moved to the MRU position instead.

Sensor Networks: FlashDB is a flash-based B-Tree index structure optimized for sensor networks [5]. In FlashDB, a B-Tree node can be stored either in a disk mode or in a log mode, and these mode selections are based on the hot data identification algorithm.

Shingled Magnetic Recording Disks: A Shingled Magnetic Recording (SMR) HDD is a magnetic hard disk drive that adopts shingled magnetic recording technology to overcome the areal density limit faced by conventional hard disk drives [19]. Although this is not flash-based storage, its physical characteristics are very similar to NAND flash [20, 21, 22, 23]. Recently, Lin et al. proposed H-SWD, which adopts a hot data identification mechanism to improve SMR HDD performance [24, 25].

In addition to these applications, hot data identification has great potential to be utilized in many other fields. Although it has a great impact on the design and integration of flash memory, it has been little investigated [26]. The challenges are not only classification accuracy, but also the limited resources (i.e., SRAM) and computational overheads. Existing schemes for flash memory either suffer from high memory consumption [27] or incur excessive computational overheads [28]. To overcome these shortcomings, Hsieh et al. recently proposed a multiple hash function framework [29]. It adopts a counting bloom filter and multiple hash functions so that it can reduce memory consumption as well as computational overheads. However, it considers only a frequency factor and cannot identify hot data accurately, due to the false positive probability that is an inborn limitation of a bloom filter. Moreover, it depends heavily on workload characteristic information (i.e., the average hot data ratio of a workload), which cannot be given a priori to an on-line algorithm.

Considering these observations, an efficient on-line hot data identification scheme has to meet the following requirements: (1) capturing recency as well as frequency information, (2) low memory consumption, (3) low computational overheads, and (4) independence from prerequisite information. Based on these requirements, we propose a novel on-line hot data identification scheme named HotDataTrap. The main idea is to cache and maintain a working set of potential hot data by sampling LBA access requests.

HotDataTrap captures recency as well as frequency by maintaining a recency bit, like the CLOCK cache algorithm [30], which results in more accurate hot data identification. It periodically resets the recency information and decays the frequency information as an aging mechanism. Thus, if an item has not been accessed for a long time, it is eventually considered cold data and becomes a victim candidate for eviction from the cache. Moreover, HotDataTrap achieves direct cache lookup as well as lower memory consumption with the help of its hierarchical two-way hash indexing scheme. Consequently, HotDataTrap provides much more accurate hot data identification than the state-of-the-art scheme [29], even with a very limited memory space. The main contributions of this paper are as follows:

• Sampling-based Hot Data Identification: Unlike all other schemes, HotDataTrap does not initially insert all requested items into the cache. Instead, it decides whether to cache them with a simple sampling algorithm. Once an item (i.e., a requested LBA) is selected for caching, HotDataTrap maintains it as long as it is stored in the cache; otherwise, HotDataTrap simply ignores it from the beginning. This sampling-based scheme enables HotDataTrap to discard some of the cold items early, so that it can reduce the computational overhead and memory consumption.

• Two-Level Hash Indexing: To reduce memory consumption further, HotDataTrap does not maintain the whole metadata information for each accessed LBA in the cache. Instead, it stores partial LBA information by using the hierarchical hash indexing scheme. That is, for sequential accesses, HotDataTrap stores the starting LBA information and only its offset information, so that it can take advantage of spatial localities. Moreover, this hierarchical hash indexing enables HotDataTrap to directly access the corresponding metadata information in the cache, which can significantly reduce the cache lookup overhead.

The rest of the paper is organized as follows. Section 2 gives a brief overview of flash memory and existing hot data identification schemes, and Section 3 describes our proposed HotDataTrap scheme. Section 4 provides our diverse experiments and analyses. Finally, Section 5 concludes this research.

Figure 1. Typical System Architecture of Flash Memory-based Storage Systems. The hot data identification scheme directly affects internal components in the Flash Translation Layer (FTL).

Figure 2. Two-level LRU List Scheme. If data are found in the hot list, they are considered hot data.

2 Related Work

This section describes NAND flash memory characteristics and existing hot data identification schemes for flash memory. In addition, we discuss their advantages and disadvantages.

2.1 Characteristics of Flash Memory

Flash memory is organized as an array of blocks, each of which consists of either 32 pages (small block) or 64 pages (large block). It provides three types of unit operations: read, write, and erase. These operations are asymmetric: the read and write operations are performed on a page basis, while the erase operates on a block basis. Data erase (1,500µs) is the most expensive operation in flash memory, compared to data read (25µs) and write (200µs) [7].

Figure 1 shows a typical architecture of flash memory-based storage systems. The Memory Technology Device (MTD) and the Flash Translation Layer (FTL) are the two major parts of the flash memory architecture. The MTD provides the three aforementioned primitive flash operations. The FTL plays a role in address translation between Logical Block Addresses (LBAs) and Physical Block Addresses (PBAs), so that users can utilize the flash memory with existing conventional file systems without significant modification. A typical FTL is largely composed of an address allocator and a cleaner. The address allocator deals with address translation, and the cleaner tries to collect blocks filled with invalid data pages in order to reclaim them for near-future use [31]. This garbage collection is performed on a block basis, so all valid pages in the victim block must be copied to other clean space before the victim block is erased.

Another crucial issue in flash memory is wear leveling. Its motivation is to prevent cold data from staying in any block for a long time, and its main goal is to minimize the variance among the erase counts of the blocks so that the life span of the flash memory is maximized. Unlike DRAM, each block cell in flash memory has a limited life span: after a limited number of erases to each block, errors can start to occur. Currently, the most common allowable number of writes per block is typically 10K for MLC (Multi-Level Cell) and 100K for SLC (Single-Level Cell) flash. Wear leveling is largely categorized into dynamic wear leveling and static wear leveling. Dynamic wear leveling tries to balance the block erase counts by allocating hot data to the blocks with low erase counts. However, this cannot guarantee a complete balance of block erase counts, since some blocks with cold data may never be erased. By contrast, static wear leveling can evenly distribute the erase counts by migrating cold data into blocks with high erase counts. Dynamic wear leveling yields a shorter life expectancy than static wear leveling, since a complete balance of block erase counts is hard to achieve. Both garbage collection and wear leveling play a critical role in the longevity and reliability of flash memory. Since both algorithms are fundamentally based on a hot and cold data identification scheme, an efficient hot data identification scheme is critical in flash memory design.

2.2 Hot Data Identification Schemes

This subsection describes the existing hot data identification schemes and explores their merits and demerits.

Chiang et al. [27] classified data into read-only, hot, and cold data for designing a garbage collection algorithm. This scheme separates read-only data from other data and chooses a victim block containing many hot data, since hot data will soon be invalidated, which results in higher reclamation efficiency. For data classification, this scheme keeps track of the hotness of each LBA (Logical Block Address) in a lookup table by maintaining information such as data access timestamps, counters, valid block counters, etc. However, this approach introduces significantly high memory consumption.

Chang et al. proposed a two-level LRU (Least Recently Used) discipline for hot and cold identification [28]. As shown in Figure 2, it maintains two fixed-length LRU lists of LBAs: one is a hot data list and the other is a hot candidate list. When the Flash Translation Layer (FTL) receives a write request, this scheme checks whether the requested LBA already exists in the first-level LRU list (i.e., the hot list). If it is found in the hot list, the data are considered hot data; otherwise, cold data. If the LBA hits the candidate list instead, it is promoted to the hot list. This approach reduces memory space consumption by keeping track of only hot or potentially hot LBAs. However, it still requires considerable running overheads to emulate the LRU discipline, and its performance is sensitive to the sizes of both lists.

To resolve the memory space overheads and computational complexity, Hsieh et al. proposed a Multi-hash function scheme that adopts a counting bloom filter [29]. Since this is the state-of-the-art algorithm, we describe it in more detail. A counting bloom filter (for short, BF) is an extension of a basic BF. The basic BF is widely used for a compact representation of a set of items. According to Broder's definition [32], a BF for representing a set $S = \{x_1, x_2, x_3, \cdots, x_n\}$ of $n$ items is described by $m$ bits, with all bits initially set to 0. A BF employs $k$ independent hash functions $h_1, h_2, \cdots, h_k$ with range $\{1, 2, \cdots, m\}$. Each item $x_i$ passing through the hash functions is remembered by marking the $k$ bits corresponding to the $k$ hash values. If all $k$ bits have already been marked to 1, this indicates that the item $x_i$ has appeared before, assuming $m$ is large enough. Otherwise (i.e., at least one of the $k$ bits has not been marked yet), $x_i$ is regarded as a new item. Although a basic BF can indicate whether an item $x$ has appeared before, it cannot provide frequency information (i.e., how many times $x$ has appeared). To capture this frequency information, a counting BF uses $m$ counters instead of an array of $m$ bits. Consequently, when an item $x_i$ appears, all corresponding counter values are incremented by 1. Among all corresponding counter values, the minimum counter value indicates an upper bound on the frequency of $x_i$, because each of them can also be incremented by other items. If this minimum frequency value is greater than a threshold value, $x_i$ is reported as hot data.

As shown in Figure 3, the Multi-hash function scheme adopts this counting BF concept to identify hot data. When a write request is issued, its LBA is hashed by $K$ hash functions, and the $K$ hash values correspond to bit positions in the BF. Then, the corresponding counters are increased by 1. If all $K$ counter values are larger than the predefined hot threshold value, the data are classified as hot data (note: if any one of the $B$ most significant bits is set to 1, the counter value is greater than or equal to $2^{D-B}$).

Figure 3. Multi-Hash Function Framework. K hash values correspond to bit positions in the bloom filter. If all K counters are greater than the predefined threshold, the data are defined as hot. Here, K=4, D=4 and B=2.

This scheme achieves lower memory space consumption and lower computational complexity; however, an appropriate bloom filter size is a big concern. The bloom filter size is closely related to the number of unique LBAs to be accommodated, which depends heavily on the workload characteristics. If the bloom filter size is not properly configured, performance is severely degraded (this will be addressed in detail in our experiment section). Moreover, its decay interval is based on the average hot data ratios in workloads, which cannot be provided to an on-line algorithm in advance.

Hsieh et al. also presented an approximated hot data identification algorithm named the direct address method (hereafter, DAM) as their baseline algorithm. DAM assumes that unlimited memory space is available to keep track of hot data. It maintains a counter for every LBA and periodically decays all LBA counter values by dividing them by two at once.

APPLIED COMPUTING REVIEW MAR. 2013, VOL. 13, NO. 1 54 Primary ID Sub ID Count Rec 10 (12 bits) (4 bits) (3) (1) Hash 8 Functions 101110101010 0000 4 1 6.85 12 bits 0001 6 1 h1 6 4.92 0010 1 0 LBA 4 ...101110010011 111010011101 0100 3 0 2 0101 6 0 0 0.26 0.05 0.08

0110 1 1 (%) Rates Identification False 0 0111 7 1 F_16bit F_20bit F_24bit MSR Distilled RealSSD 4 bits h2 111101100101 1000 1 1 Figure 5. False LBA Identification Rates for Each Two-Level Hierarchical 1001 1 0 Hash Indexing Workload. (F 16(20, 24)bit stands for Financial1 traces capturing 16(20, 24) LSBs in LBAs respec- tively.) Figure 4. HotDataTrap Framework. This hierar- chical two-way hash indexing scheme enables Hot- MSR RealSSD Distilled Financial1 DataTrap to achieve a direct cache access as well 100% as less memory consumption by exploiting spatial 90% localities. 80% 70% 60% 50% primary ID access takes the following 12 LSBs. This two- 40% level hierarchical hash indexing scheme can considerably re- 30% 20% duce cache lookup overheads by allowing HotDataTrap to 10% directly access LBA information in the cache. Moreover, it 0% Proportion Proportion of accesseach LBA for trace 1 16 31 46 61 76 91

empowers HotDataTrap to exploit spatial localities. That 106 121 136 151 166 181 196 211 226 241 256 271 286 301 316 331 346 361 376 391 406 421 436 451 is, many access patterns in workloads generally exhibit high LBA (unit: 100K) spatial localities as well as temporal localities [9]. For se- quential accesses, HotDataTrap stores the starting LBA in Figure 6. Proportion of LBA Access for Each a sequential access to both the primary ID and the sub ID, Trace. Financial1 traces tend to intensively access and its offset addresses only to the sub IDs, not to the pri- only a very limited space (a dotted square part). mary IDs. Consequently, it can significantly reduce memory space consumption. so that the variance of all accessed LBAs in the workloads is However, this partial LBA capturing can cause a false LBA even smaller than the others. Therefore, capturing 16 LSBs identification problem. To verify that our partial ID can pro- in Financial1 causes a relatively higher false LBA identifi- vide enough capabilities to appropriately identify LBAs, we cation rate than the others. In this case, if we need a more made an attempt to analyze four workloads (i.e., Financial precise identification performance according to the applica- 1, MSR, Distilled, RealSSD). These traces will be explained tions, we can increase our primary ID size, not the sub ID in detail in our evaluation section (Section 4.3). size. Even though we increase our primary ID size, we still can save a memory space with the sub ID by exploiting spa- Figure 5 shows false LBA identification rates for each trace. tial localities. Moreover, although HotDataTrap adopts 16 We first captured the aforementioned 16 LSBs to identify LSBs capturing in our experiments, our scheme outperforms LBAs and then made a one-to-one comparison with the the state-of-the-art scheme even in the Financial1 traces. whole 32-bit LBAs. As plotted in Figure 5, MSR, Distilled, This will be verified in our performance evaluation section and RealSSD traces present very low false LBA identifica- (Section 4.3) later. tion rates, which means we can use only 16 LSBs to identify LBAs. On the other hand, Financial1 trace file shows a 3.2 Overall Working Processes different characteristic from the others. When we captured the 16 LSBs of the Financial1 traces, it exhibited a rela- HotDataTrap works as follows: whenever a write request is tively higher false identification rate than those of the other issued, the request LBA is hashed by two hash functions as three traces. However, when we adopted 24 LSBs instead, described in the section 3.1 to check if the LBA has stored it showed a zero false identification rate, which means the in the cache. If the request LBA hits the cache, the corre- 24 LSBs can perfectly identify all LBAs in the Financial1 sponding counter value is incremented by 1 to capture the traces. To clarify this reason, we made another analysis of frequency and the recency bit is set to 1 for recency cap- those four traces. turing. If the counter value is greater than or equal to the predefined hot threshold value, it is classified as hot data, Figure 6 represents a 100% stacked column chart that is otherwise cold data. In case of a cache miss, HotDataTrap generally used to emphasize the proportion of each data. makes a decision on inserting this new request LBA into the As displayed in this figure, unlike the others, the Financial1 cache by using our sampling-based approach. 
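Since both the lookup above and the ID layout of Section 3.1 hinge on the 16-LSB capture, the ID split reduces to two mask-and-shift operations. A small sketch, assuming a 32-bit LBA as in the text:

```python
def split_lba(lba: int):
    """Drop all but the 16 LSBs of a 32-bit LBA, then split them into
    a 12-bit primary ID (hash-table index) and a 4-bit sub ID
    (offset within the entry)."""
    low16 = lba & 0xFFFF
    primary_id = low16 >> 4   # the 12 bits above the last 4 LSBs
    sub_id = low16 & 0xF      # the last 4 LSBs
    return primary_id, sub_id

# Sequential LBAs share one primary ID and differ only in the sub ID,
# which is how the scheme exploits spatial locality:
assert split_lba(0x12345678)[0] == split_lba(0x1234567F)[0]
```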
Figure 6 presents a 100% stacked column chart, which is generally used to emphasize the proportion of each data series. As displayed in this figure, unlike the others, the Financial1 traces tend to access only a very limited range of LBAs (the dotted square part), whereas the MSR, Distilled, and RealSSD traces access much wider ranges of LBAs. This means Financial1 intensively accesses only a very limited space, so the variance of all accessed LBAs in this workload is even smaller than in the others. Therefore, capturing 16 LSBs in Financial1 causes a relatively higher false LBA identification rate than in the other traces. If an application needs a more precise identification performance, we can increase the primary ID size, not the sub ID size; even with a larger primary ID, we can still save memory space through the sub IDs by exploiting spatial localities. Moreover, although HotDataTrap adopts 16-LSB capturing in our experiments, our scheme outperforms the state-of-the-art scheme even on the Financial1 traces. This will be verified in our performance evaluation section (Section 4.3).

3.2 Overall Working Processes

HotDataTrap works as follows: whenever a write request is issued, the requested LBA is hashed by the two hash functions described in Section 3.1 to check whether the LBA is already stored in the cache. If the request LBA hits the cache, the corresponding counter value is incremented by 1 to capture frequency, and the recency bit is set to 1 to capture recency. If the counter value is greater than or equal to the predefined hot threshold value, the data are classified as hot, otherwise as cold. In the case of a cache miss, HotDataTrap decides whether to insert the new request LBA into the cache by using our sampling-based approach.
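The false-identification analysis above can be reproduced mechanically: truncate each 32-bit LBA to its k least significant bits and count how often distinct LBAs collide on the same partial ID. The sketch below is our own reconstruction of that measurement, not the authors' analysis code.

def false_id_rate(lbas: list[int], k: int) -> float:
    """Fraction of requests whose k-LSB partial ID collides with a
    different, previously seen LBA (cf. the Figure 5 measurement)."""
    mask = (1 << k) - 1
    seen: dict[int, int] = {}   # partial ID -> first full LBA observed
    false_hits = 0
    for lba in lbas:
        partial = lba & mask
        if partial in seen and seen[partial] != lba:
            false_hits += 1     # a different LBA maps to the same ID
        else:
            seen[partial] = lba
    return false_hits / len(lbas)

# e.g., compare 16 vs. 24 captured bits on a trace's LBA stream:
# false_id_rate(trace_lbas, 16) versus false_id_rate(trace_lbas, 24)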

Algorithm 1 HotDataTrap
 1: Input: write requests for LBAs
 2: Output: hot or cold data classification of the LBAs
 3: A write request for an LBA x is issued
 4: if (Cache Hit) then
 5:   Increase the counter for x by 1
 6:   Set the recency bit to 1
 7:   if (Counter ≥ HotThreshold) then
 8:     Classify x as hot data
 9:   end if
10: else
11:   // Cache Miss
12:   if (Passed a sampling test) then
13:     if (Cache is not full) then
14:       Insert x into the cache
15:       Set the counter for x to 1
16:       Set the recency bit to 1
17:     else
18:       // Need to evict
19:       while (The current candidate is not a victim) do
20:         Move to the next candidate in the list
21:         Check if the candidate can be evicted
22:       end while
23:       Evict the victim and insert x into the cache
24:       Set the counter for x to 1
25:       Set the recency bit to 1
26:     end if
27:   else
28:     // If failed in the sampling test
29:     Skip further processing of x
30:   end if
31: end if

[Figure 7. HotDataTrap Victim Selection: victim candidates (e.g., P, N, K, J, ..., R) are drawn from the cached entries (ID, counter, recency bit) located through the hierarchical indexing via hash functions h1 and h2. Our victim candidate list considerably reduces the victim search overheads.]

Unlike other hot data identification schemes or typical cache algorithms, which initially insert all missed items into the cache and then selectively evict useless items afterwards, our HotDataTrap stores only selected data items in the cache from the beginning. Due to the limited cache space, it would be ideal if we could store only items that will become hot in the near future. However, this is impossible unless we can see the future. Even if we try to predict future accesses based on the access history, we must pay extra overheads to manage a considerable amount of past access information. This is not a reasonable solution, especially for a hot data identification scheme, because the scheme must be triggered for every issued write request.

To resolve this problem, HotDataTrap performs sampling with 50% probability, which is conceptually equivalent to tossing a coin. This simple idea is inspired by the intuition that if any LBAs are frequently accessed, they are highly likely to pass the sampling. This sampling-based approach helps HotDataTrap discard infrequently accessed LBAs early. Consequently, it reduces not only memory consumption but also computational overheads. The effectiveness of this sampling-based approach will be verified in our experiments (Section 4.3).

A hot data identification scheme needs an aging mechanism that decays the frequency information. Like the Multi-hash function scheme [29], HotDataTrap periodically divides the access counter values by two for its aging mechanism, which enables hot data identification schemes to partially forget their frequency information as time goes on. Thus, if any data have been heavily accessed in the past but not afterwards, they will eventually become cold data and will soon be selected as victims in the cache.

Moreover, HotDataTrap retains another mechanism for better hot data identification by capturing recency as well as frequency information. A recency capturing mechanism and an aging mechanism look similar, but they are clearly different: the main goal of the aging mechanism is to regularly decay the frequency information over time as well as to prevent a counter overflow problem, whereas a recency capturing mechanism intends to reflect recent access information. The aging mechanism in Multi-hash cannot appropriately capture recency information; for instance, the Multi-hash scheme does not provide any mechanism to check whether data have been recently accessed, because its aging mechanism simply divides all access counts by two periodically. In contrast, the aging mechanism in HotDataTrap periodically resets all recency bits to 0, and its recency capturing mechanism sets the corresponding recency bit to 1 whenever data are accessed.

Figure 7 illustrates our victim selection process. If the cache is full and we need to insert a newly sampled LBA into the cache, HotDataTrap chooses a victim. Performing a sequential search of the cache for victim selection would cause significant overheads, so HotDataTrap maintains a victim candidate list. If an access counter value is smaller than the predefined hot threshold value and its recency bit is reset to 0, the corresponding LBA is classified as a victim candidate. Whenever HotDataTrap executes its decay process, it stores those victim candidates in the list; that is, the victim list is periodically updated at each decay period to reflect the latest information. Thus, when HotDataTrap needs to evict a cold item from the cache, it first selects one victim candidate from the list and directly checks whether the candidate can be removed (i.e., whether it still meets the two aforementioned requirements for eviction). If the candidate is still cold, HotDataTrap deletes the candidate and inserts the new item into the cache. If the candidate has turned into hot data since the last decay period, it cannot be evicted, and HotDataTrap checks another candidate from the list.

The victim candidate list notably reduces the victim search overheads, but it does not guarantee that all candidates in the list are currently victims in the cache, because cold data (i.e., victims at that time) may have changed to hot data since the last decay period. To reduce this error, HotDataTrap periodically updates the candidate information at each decay period. However, even though there may be errors in the victim selection, they are not common, because most workloads generally do not dramatically change their access patterns or workload characteristics and are likely to retain localities [11].

This victim candidate list enables HotDataTrap to directly look up a victim by using our aforementioned two-level hierarchical hash indexing scheme, which can significantly reduce victim search overheads.
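Taken together with Algorithm 1, the bookkeeping described in this section could look like the following Python sketch. It is a minimal illustration under our own assumptions (a flat dictionary in place of the hierarchical index, random.random() as the coin toss, an arbitrary capacity); it is not the authors' implementation.

import random

HOT_THRESHOLD = 4        # hot threshold from Table 1
SAMPLING_RATE = 0.5      # the default 50% "coin toss"
CACHE_CAPACITY = 1024    # assumed; the paper budgets the cache in KB

cache: dict[int, list[int]] = {}   # LBA -> [counter, recency_bit]
victim_candidates: list[int] = []  # rebuilt at every decay

def handle_write(lba: int) -> str:
    """Classify one write request as 'hot' or 'cold' (cf. Algorithm 1)."""
    if lba in cache:                        # cache hit
        entry = cache[lba]
        entry[0] += 1                       # frequency
        entry[1] = 1                        # recency
        return "hot" if entry[0] >= HOT_THRESHOLD else "cold"
    if random.random() < SAMPLING_RATE:     # cache miss: sampling test
        if len(cache) >= CACHE_CAPACITY:
            evict_one()
        cache[lba] = [1, 1]
    return "cold"                           # failed sampling: skip entirely

def decay() -> None:
    """Aging: halve every counter, collect victim candidates (cold and
    not recently accessed), then reset all recency bits."""
    victim_candidates.clear()
    for lba, (counter, recency) in list(cache.items()):
        cache[lba][0] = counter // 2
        if cache[lba][0] < HOT_THRESHOLD and recency == 0:
            victim_candidates.append(lba)
        cache[lba][1] = 0                   # recency reset for next period

def evict_one() -> None:
    """Pop candidates until one is still evictable, i.e., still cold and
    not accessed since the last decay."""
    while victim_candidates:
        lba = victim_candidates.pop()
        entry = cache.get(lba)
        if entry and entry[0] < HOT_THRESHOLD and entry[1] == 0:
            del cache[lba]
            return
    # Fallback (our assumption): scan for the coldest entry when the
    # candidate list is exhausted, which the paper suggests is rare.
    coldest = min(cache, key=lambda k: (cache[k][1], cache[k][0]))
    del cache[coldest]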

Table 1. System Parameters

  System Parameters          Values
  Bloom Filter Size          2^12 (4,096 bits)
  Decay Interval             2^12 (4,096 writes)
  Hot Threshold              4
  Number of Hash Functions   2
  Initial Memory Space       2KB

Table 2. Workload Characteristics

  Trace        Total Requests   Request Ratio (Read:Write)              Workload Features
  Financial1   5,334,987        R: 1,235,633 (22%)  W: 4,099,354 (78%)  OLTP application
  MSR          1,048,577        R: 47,380 (4.5%)  W: 1,001,197 (95.5%)  MSR server traces
  Distilled    3,142,935        R: 1,633,429 (52%)  W: 1,509,506 (48%)  General traces in a laptop
  RealSSD      2,138,396        R: 1,083,495 (51%)  W: 1,054,901 (49%)  General traces in a desktop
3.3 Algorithm

Algorithm 1 provides the pseudocode of our HotDataTrap algorithm. When a write request is issued, if the LBA hits the cache, its corresponding access counter is incremented by 1 and its recency bit is set to 1. If the counter value is equal to or greater than a predefined HotThreshold value, the LBA is identified as hot data, otherwise as cold data. If it does not hit the cache, HotDataTrap performs sampling by generating a random number between 0 and 1. If the number passes the sampling test (i.e., falls within the sampling range), the LBA is ready to be added to the cache. If the cache has space available, HotDataTrap simply inserts it into the cache and sets both the access counter and the recency bit to 1. Otherwise, HotDataTrap needs to select and remove a victim whose access counter value is less than the HotThreshold value and whose recency bit is reset to 0; it then stores the newly sampled LBA in the cache. If the new request fails to pass the sampling test, HotDataTrap simply discards it.

4 Performance Evaluation

This section provides extensive experimental results and comparative analyses.

4.1 Evaluation Setup

We compare our HotDataTrap with the Multi-hash function scheme [29], the state-of-the-art scheme, and a direct address method (referred to as DAM) [29]. As the solution to the counter overflow problem in both the Multi-hash scheme and HotDataTrap, we select a freezing approach (i.e., when an LBA access counter reaches its maximum value due to heavy accesses, the counter is not increased any more even though the corresponding LBA is continuously accessed), since, according to our experiments, it showed a better performance than the other approach (i.e., exponential batch decay, which divides all counters by two at once). DAM is the aforementioned baseline algorithm. Table 1 describes the system parameters and their values.

For a fair evaluation, we adopt an identical decay interval (4,096 write requests) as well as an identical aging mechanism (i.e., exponential batch decay) for all schemes. For a more objective evaluation, we employ four realistic workloads (Table 2). Financial1 consists of write-intensive block I/O traces from the University of Massachusetts at Amherst Storage Repository [33]; this trace file was collected from an On-Line Transaction Processing (OLTP) application running at a financial institution. The Distilled trace file [28] represents general and personal usage patterns in a laptop, such as web surfing, documentation work, watching movies, and playing games. It comes from the flash memory research group repository at National Taiwan University, and since the Multi-hash scheme employed only this trace file for its evaluation, we also adopt it for a fair comparison. We also select the Microsoft Research trace (referred to as MSR), made up of 1-week block I/O traces of enterprise servers at Microsoft Research Cambridge Lab [34]; in particular, we selected the prn volume 0 trace since it exhibits a write-intensive workload [35]. Lastly, we adopt a real Solid State Drive (SSD) trace file consisting of 1-month block I/O traces of a desktop computer (AMD X2 3800+, 2GB RAM, Windows XP Pro) in our lab (hereafter referred to as the RealSSD trace file). We installed a Micron C200 SSD (30GB, SATA) in the computer and collected personal traces covering computer programming, running simulations, documentation work, web surfing, watching movies, etc. The total requests in Table 2 correspond to the total number of read and write requests in each trace. These requests can be further subdivided into sub-requests with respect to the LBAs accessed. For example, consider a write request such as WRITE 1000, 5, which means 'write data into 5 consecutive LBAs starting from LBA 1000'. In this case, we regard this request as 5 write requests in our experiments.
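This request-expansion convention is simple to state in code; the following sketch (ours, for illustration) shows it.

def expand_request(start_lba: int, length: int) -> list[int]:
    """WRITE 1000, 5 -> writes to LBAs 1000..1004, counted as 5 requests."""
    return list(range(start_lba, start_lba + length))

assert expand_request(1000, 5) == [1000, 1001, 1002, 1003, 1004]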
4.2 Performance Metrics

We first select the hot ratio to evaluate the performance of the schemes. A hot ratio represents the ratio of hot data to all data. Thus, by comparing the hot ratio of each scheme (HotDataTrap and Multi-hash) with that of the baseline scheme (DAM), we can observe their closeness to the baseline result. However, even when the hot ratios of two schemes are identical, their hot data classification results may not be identical: an identical hot ratio means the same number of hot data relative to all data, which does not necessarily mean that all classification results are identical. Thus, we need a different metric to make up for this limitation. To meet this requirement, we adopt a false identification rate: whenever write requests are issued, we make a one-to-one comparison between the hot data identification results of the two schemes. This allows us to make a more precise analysis. Memory consumption is another important factor, since the SRAM size is very limited in flash memory and we need to exploit it carefully. Lastly, we must take runtime overheads into account. To evaluate these overheads, we measure CPU clock cycles for each representative operation.
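Both metrics are straightforward to compute once each scheme's per-request verdicts are logged. The following is a hypothetical sketch of that bookkeeping, not the paper's evaluation harness.

def hot_ratio(verdicts: list[str]) -> float:
    """Fraction of write requests classified as hot."""
    return verdicts.count("hot") / len(verdicts)

def false_identification_rate(scheme: list[str], baseline: list[str]) -> float:
    """One-to-one comparison against DAM: the share of requests on which
    the scheme's hot/cold verdict disagrees with the baseline's."""
    disagreements = sum(a != b for a, b in zip(scheme, baseline))
    return disagreements / len(baseline)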

4.3 Evaluation Results

• Overall Performance: We first discuss the hot ratios of each scheme under the four realistic workloads. As shown in Figure 8, our HotDataTrap exhibits hot ratios very similar to those of the baseline scheme (DAM), while the Multi-hash scheme tends to show considerably higher hot ratios than DAM. This results from an inborn limitation of bloom filter-based schemes.

[Figure 8. Hot Ratios of Each Scheme under Various Traces: four panels, (a) Financial1, (b) MSR, (c) Distilled, and (d) RealSSD, each plotting the hot ratio (%) of HotDataTrap, Multi-hash, and DAM against the number of write requests (unit: 300K).]

[Figure 9. False Hot Data Identification Rates of Both HotDataTrap and Multi-hash: four panels, (a) Financial1, (b) MSR, (c) Distilled, and (d) RealSSD, each plotting the false identification rate (%) against the number of write requests (unit: 300K).]

[Figure 10. Number of LBA Collisions for Each Unit (4,096 writes) Period: LBA collisions in the Multi-hash bloom filter per decay period under Financial1, MSR, Distilled, and RealSSD.]

[Figure 11. Average Runtime Overheads: (a) overheads for each operation, i.e., average CPU clock cycles of the checkup and decay operations (checkup: 492 cycles for HotDataTrap vs. 487 for Multi-hash; decay: 3,265 vs. 5,508); (b) cache overheads over sampling rates from 10% to 100%.]

If a bloom filter size is not large enough to accommodate all distinct input values, the bloom filter necessarily causes a false positive problem. There exists a fundamental trade-off between the memory space (i.e., the bloom filter size) and the performance (i.e., the false positive rate) of a bloom filter-based scheme. To verify this, we measured the number of LBA collisions for each unit period (i.e., decay period) under those traces. Although the working space of the LBAs is much larger than the size of the bloom filter (4,096) in the Multi-hash scheme, all of them have to be mapped to the bloom filter by the hash functions. Thus, many of them must overlap to the same positions in the bloom filter, which causes hash collisions. As plotted in Figure 10, we observe many collisions (877 on average) for each unit period (4,096 writes). These collisions do not directly correspond to the number of false identifications in Multi-hash, because it can reduce the probability by adopting multiple hash functions. Even so, such frequent collisions necessarily have a considerable effect on the false identification rates of the Multi-hash scheme. Figure 9 demonstrates this analysis well: Multi-hash shows much higher false identification rates than our HotDataTrap, and HotDataTrap makes a more precise decision on hot data identification by an average of 335%.

As we discussed in Section 3.1, due to the distinctive workload characteristics of Financial1 (i.e., it intensively accesses only a very limited working space), it shows a relatively higher false LBA identification rate than the other traces when we capture the 16 LSBs (Least Significant Bits) to identify LBAs. Figure 9 verifies this discussion: Financial1 shows a higher false hot data identification rate than the others. Moreover, according to Figure 5 in Section 3.1, MSR also accesses a narrower working space than Distilled and RealSSD (even though it is wider than that of Financial1). Figure 9 (b) clearly supports this observation by showing that its overall false hot data identification rate is slightly higher than those of Distilled and RealSSD in HotDataTrap. However, even though these two traces (particularly Financial1) display higher false identification rates, they still outperform Multi-hash.

• Runtime Overheads: We evaluate the runtime overheads of the two major operations in a hot data identification scheme: the checkup and decay operations. A checkup operation corresponds to the verification process that checks whether given data are hot or not. A decay operation is the aforementioned aging mechanism that regularly divides all access counts by two. To evaluate the runtime overheads, we measure the CPU clock cycles of each operation. They were measured on a system with an AMD X2 3800+ (2GHz, Dual Core) CPU and 2GB of RAM running the Windows XP Professional platform.
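The paper reports raw CPU clock cycles; as a rough wall-clock stand-in, a measurement that averages repeated runs and discards the cold-cache first run (as the authors' methodology does) could look like the sketch below. The use of time.perf_counter_ns and the function name are our assumptions.

import time

def avg_cost_ns(op, runs: int = 100) -> float:
    """Average wall-clock cost of `op` over `runs` executions, discarding
    the first (cold level-1 cache) run as the paper's methodology does."""
    samples = []
    for _ in range(runs + 1):
        start = time.perf_counter_ns()
        op()
        samples.append(time.perf_counter_ns() - start)
    return sum(samples[1:]) / runs   # drop the cold first run

# e.g., with the earlier sketch: avg_cost_ns(lambda: handle_write(1000))
# or avg_cost_ns(decay)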

[Figure 12. Impact of Memory Spaces: (a) false identification rates and (b) hot ratios of HotDataTrap, Multi-hash, and DAM for memory spaces from 0.5KB to 8KB.]

[Figure 13. Impact of Decay Intervals: false identification rates (%) of (a) HotDataTrap and (b) Multi-hash over the number of write requests (unit: 300K) for decay intervals of 2,048, 4,096, 8,192, and 16,384 write requests.]

For a fair and precise measurement, we fed 100K write requests from the Financial1 traces to each scheme and ran each operation 100 times to measure the average CPU clock cycles. Since the CPU clock count depends heavily on the level-1 cache of the CPU, if both code and data are not in the level-1 cache, the clock count is enormously high. This is the main reason the clock count of the first run is excessively higher than the subsequent counts, which represent the time taken when everything is in the level-1 cache.

As shown in Figure 11 (a), the runtime overhead of a checkup operation in HotDataTrap is almost comparable to that of the Multi-hash checkup operation. Although HotDataTrap requires a few more cycles than the Multi-hash scheme, considering the standard deviations of both schemes (28.2 for Multi-hash and 25.6 for HotDataTrap), the difference is negligible. This result is intuitive because both schemes not only adopt the same number of hash functions but also can directly look up and check the data in the cache. However, the decay operation of HotDataTrap incurs much lower (by 40.7%) runtime overheads than that of Multi-hash, since Multi-hash must decay 4,096 entries while HotDataTrap has to treat considerably fewer entries (approximately 1/3 of Multi-hash, depending on workload characteristics) for each decay interval. For each entry, Multi-hash requires 4 bits, while our scheme needs 20 bits or 8 bits. Assuming HotDataTrap has 1/3 of the Multi-hash entries to decay, its runtime overheads would also be around 1/3 of the Multi-hash overheads. However, our scheme requires extra processing to update its victim candidate list during the decay process. Thus, its overall runtime overheads reach almost 60% of the Multi-hash scheme's. In addition, HotDataTrap must consider one more source of overheads: cache management. Multi-hash does not have this overhead since it uses a bloom filter, while our scheme manages a set of potential hot items in the cache. However, we can significantly reduce these overheads with the help of our sampling approach, since the requests that fail to pass the sampling test are discarded without any further processing. Figure 11 (b) plots the impact of various sampling rates: as the sampling rate grows, the runtime overheads also increase.

• Impact of Memory Spaces: In this subsection, we assign a different memory space to each scheme, from 0.5KB to 8KB, and observe the impact on both schemes. The Multi-hash scheme consumes 2KB (4 × 4,096 bits). As plotted in Figure 12 (a), both schemes exhibit a performance improvement as the memory space grows. However, HotDataTrap outperforms Multi-hash, and moreover, the Multi-hash scheme is more sensitive to the memory space. This result is closely related to the false positives of a bloom filter, since the false positive probability decreases as the bloom filter size increases. On the other hand, HotDataTrap shows a very stable performance even with a smaller memory space, which means our HotDataTrap can be employed even in embedded devices with a very limited memory space. Figure 12 (b) exhibits the hot ratios of each scheme over various memory spaces. All hot ratios increase as the memory size increases because the decay periods also increase accordingly: a longer decay period allows each scheme to accept more requests, which causes higher hot ratios.

• Impact of Decay Intervals: A decay interval has a significant impact on the overall performance. Thus, finding an appropriate decay interval is another important issue in hot data identification scheme design. Multi-hash adopts the formula N ≤ M/(1 − R) for its decay interval, where N is the decay interval, and M and R correspond to its bloom filter size and an average hot ratio of a workload, respectively. The authors state that this formula is based on their expectation that the number of hash table entries (M) can accommodate all LBAs corresponding to cold data within every N write requests (they assume R is 20%). According to the formula, assuming the bloom filter size (M) is 4,096 bits, the bound evaluates to N ≤ 4,096/(1 − 0.2) = 5,120, and they suggested 5,117 write requests as their optimal decay interval. To verify their assumption, we measured the number of unique LBAs in every 5,117 write requests under all four traces. According to our observation (due to space limits, we do not display this plot here), the number of unique LBAs frequently exceeded the bloom filter size of 4,096, which necessarily causes many hash collisions and thus results in a higher false positive probability. Moreover, fixing the average hot ratio of a workload at 20% in advance is not reasonable, since each workload has different characteristics and we cannot know its average hot ratio beforehand. Lastly, even though the authors recommended 5,117 writes as their optimal decay interval, our experiments demonstrated that Multi-hash showed a better performance when we adopted 4,096 writes as its decay interval. Thus, we chose 4,096 writes as its decay interval for both performance and fairness reasons.

Now, to observe the impact of the decay interval on each scheme, we fix the memory space (2KB, i.e., a bloom filter size of 4,096 bits for Multi-hash) for each scheme and change the decay interval from 2,048 to 16,384 write requests. Figure 13 displays the performance change over these decay intervals. As shown in Figure 13 (b), the performance of Multi-hash is very sensitive to the decay interval.

[Figure 14. Performance Improvement with Our Two-Level Hierarchical Hash Indexing Scheme: four panels, (a) Financial1, (b) MSR, (c) Distilled, and (d) RealSSD, each plotting the false identification rate (%) of HotDataTrap and HotDataTrap_WT against the number of write requests (unit: 300K).]

[Figure 15. Performance Change over Sampling Rates under Diverse Workloads: four panels, (a) Financial1, (b) MSR, (c) Distilled, and (d) RealSSD, each plotting the false identification rate (%) against the number of write requests (unit: 300K) for sampling rates of 10%, 30%, 50%, 70%, and 100%.]

Its false hot data identification rate increases almost exponentially as the decay interval grows. For instance, when the interval increases 4 times (from 4,096 to 16,384), its false identification rate increases approximately 10 times (9.93 times). In particular, the false identification rate reaches 72.4% on average when the decay interval is 16,384. This results from the higher false positive probability of the bloom filter.

On the other side, our HotDataTrap shows a stable performance irrespective of the decay interval (Figure 13 (a)), which demonstrates that the decay interval does not have a critical influence on HotDataTrap's performance. Therefore, even though we cannot find an optimal decay interval, this is not a critical issue in HotDataTrap, unlike in Multi-hash.

• Effectiveness of Hierarchical Hash Indexing: HotDataTrap is designed to exploit spatial localities in data accesses by using its two-level hierarchical hash indexing scheme. In this subsection, we explore how much we can benefit from it. To evaluate the benefit, we prepared two different HotDataTrap schemes: our original HotDataTrap (which retains the indexing scheme) and HotDataTrap without the two-level indexing scheme (referred to as HotDataTrap_WT). That is, HotDataTrap_WT always uses a primary ID as well as its sub ID even when the traces show sequential accesses, which can require more memory space than HotDataTrap. Figure 14 compares both performances and demonstrates the effectiveness of our two-level hierarchical hash indexing scheme. As shown in the plots, HotDataTrap improves its performance by an average of 19.7% and up to 50.8% by lowering false hot data identification.

• Impact of Sampling Rates: When a request LBA does not hit the cache, it may or may not be inserted into the cache according to the sampling rate of HotDataTrap. Thus, the sampling rate is another factor affecting the overall HotDataTrap performance. Figure 15 exhibits the performance change for each sampling rate under a variety of traces, and Figure 16 plots the overall impact of the sampling rates. As shown in Figure 16 (a), the average hot ratios for each trace gradually approach those of DAM as the sampling rate grows, since a higher sampling rate reduces the chance of dropping potential hot items by mistake. Although a higher sampling rate may cause higher runtime overheads due to a more frequent cache management process (Figure 11), it can be expected to achieve a better performance, as displayed in Figure 16 (b): false identification rates, intuitively, decrease as the sampling rate grows. Since HotDataTrap with a 10% sampling rate already shows a better performance than the Multi-hash scheme by an average of 65.1%, selecting a 50% sampling rate as the initial sampling rate for HotDataTrap is acceptable. Instead, we can choose a different sampling rate according to application requirements: if an application requires precise hot data identification at the expense of runtime overheads, it can adopt a 100% sampling rate rather than lower ones, and vice versa.

[Figure 16. Impact of Sampling Rates: (a) average hot ratios and (b) average false identification rates for each trace at sampling rates of 10%, 30%, 50%, 70%, and 100% (with DAM as the reference in (a)).]
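For illustration, the sampling rate can be thought of as a single per-application tuning knob; the tiny sketch below (names ours) captures the accuracy-versus-overhead trade-off described above.

import random

def make_sampler(rate: float):
    """Return a sampling test that passes with the given probability."""
    return lambda: random.random() < rate

precise_test = make_sampler(1.0)   # accuracy first, higher overheads
default_test = make_sampler(0.5)   # the paper's default coin toss
light_test = make_sampler(0.1)     # lowest rate evaluated; still beats
                                   # Multi-hash by an average of 65.1%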

5 Conclusion

Hot data identification is of paramount importance in flash-based storage devices, since not only does it have a great impact on their overall performance, but it also has great potential to be applicable to many other fields. In this paper, we proposed a novel and efficient on-line hot data identification algorithm named HotDataTrap. Unlike other existing schemes, HotDataTrap adopts a sampling mechanism. This sampling approach helps HotDataTrap discard some of the cold items early, so that it can reduce runtime overheads as well as wasted memory space. Moreover, its two-level hierarchical hash indexing enables HotDataTrap to directly look up items in the cache and to further reduce memory space by exploiting spatial localities.

We carried out diverse experiments with various realistic workloads. Based on several performance metrics, we examined the runtime overheads and the impacts of memory sizes, sampling rates, and decay periods. In addition, we explored the effectiveness of our hierarchical indexing scheme by comparing two different HotDataTrap schemes, with and without the indexing scheme. Our experiments showed that HotDataTrap achieves a better performance by up to a factor of 20 and by an average of 335%. The two-level hash indexing scheme improves HotDataTrap's performance further, by up to 50.8%. Moreover, HotDataTrap is sensitive to neither its decay interval nor its memory space. Consequently, HotDataTrap can be employed even in small embedded devices with a very limited memory space.

6 Acknowledgments

This work was partially supported by grants from NSF (NSF Awards: 0960833, 0934396, 1127829 and 115471). This research was also supported in part by the GLORY project from the Ministry of Knowledge Economy (MKE: Grant No. KI001703), Korea. GLORY is named from GLObal Resource management sYstem for future internet service. In addition, this work was partially supported by the MKE, Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-C1090-1131-0009). This work was done while the author Young Jin Nam was a visiting professor at the University of Minnesota, Twin Cities.

7 References

[1] A. Caulfield, L. Grupp, and S. Swanson, "Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications," in ASPLOS, 2009.
[2] E. Gal and S. Toledo, "Algorithms and Data Structures for Flash Memories," ACM Computing Surveys, vol. 37, no. 2, 2005.
[3] S. Lee, D. Park, T. Chung, D. Lee, S. Park, and H. Song, "A Log Buffer-based Flash Translation Layer Using Fully-associative Sector Translation," ACM Transactions on Embedded Computing Systems, vol. 6, no. 3, 2007.
[4] H. Kim and S. Ahn, "BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage," in FAST, 2008.
[5] S. Nath and A. Kansal, "FlashDB: Dynamic Self-tuning Database for NAND Flash," in IPSN, 2007.
[6] D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos, and W. A. Najjar, "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices," in FAST, 2005.
[7] N. Agrawal, V. Prabhakaran, T. Wobber, J. Davis, M. Manasse, and R. Panigrahy, "Design Tradeoffs for SSD Performance," in USENIX ATC, 2008.
[8] A. Gupta, Y. Kim, and B. Urgaonkar, "DFTL: a Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings," in ASPLOS, 2009.
[9] L. Chang and T. Kuo, "Efficient Management for Large-Scale Flash-Memory Storage Systems with Resource Conservation," ACM Transactions on Storage, vol. 1, no. 4, 2005.
[10] A. Ban, "Wear Leveling of Static Areas in Flash Memory," US Patent 6732221, M-Systems, 2004.
[11] L. Chang, "An Efficient Wear Leveling for Large-Scale Flash-Memory Storage Systems," in SAC, 2007.
[12] Y. Chang, J. Hsieh, and T. Kuo, "Endurance Enhancement of Flash-Memory Storage Systems: An Efficient Static Wear Leveling Design," in DAC, 2007.
[13] T. Kgil, D. Roberts, and T. Mudge, "Improving NAND Flash Based Disk Caches," in ISCA, 2008.
[14] J. Mogul, E. Argollo, M. Shah, and P. Faraboschi, "Operating System Support for NVM+DRAM Hybrid Main Memory," in HotOS, 2009.
[15] L.-P. Chang, "Hybrid solid-state disks: combining heterogeneous NAND flash in large SSDs," in ASP-DAC, 2008.
[16] G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, and H. Li, "Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement," in HPCA, 2010.
[17] D. Park, B. Debnath, and D. Du, "CFTL: A Convertible Flash Translation Layer Adaptive to Data Access Patterns," in SIGMETRICS, 2010.
[18] H. Jung, H. Shim, S. Park, S. Kang, and J. Cha, "LRU-WSR: integration of LRU and writes sequence reordering for flash memory," IEEE Transactions on Consumer Electronics, vol. 54, no. 3, 2008.
[19] R. Wood, M. Williams, A. Kavcic, and J. Miles, "The feasibility of magnetic recording at 10 terabits per square inch on conventional media," IEEE Transactions on Magnetics, vol. 45, no. 2, pp. 917–923, 2009.
[20] I. Tagawa and M. Williams, "High density data-storage using shingled-write," in Proceedings of the IEEE International Magnetics Conference (INTERMAG), 2009.
[21] G. Gibson and G. Ganger, "Principles of Operation for Shingled Disk Devices," Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-107, 2011.
[22] J. Wan, N. Zhao, Y. Zhu, J. Wang, Y. Mao, P. Chen, and C. Xie, "High performance and high capacity hybrid shingled-recording disk system," in IEEE International Conference on Cluster Computing (CLUSTER), 2012.
[23] Y. Cassuto, M. A. A. Sanvido, C. Guyot, D. R. Hall, and Z. Z. Bandic, "Indirection systems for shingled-recording disk drives," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.

[24] C.-I. Lin, D. Park, W. He, and D. Du, "H-SWD: Incorporating Hot Data Identification into Shingled Write Disks," in Proceedings of the 20th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2012), August 2012.
[25] D. Park, C.-I. Lin, and D. Du, "H-SWD: A Novel Shingled Write Disk Scheme based on Hot and Cold Data Identification," in FAST, 2012.
[26] D. Park and D. Du, "Hot Data Identification for Flash-based Storage Systems using Multiple Bloom Filters," in Proceedings of the 27th IEEE Symposium on Mass Storage Systems and Technologies (MSST 2011), May 2011, pp. 1–11.
[27] M. Chiang, P. Lee, and R. Chang, "Managing flash memory in personal communication devices," in ISCE, 1997.
[28] L. Chang and T. Kuo, "An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems," in RTAS, 2002.
[29] J. Hsieh, T. Kuo, and L. Chang, "Efficient Identification of Hot Data for Flash Memory Storage Systems," ACM Transactions on Storage, vol. 2, no. 1, 2006.
[30] F. Corbato, "A Paging Experiment with the Multics System," MIT Project MAC Report MAC-M-384, 1968.
[31] L.-P. Chang, T.-W. Kuo, and S.-W. Lo, "Real-time garbage collection for flash-memory storage systems of real-time embedded systems," ACM Transactions on Embedded Computing Systems, vol. 3, no. 4, pp. 837–863, 2004.
[32] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: A survey," Internet Mathematics, 2002.
[33] "University of Massachusetts Amherst Storage Traces," http://traces.cs.umass.edu/index.php/Storage/Storage.
[34] "SNIA IOTTA Repository: MSR Cambridge Block I/O Traces," http://iotta.snia.org/traces/list/BlockIO.
[35] D. Narayanan, A. Donnelly, and A. Rowstron, "Write Off-Loading: Practical Power Management for Enterprise Storage," in FAST, 2008.

ABOUT THE AUTHORS:

Dongchul Park is currently a research scientist in the System Architecture Lab (SAL) at the Samsung R&D Center in San Jose, California. He received his Ph.D. and M.S. degrees in Computer Science and Engineering from the University of Minnesota, Twin Cities, in 2012 and 2006, respectively. During his Ph.D., he was a member of the Center for Research in Intelligent Storage (CRIS) group, advised by Professor David H.C. Du. His research interests lie in storage system design and applications, including non-volatile memories (flash-based SSDs), hot/cold data identification, embedded systems, shingled magnetic recording (SMR) disks, data deduplication, virtualization (Xen, KVM), and Hadoop/big data processing.

Young Jin Nam is currently a principal software engineer at Oracle in San Jose, California. He was a visiting professor in Computer Science and Engineering at the University of Minnesota, Twin Cities, in 2011, and an associate professor in the School of Computer and Information Technology at Daegu University, South Korea, from 2004 to 2012. He received his M.S. and Ph.D. degrees in the Department of Computer Science and Engineering at the Pohang University of Science and Technology (POSTECH), South Korea, in 1994 and 2004, respectively. His research interests cover Data Deduplication, Key-Value Stores, HPC Checkpointing, Storage QoS, High-performance Storage Architecture, SMR Disks, and Phase Change Memory (PCM).

Biplob Debnath is a research staff member in the Storage Systems Group at NEC Laboratories America in Princeton, New Jersey. He received the Ph.D. degree in Electrical and Computer Engineering from the University of Minnesota, Twin Cities, in 2010, and the B.S. degree in Computer Science and Engineering from the Bangladesh University of Engineering and Technology, Bangladesh, in 2002. He was a member of the Advanced Research in Computing Technology and Compilers (ARCTiC) group, advised by Professor David J. Lilja. His research interests include non-volatile memory (flash memory and phase change memory), data deduplication, caching, key-value stores, and file and storage systems.

David H.C. Du (Fellow, IEEE) is currently the Qwest Chair Professor in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities. He is the Center Director of the National Science Foundation I/UCRC Center for Research in Intelligent Storage (CRIS) and was a Program Director (IPA) in the National Science Foundation CISE/CNS Division from March 2006 to August 2008. He received the B.S. degree in mathematics from National TsingHua University, Taiwan, R.O.C., in 1974, and the M.S. and Ph.D. degrees in computer science from the University of Washington, Seattle, in 1980 and 1981, respectively. His research interests include storage systems, cyber security, sensor networks, multimedia computing, and high-speed networking.

Youngkyun Kim has been with the Electronics and Telecommunications Research Institute (ETRI), South Korea, since 1996, where he is a Director of the Storage System Research Team. He received the B.S. and Ph.D. degrees in Computer Science from Chonnam National University, South Korea, in 1991 and 1995, respectively. His research interests include massive high performance storage systems, distributed file systems, internet massive database systems, and non-volatile memory storage.

Youngchul Kim is currently a senior staff engineer in the Department of Cloud Computing at the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea. He received his B.S. and M.S. in Computer Science from Kangwon National University, Chuncheon, South Korea, in 1995 and 1999, respectively. His research interests include cloud computing and distributed systems.
