CSI Journal of Computing | e-ISSN 2277-7091 | Vol. 2, No. 4, 2015

www.csijournal.org

Contents

Preface R K Shyamasundar, Editor-in-Chief

Article R1  New Frontiers in Formal Methods: From Theory to Cyber-Physical Systems, Education and Beyond (Sanjit A. Seshia), 11 pages

Article R2  Static Analysis and Symbolic Code Execution (Dhiren Patel, Madhura Parikh and Reema Patel), 10 pages

Article R3  LFTL: A multi-threaded FTL for a Parallel IO Flash Card under Linux (Srimugunthan, K. Gopinath, Giridhar Appaji Nag Yasa), 11 pages

Article R4  Towards An Executable Declarative Specification of Access Control (N. V. Narendra Kumar and R. K. Shyamasundar), 10 pages

Article R5  An Empirical Study of Popular Matrix Factorization based Collaborative Filtering Algorithms (Shamsuddin N. Ladha, Hari Manassery Koduvely and Lokendra Shastri), 8 pages

Open Access @ www.csijournal.org

Editorial Board

Editor-in-Chief: Prof. R K Shyamasundar, Tata Institute of Fundamental Research

Editorial Board Members

Prof. Srinivasa Aluru, IIT Bombay & Iowa State Univ.
Prof. Subhasis Banerjee, IIT Delhi
Dr. Anup Bhattacharjee, BARC
Prof. Pushpak Bhattacharya, IIT Bombay
Dr. Satish Chandra, IBM Research
Prof. P P Chakrabarti, IIT Kharagpur
Prof. B B Chaudhuri, ISI, Kolkata
Prof. Sajal Das, Univ. of Texas at Arlington
Dr. Sukhendu Das, IIT Madras
Dr. Sudhir Dixit, HP Labs
Prof. Nikil Dutt, UC Irvine
Prof. Naveen Garg, IIT Delhi
Dr. Rahul Garg, Opera Solutions, Noida
Prof. Manoj Gaur, NIT, Jaipur
Prof. Rajesh Gupta, UCSD, La Jolla
Prof. J Haritsa, IISc
Prof. Janakiram, IIT Madras
Dr. Ravi Kannan, Microsoft Research
Dr. Raghu Krishnapuram, IBM Research
Prof. Vipin Kumar, Univ of Minnesota
Prof. Rupak Majumdar, MPI Software Systems
Prof. Bhubaneswar Mishra, Courant Institute of Mathematical Sciences
Prof. Y Narahari, IISc
Prof. K Narayana Kumar, CMI
Dr. Srinivas Padmanabhuni, Infosys Labs
Prof. N R Pal, ISI, Kolkata
Prof. C Pandurangan, IIT Madras
Prof. Dhiren K. Patel, NIT, Surat
Prof. Sushil K Prasad, Georgia State Univ.
Prof. Arun K. Pujari, University of Hyderabad
Prof. Krithi Ramamritam, IIT Bombay
Prof. R Ramanujam, IMSc
Dr. S Ramesh, GM, India Science Lab
Prof. Rajeev Sangal, IIIT, Hyderabad
Prof. Abdul Sattar, Griffith University
Prof. Sanjit Seshia, UC Berkeley
Dr. Lokendra Shastri, Infosys Labs
Prof. Sandeep Shukla, Virginia Tech
Prof. T V Sreenivas, IISc

© Computer Society of India. Copyright protected.

Online Advertisement

The Journal of Computing, published by the Computer Society of India, offers good visibility for online research content on computer science theory, languages & systems, databases, Internet computing, software engineering, and applications. With the potential to deliver a targeted audience regularly with each issue, The Journal of Computing is an excellent medium for promoting your IT products, IT services, conferences, technology recruitment, etc., through advertisement. Additionally, the cost of advertisement in The Journal of Computing is substantially low.

The advertisement rates for The Journal of Computing are as follows:

Printed Edition

India, Inside Pages
  1 insertion:   Full page: ₹10,000 (B&W), ₹20,000 (colour); Half page: ₹6,000 (B&W), ₹12,000 (colour)
  6 insertions:  Full page: ₹50,000 (B&W), ₹100,000 (colour); Half page: ₹30,000 (B&W), ₹60,000 (colour)
  12 insertions: Full page: ₹100,000 (B&W), ₹200,000 (colour); Half page: ₹60,000 (B&W), ₹120,000 (colour)

India, Inside Cover Pages
  1 insertion:   Full page: ₹15,000 (B&W), ₹25,000 (colour)
  6 insertions:  Full page: ₹75,000 (B&W), ₹125,000 (colour)
  12 insertions: Full page: ₹150,000 (B&W), ₹250,000 (colour)

Other Countries, Inside Pages
  1 insertion:   Full page: $300 (B&W), $650 (colour); Half page: $200 (B&W), $325 (colour)
  6 insertions:  Full page: $1500 (B&W), $3000 (colour); Half page: $1000 (B&W), $2000 (colour)

Other Countries, Inside Cover Pages
  1 insertion:   Full page: $450 (B&W), $750 (colour)
  6 insertions:  Full page: $2250 (B&W), $3500 (colour)

Article Title / Page No.

Preface ... 2

R1  New Frontiers in Formal Methods: From Theory to Cyber-Physical Systems, Education and Beyond (Sanjit A. Seshia) ... 3

R2  Static Analysis and Symbolic Code Execution (Dhiren Patel, Madhura Parikh and Reema Patel) ... 14

R3  LFTL: A multi-threaded FTL for a Parallel IO Flash Card under Linux (Srimugunthan, K. Gopinath, Giridhar Appaji Nag Yasa) ... 24

R4  Towards An Executable Declarative Specification of Access Control (N. V. Narendra Kumar and R. K. Shyamasundar) ... 35

R5  A Novel, Decentralized, Local Information based Algorithm for Community Detection in Social Networks (Ramasuri Narayanam & Y. Narahari) ... 45

R6  Mitigation of Jamming Attack in Wireless Sensor Networks (Manju. V.C., Dr. Sasi Kumar) ... 53

R7  An Efficient Approach for Storing and Accessing Huge Number of Small Files in HDFS (Shyamli Rao & Amit Joshi) ... 61

R8  Cross-Referencing Cancer Data Using GPU for Multinomial Classification of Genetic Markers (Abinaya Mahendiran, Srivatsan Sridharan, Sushanth Bhat and Shrisha Rao) ... 67

Letter to the Editor ... 80

CSI Journal of Computing | Vol. 2 • No. 4, 2015

Preface

This is the last issue of Volume 2. The first paper, entitled New Frontiers in Formal Methods: From Theory to Cyber-Physical Systems, Education, and Beyond, is an invited paper by Prof. Sanjit Seshia, University of California, Berkeley, USA. It highlights the challenges of formal methods and machine learning and their application to cyber-physical systems. I thank the author for sparing his valuable time, and I am certain the article will be of immense help to our readers. The rest of the articles are regular papers. It is a pleasure to thank the authors for their contributions.

This is the last issue I am editing for the Journal. I hereby place on record my appreciation for the International Editorial Board and the referees, who have supported the refereeing of the papers submitted to the Journal and have also provided constructive feedback to the authors. The success of any journal depends on contributions from authors, and I am happy to note that there is a constant stream of papers submitted for publication. I thank the authors and wish the Journal to attract more papers and establish its impact in the ICT arena. Last but not least, I thank the ExecCom for its continued support of the Journal.

Prof. R K Shyamasundar
IIT Bombay
Editor-in-Chief, CSI Journal of Computing
May 2015

New Frontiers in Formal Methods: Learning, Cyber-Physical Systems, Education, and Beyond

Sanjit A. Seshia

S. A. Seshia, EECS Department, UC Berkeley, Berkeley, CA 94720-1770 E-mail: [email protected]

We survey promising directions for research in the area of formal methods and its applications, including fundamental questions about the combination of formal methods and machine learning, and applications to cyber-physical systems and education.

Index Terms : Formal Methods, Machine Learning, Cyber-Physical Systems, Education.

1. Introduction Formal methods is a field of computer science and 2. Formal Methods and Inductive Learning CSI JOURNAL COMPUTING, VOL. XX, NO. XX, XXXXXXXXX 2 engineering concerned with the rigorous mathematical A major challenge for formal methods, in the author’s specification, design, and verification of systems [1], [2]. opinion, is to develop effective proof techniques that can At its core, formal methods is about proof: formulating capture the way the best mathematicians and scientists specifications that form proof obligations, designing systems in-depth treatment, the reader is referred to a prior synthesis of three kinds of verification artifacts: articlegeneralize by the from author intuition [12], and [13]. experience to synthesize models, specifications, and auxiliary inputs. There to meet those obligations, and verifying, via algorithmic proof the critical parts of a proof. A major opportunity for formal search, that the systems indeed meet their specifications. methodsWe begin is to by blend examining techniques the from traditional inductive view (machine) of are several examples of these verification artifacts, Over the past few decades, formal methods has made verificationlearning into as traditional a decision deductive problem, procedures with three to inputsmimick including inductive invariants, abstractions, environ- enormous strides. Verification techniques such as model such a process of generalization. In this section, we present checking [3], [4], [5] and theorem proving (see, e.g. [6], [7], (see Figure 1): ment assumptions, input constraints, auxiliary lem- a1) briefA modelexposition of of the these system positions. to be For verified, a more Sin-depth; mas, ranking functions, and interpolants, amongst [8]) are used routinely in the computer-aided design of treatment, the reader is referred to a prior article by the integrated circuits and have been widely applied to find author2) A [12], model [13]. of the environment, E, and others. 
In Sec. 2.1, we illustrate this formulation of bugs in soft- ware, analyze embedded systems, and find 3) TheWe begin property by examining to be the verified, traditionalΦ. view of verification verification in terms of synthesis with two examples. security vulnerabilities. At the heart of these advances are as a decision problem, with three inputs (see Fig. 1): computational proof engines such as Boolean satisfiability The verifier generates as output a YES/NO answer, As we will see, one often needs human insight into solvers [9], Boolean reasoning and manipulation routines indicating1. A model whether of the system or not to beS verified,satisfies S; the property at least the form of these artifacts, if not the artifacts based on Binary Decision Diagrams (BDDs) [10], and Φ 2.in A environment model of the environment,E. Typically, E, and a NO output is themselves, to succeed in verifying the design. In satisfiability modulo theories (SMT) solvers [11]. accompanied3. The property by to a be counterexample, verified, F. also called an. Sec. 2.2, we describe an approach to the synthesis Even as these advances extend the capacity and error trace,The verifier which isgenerates an execution as output of the systema YES/NO that answer, of these artifacts by a combination of inductive applicability of formal methods, several technical challenges indicating whether or not S satisfies the property F in remain, while exciting new application domains are yet to indicatesenvironment how EΦ. Typically,is violated. a NO Other output debugging is accompanied output by learning, deductive reasoning, and human insight in be explored. This article expresses the author’s personal maya counterexample, also be provided. also called Some an error formal trace, verification which is an the form of a “structure hypothesis.” viewpoint on the new frontiers for formal methods, toolsexecution also includeof the system a proof that orindicates certificate how F of is correct- violated. 
highlighting three areas for further research and development. nessOther with debugging a YES output answer, may so also that be an provided. independent Some First, recent results have highlighted interesting conceptual formal verification tools also include a proof or certificate 2.1 Verification by Reduction to Synthesis connections between formal methods and the field of machine systemof correctness can use with the a YES proof answer, to check so that (usually an independent more learning (inductive learning from examples), which merits quickly)system can that use the the property proof to check indeed (usually holds. more quickly) that We consider here a common verification problem: further study. Second, the field of cyber-physical systems the property indeed holds. proving that a certain property is an invariant of – computational systems tightly integrated with physical Property a system — i.e., that it holds in all states of that Φ processes – is growing rapidly, throwing up many problems system. Let us first up some notation. Relevant that can be addressed with formal methods. Finally, the System YES [proof] field of education is undergoing a sea change, with the S M background material may be found in a recent book advent of massive open online courses (MOOCs) and related Environment Compose Verify chapter on modeling for verification [14]. technologies. Formal methods can play a significant role in E NO Let M =(I,δ) be a transition system where I is a improving the state of the art in technologies for education. counterexample logical formula encoding the set of initial states, and In the rest of the paper, we examine each of these frontiers δ is a formula representing the transition relation. for formal methods in somewhat more depth. Fig. 1. FormalFig. 1 : verificationFormal verification procedure. procedure. For simplicity, assume that M is finite-state, so that This traditional view is somewhat high-level and I and δ are Boolean formulas. 
Suppose we want to verify that M satisfies a temporal logic property idealized.CSI Journal In practice, of Computing one does | Vol. not 2 always • No. 4, start2015 . with models S and E — these might have to be Ψ = Gψ where ψ is a logical formula involving no abstracted from implementations or hypothesized temporal operators. We now consider two methods manually. Also, the specification Φ is rarely com- to perform such verification. plete and sometimes inconsistent, as has often been noted in industrial practice. Finally, the figure omits 2.1.1 Invariant Inference certain inputs that are perhaps the most crucial in Consider first an approach to prove this property successfully completing the verification task. For by (mathematical) induction. In this case, we seek example, if performing a proof by mathematical to prove the validity of the following two logical induction, one might need to supply hints to the statements: verifier in the form of auxiliary inductive invariants. Similarly, while performing a proof by abstraction, I(s)= ψ(s) (1) ⇒ one may need to pick the right abstract domain and ψ(s) δ(s, s )= ψ(s ) (2) refinement strategy for generating suitable abstrac- ∧ ⇒ tions. Generating these auxiliary inputs to the proof where, in the usual way, ψ(s) denotes that the logi- procedure are often the trickiest steps in getting ver- cal formula ψ is expressed over variables encoding ification to succeed, as they require insight not only a state s. into the problem but also into the inner workings of Usually, when one attempts a proof by induction the proof procedure. as above, one fails to prove the validity of the The issues raised in the preceding paragraph can second statement, Formula 2. This failure is rarely be distilled into to a single challenge: the effective due to any limitation in the underlying validity CSI JOURNAL COMPUTING, VOL. XX, NO. 
XX, XXXXXXXXX 2 in-depth treatment, the reader is referred to a prior synthesis of three kinds of verification artifacts: article by the author [12], [13]. models, specifications, and auxiliary inputs. There We begin by examining the traditionalCSI view JOURNAL of are COMPUTING, several VOL.examples XX, NO. ofXX, these XXXXXXXXX verification artifacts, 2 verification as a decision problem,CSI JOURNAL with COMPUTING, three inputs VOL. XX,including NO. XX, XXXXXXXXX inductive invariants, abstractions, environ- 2 (see Figure 1): ment assumptions, input constraints, auxiliary lem- in-depth treatment, the reader is referred to a prior synthesis of three kinds of verification artifacts: 1) A model of the system to be verified, S; mas, ranking functions, and interpolants, amongst in-depth treatment,article the reader by the is author referred [12], to [13]. a prior synthesis of threemodels kinds, specifications of verification, and artifactsauxiliary: inputs. There 2) A model of the environment, E, and others. In Sec. 2.1, we illustrate this formulation of article by the authorWe [12], begin [13]. by examining the traditionalmodels, viewspecifications of are several, and auxiliary examples inputs of these. There verification artifacts, 3) The property to be verified, Φ. verification in terms of synthesis with two examples. 
We begin by examiningverification the as a traditional decision problem, view of withare three several inputs examplesincluding of these inductive verification invariants, artifacts, abstractions, environ- As we will see, one often needs human insight into The verifier generates asverification output a YES/NO as a decision(see answer, Figure problem, 1): with three inputs including inductivement invariants, assumptions, abstractions, input constraints, environ- auxiliary lem- S at least the form of these artifacts, if not the artifacts indicating whether or not(seesatisfies Figure 1): the property1) A model of the system to be verified,ment assumptions,S; mas, input ranking constraints, functions, auxiliary and lem- interpolants, amongst Φ E themselves, to succeed in verifying the design. In in environment . Typically,1) A model a NO of output the2) systemA is model to be of verified,the environment,S; Emas,, and ranking functions,others. In and Sec. interpolants, 2.1, we illustrate amongst this formulation of Sec. 2.2, we describe an approach to the synthesis accompanied by a counterexample,2) A model also of called the3) environment,The an propertyE to, and be verified, Φothers.. In Sec. 2.1,verification we illustrate in terms this formulation of synthesis of with two examples. error trace, which is an execution of the system that of these artifacts by a combination of inductive 3) The propertyThe to verifier be verified, generatesΦ. as output a YES/NOverification answer, in termsAs of we synthesis will see, with one two often examples. needs human insight into indicates how Φ is violated. Other debugging output learning, deductive reasoning, and human insight in The verifier generatesindicating as output whether a YES/NO or not answer,S satisfiesAs the we property will see, oneat least often the needs form human of these insight artifacts, into if not the artifacts may also be provided. 
Some formal verification the form of a “structure hypothesis.” indicating whetherΦ orin not environmentS satisfiesE the. Typically, property aat NO least output the form is ofthemselves, these artifacts, to succeed if not the in artifacts verifying the design. In tools also include a proof or certificate of correct- Φ in environmentaccompaniedE. Typically, by a a NOcounterexample, output is themselves, also called to an succeedSec. 2.2, in weverifying describe the an design. approach In to the synthesis ness with a YES answer, so that an independent accompanied byerror a counterexample, trace,2.1 which Verification is also an execution called by an of ReductionSec. the system 2.2, we to that describeSynthesisof these an approach artifacts to by the a synthesis combination of inductive system can use the prooferror to check trace, which (usually is an more execution of the system that of these artifacts by a combination of inductive indicates howWeΦ consideris violated. here Other a common debugging verification output problem:learning, deductive reasoning, and human insight in quickly) that the propertyindicates indeed holds. how Φ is violated. Other debugging output learning, deductive reasoning, and human insight in may also beproving provided. that a Some certain formal property verification is an invariantthe formof of a “structure hypothesis.” may also be provided. Some formal verification the form of a “structure hypothesis.” Property tools also includea system a proof— i.e., or that certificate it holds of in correct- all states of that Φ tools also includeness a proof with or a certificate YES answer, of correct- so that an independent system. Let us first set up some notation. 
Relevant2.1 Verification by Reduction to Synthesis System ness with a YESYESsystem answer, can so use that the an proof independent to check (usually more S [proof] background material may2.1 be found Verification in a recent by book Reduction to Synthesis M system can use thequickly) proof that to check the property (usually indeed more holds. We consider here a common verification problem: Compose Verify chapter on modeling for verification [14]. Environment quickly) that the property indeed holds. We consider hereproving a common that verification a certain property problem: is an invariant of E M =(I,δ) I NO Property Let be a transitionproving system that a where certainis property a is an invariant of counterexample a system — i.e., that it holds in all states of that Property Φ logical formula encoding thea system set of initial — i.e., states, that and it holds in all states of that Φ system. Let us first set up some notation. Relevant System δ is a formula representing theYES transition relation. Fig. 1. Formal verification procedure. S system.[proof] Let us firstbackground set up some material notation. may Relevant be found in a recent book System MYES For simplicity,[proof] assume that M is finite-state, so that S EnvironmentM Compose Verify background materialchapter may on be modelingfound in a for recent verification book [14]. This traditional view is somewhat high-levelCompose andE VerifyI and δ are Boolean formulas.chapter Supposeon modeling we wantfor verification [14]. Environment NO Let M =(I,δ) be a transition system where I is a idealized. In practice, one doesE not always start to verify that M satisfies acounterexample temporal logic property . 
NO Let M =(I,δ)logicalbe a transition formula system encoding where theI setis of a initial states, and Ψ = Gψ where ψ is a logical formula involving no with models S and E — these might have to be counterexample logical formula encodingδ is a formula the set of representing initial states, the and transition relation. Fig. 1. Formaltemporal verification operators. procedure. We now consider two methods abstracted from implementations or hypothesized δ is a formula representingFor simplicity, the assume transition that relation.M is finite-state, so that manually. Also, the specificationFig. 1. FormalΦ is rarely verification com- procedure.to perform such verification. This traditional view is somewhatFor high-level simplicity, and assumeI and thatδ areM is Boolean finite-state, formulas. so that Suppose we want plete and sometimes inconsistent, as has often been This traditionalidealized. view is somewhat In practice, high-level one does and notI and alwaysδ are start Booleanto verify formulas. that M Supposesatisfies we a temporal want logic property noted in industrial practice. Finally, the figure omits 2.1.1 Invariant Inference . idealized. In practice,with models one doesS and notE always— these start mightto verify have that to beM satisfiesΨ = Gψ awhere temporalψ is logic a logical property formula involving no certain inputs that are perhaps the most crucial in . with models S andabstractedE — theseConsider from might implementations first have an to approach be orΨ hypothesized= toG proveψ where thisψtemporal propertyis a logical operators. formula We involving now consider no two methods successfully completing the verification task. For abstracted from implementationsmanually. Also,by (mathematical) the or specificationhypothesized induction.Φ temporalis rarely In this operators. com- case,to we Weperform seek now such consider verification. two methods example, if performing a proof by mathematical manually. 
Also, theplete specification and sometimesto proveΦ is the inconsistent, rarely validity com- of asto the has perform following often been such two verification. logical induction, one might need to supply hints to the plete and sometimesnoted inconsistent, in industrialstatements: as practice. has often Finally, been the figure omits 2.1.1 Invariant Inference verifier in the form of auxiliary inductive invariants. noted in industrialcertain practice. inputs Finally, that the are figure perhaps omits the most2.1.1New crucial Frontiers Invariant in in Inference Formal Methods: Learning, Similarly, while performing a proof by abstraction, I(s)= ψ(s) Consider(1) first an approach to prove this property R1 : 4 certain inputs thatsuccessfully are perhaps completing the most crucial the verification in Cyber-PhysicalConsider⇒ task. first For Systems, anby approach (mathematical) Education, to prove and induction. thisBeyond property In this case, we seek one may need to pick thesuccessfully right abstract completing domain and the verificationψ task.(s) Forδ(s, s )= ψ(s ) (2) example, if performing a proof∧ byby mathematical (mathematical)⇒ to induction. prove the In validity this case, of the we followingseek two logical refinement strategy for generatingexample, suitable if performing abstrac- a proof by mathematical This traditional view is somewhatinduction, high-level one might and need to supplyto provehints to the the validitystatements: of the following two logical tions. Generating these auxiliaryinduction, inputs one to might the proof need towhere, supply in the hints usual toformulaway, the ψ( iss) expresseddenotes over that variables the logi- encoding a state s. idealized. In practice, one doesverifier not always in the formstartCSI ofwith JOURNAL auxiliary COMPUTING, inductivestatements: VOL. invariants. XX, NO.CSI XX, JOURNAL XXXXXXXXX COMPUTING, VOL. XX, NO. 
XX, XXXXXXXXX 3 3 procedure are often the trickiest steps in getting ver- cal formula ψ is expressedUsually,CSI over JOURNAL when variables COMPUTING,one attempts encoding VOL. XX, a NO.proof XX, XXXXXXXXX by induction 3 verifier in the formSimilarly, of auxiliary while inductive performing invariants. a proof by abstraction, I(s)= ψ(s) (1) ification to succeed,models S as and they E require– these insightmight have not onlyto be abstracteda state s .from as above, one fails to prove the validity of the second ⇒ implementationsSimilarly, or hypothesized while performingone manually. may need a proof toAlso, pick bythe the abstraction, right abstract domain and I(s)= ψ(s) (1) into the problem but also into the inner workings of Usually,checkers when forstatement, one Formula attemptscheckers 2.Formula Instead, a proof forcheckers 2. it Formula by isThis induction usually for failure 2. Formula Instead, be- is rarely⇒ψ 2. it(s Instead,Given is) due usuallyδ( s, ato it s specificationany is be-)= usually ψGiven be-(sΩ ) and a specificationaGiven class(2) a of specificationΩ and a classΩ and of a class of specificationF one is rarely may complete need to andCSI pickrefinement JOURNALsometimes the right COMPUTING, strategy inconsistent, abstract VOL. for domain XX, generating NO.limitation XX, and XXXXXXXXX suitable in the abstrac-underlyingψ(s) validityδ(s, s checkers )= forψ(∧ sFormula ) 2. ⇒ 3 CSICSI JOURNAL JOURNAL COMPUTING, COMPUTING, VOL. VOL. XX, XX, NO. NO. XX, XX, XXXXXXXXX XXXXXXXXX as above,cause2 one2 the fails hypothesized to provecause invariant thecause hypothesized validityψ is the “not hypothesized of strong invariant the ψ invariantisformal “notψ artifactsstrongis “not(2)Σ strong, doesformal there artifactsformal existΣ an artifacts, does thereΣ, does exist there an exist an the proof procedure.as has often beenrefinement noted in strategy industrial for practice. 
generating Finally, suitable the abstrac-Instead, it is usually because the∧ hypothesized⇒ invariant is tions. Generating theseenough.” auxiliary More precisely, inputsenough.” toψ theneeds More proofenough.” to precisely, bewhere, Moreconjoinedψ precisely, inneeds the usual toψelement beneeds way, conjoined oftoψΣ be(sthat) conjoineddenotes satisfieselement thatΩ? of theelementΣ logi-that satisfies of Σ thatΩ? satisfies Ω? The issuesfigure raised omits in the certain preceding inputs paragraphthat are perhaps can thesecond most statement,crucial “not Formula strong enough.” 2. This failureMore precisely, is rarely needs to be conjoined tions. Generatingcheckersprocedure these auxiliary for are Formula often inputs(strengthened) 2. the Instead,to trickiest the proof it with stepsis usually anotherwhere,(strengthened) in getting be- formula, in(strengthened) the ver-usual withGiven knowncal another way, aformula as withspecification theψ formula,(s another)Toψdenotes makeis known expressed formula,Ω thingsand that as a theconcrete, theknown class over logi- of as variables we the instantiateTo encoding make the things symbols concrete, we instantiate the symbols in-depthin-depth treatment, treatment, the the reader reader is is referred referredbe to todistilled a a prior prior intoinsynthesis successfullysynthesis to a singleof ofcompleting three challenge: three kinds kinds the the verification of of effectiveverificationverification task.due artifactsFor artifacts toexample, any: : limitation (strengthened) in the with underlying another formula, validity known as the auxiliary To make things concrete, we instantiate the symbols procedure are oftencauseification the the trickiest hypothesized to succeed, stepsauxiliary in as invariant getting they inductive requireψ ver-is “not invariant insightcalauxiliary strong formula. not inductiveauxiliary onlyψ isformal expresseda invariant inductive state artifactss.. invariant overaboveΣ, variables does. 
We begin by examining the traditional view of verification as a decision problem, with three inputs (see Figure 1):

1) A model of the system to be verified, S;
2) A model of the environment, E; and
3) The property to be verified, Φ.
The verifier generates as output a YES/NO answer, indicating whether or not S satisfies the property Φ in environment E. Typically, a NO output is accompanied by a counterexample, also called an error trace, which is an execution of the system that indicates how Φ is violated. Other debugging output may also be provided. Some formal verification tools also include a proof or certificate of correctness with a YES answer, so that an independent system can use the proof to check (usually more quickly) that the property indeed holds.

Fig. 1. Formal verification procedure.
This traditional view is somewhat high-level and idealized. In practice, one does not always start with models S and E; these might have to be abstracted from implementations or hypothesized manually. Also, the specification Φ is rarely complete and sometimes inconsistent, as has often been noted in industrial practice. Moreover, the verifier often needs auxiliary inputs: when performing a proof by mathematical induction, one might need to supply hints to the verifier in the form of auxiliary inductive invariants, and when performing a proof by abstraction, one may need to pick the right abstract domain and refinement strategy for generating suitable abstractions. Generating these auxiliary inputs to the proof procedure is often the trickiest step in getting verification to succeed, and it requires insight not only into the problem but also into the inner workings of the proof procedure.
The issues raised in the preceding paragraph can be distilled into a single challenge: the effective synthesis of three kinds of verification artifacts: models, specifications, and auxiliary inputs. There are several examples of these verification artifacts, including inductive invariants, abstractions, environment assumptions, input constraints, auxiliary lemmas, ranking functions, and interpolants, amongst others. In Sec. 2.1, we illustrate this formulation of verification in terms of synthesis with two examples. As we will see, one often needs human insight into at least the form of these artifacts, if not the artifacts themselves, to succeed in verifying the design. In Sec. 2.2, we describe an approach to the synthesis of these artifacts by a combination of inductive learning, deductive reasoning, and human insight in the form of a "structure hypothesis."
2.1 Verification by Reduction to Synthesis

We consider here a common verification problem: proving that a certain property is an invariant of a system, i.e., that it holds in all states of that system. Let us first set up some notation; relevant background material may be found in a recent book chapter on modeling for verification [14].

Let M = (I, δ) be a transition system, where I is a logical formula encoding the set of initial states and δ is a formula representing the transition relation. For simplicity, assume that M is finite-state, so that I and δ are Boolean formulas. Suppose we want to verify that M satisfies a temporal logic property Ψ = G ψ, where ψ is a logical formula involving no temporal operators. We now consider two methods to perform such verification.

The first method is proof by mathematical induction: one shows that ψ holds in every initial state and is preserved by every transition, i.e., that the following two formulas are valid:

I(s) ⇒ ψ(s)    (1)
ψ(s) ∧ δ(s, s′) ⇒ ψ(s′)    (2)

Usually, when one attempts a proof by induction as above, one fails to prove the validity of the second statement, Formula 2. This failure is rarely due to any limitation in the underlying validity checkers for Formula 2. Instead, it is usually because the hypothesized invariant ψ is "not strong enough": ψ needs to be conjoined with (strengthened by) another formula, known as an auxiliary inductive invariant. More precisely, the problem of verifying whether the system satisfies the invariant property reduces to the problem of synthesizing an auxiliary (propositional) invariant φ such that the following two formulas are valid:

I(s) ⇒ φ(s) ∧ ψ(s)    (3)
φ(s) ∧ ψ(s) ∧ δ(s, s′) ⇒ φ(s′) ∧ ψ(s′)    (4)

If no such φ exists, then it means that the property ψ is not an invariant of M, since at a minimum a formula φ characterizing all reachable states of M should satisfy Formulas 3 and 4 above.
2.1.2 Abstraction-based Model Checking

Given the original system M, one seeks to compute an abstract transition system α(M) = (Iα, δα) such that α(M) satisfies Ψ if and only if M satisfies Ψ. In other words, instead of directly verifying whether M satisfies Ψ, we verify whether α(M) satisfies Ψ. This approach is computationally advantageous when the process of computing α(M) and then verifying whether it satisfies Ψ is significantly more efficient than the process of directly verifying M in the first place. We do not seek to describe in detail what abstractions are used or how they are computed; the only point we emphasize here is that the process of computing the abstraction is itself a synthesis task: we seek to synthesize an abstraction function α such that α(M) satisfies Ψ if and only if M satisfies Ψ.
suitable techniques the We additional author elaborate in previously constraints on solution termed techniques in Fig. 1. Formal verificationEnvironment Environment procedure. M In other words, instead of directly verifying Ψ if and only if M satisfies Ψ”. E E recentForFor simplicity, book simplicity, chapter assume assumeon modeling that that thatforMis verificationis the finite-state, finite-state, processtageous [14]. ofwhen so computingsowhether that thethatsatisfies processM thethenΨsatisfies. of abstraction This we( computingIα verify approach,δΨα,) such we iswhetherα a seek is( thatφM computationallywhetherparadigmsatisfiescharacterizing)α to aand((MM synthesize) satisfiesMsatisfiesΨ for. Thissatisfies synthesis advan-all an approachΨ. reachableifΨsciduction and, wethe in only is recent seek followingcomputationally states ifandM to times offormalized synthesizeare section.M imposed. hassciductionshould advan- been [12]. an Wesciductionand Inthe elaborate thisformalized following section,and on solution section.formalized [12].we In techniques this [12]. section, In in this we section, we NO NO Let M =(LetI,δM) be=( a transitionI,δ) be a transitionsystem where systemI is where a I is a whether M satisfies Ψ, we seek to synthesize an Note that this is a reduction in the complexity- This traditional view is somewhat high-level andcounterexampleI Iandcounterexampleδ δare Boolean formulas.synthesis Supposethen task. verifying we wantabstraction whethertageous it function when satisfiessatisfies theαΨ processsuchisΨ significantly. This that ofsatisfyabstractiontotageouscomputing approachα( combineM Formulas) satisfies when is functionα induction, computationally(M the 3Ψ) andand processαpresent 4such deduction, above. 
of thata advan- computing briefα and(M introduction)the structuresatisfiesα followingpresent(M) hy-and toΨ a somebrief section.present of introduction the a brief key ideas introduction to some of to the some key ofideas the key ideas This traditional view is somewhat high-level and Letand Mlogical = are(I , formulad Boolean) logicalbe a transition encoding formula formulas. encodingthesystem set Suppose of where theinitial set I states,is of we a initial logical want and states, and abstraction function2.2α such Integrating that α(M) Induction,satisfies Ψ Deduction,and and computability-theoretic sense: the verification to verify that M satisfies a temporalIn othermore logic words, efficient property insteadif than andthen theonly of verifying process2.1.3 directly if M tageousReduction whether ofsatisfies verifying directly when itΨ satisfiesto, verifying theandifSynthesispotheses;then and process thenΨ verifying onlyis we an significantly of if verifyapproach computingM whethersatisfiesin the thisitα satisfies author(ΨM approach.,) andand previouslyΨ thenis significantly wein termed verify this approach.in this approach. idealized.idealized. In In practice, practice, one one does does not not always always start startformulato. verify encodingδ is that a formula theMδ is setsatisfies a representing formulaof initial a temporal representingstates, the transitionand logic d the is a transitionproperty relation. formula relation. if and only if M satisfiesStructureΨ, and2.2 then we Integrating verify Induction,problem2.2 Deduction, Integrating has a solution and Induction, if and only Deduction, if the synthesis and Fig. 1. FormalFig. 1. verification Formal verification procedure. procedure. . whether M insatisfies the firstwhetherΨ, place.more we seekα efficient( WeM to) dosatisfies synthesizethen notthan verifying seek theΨ. 
process anto describe whethersciductionmore of directly efficient itα( satisfiesMand) verifyingsatisfies than formalizedΨ theisΨInduction significantlyprocess. [12]. of Inis directly this the section, process verifying we of inferringInduction a generalis the process of inferring a general withwith models modelsSSandandEE—— these these might might have have to to be berepresentingΨΨ==GGForψψwhere simplicity,thewhere transitionForψψis simplicity,is assume a alogical relation. logical that assume formulaM formulaForis that finite-state,simplicity, involvingM involvingis finite-state, assume so no that no so that Givenwhether the twoα examples2.1.2(M) satisfies Abstraction-based above,Ψ .let us nowStructure step Model back Checking and2.2 IntegratingInductionproblemStructureis Induction, the has process one. TheDeduction, of inferring synthesis aproblem and general is not any abstractioninfunction detail whatα such abstractionsM in that theα( firstM aremore) place.satisfies used, efficient WeorΨ how do thanpresentM not they thein seek the processaWe brief tofirst have describe introduction of place. directlysolaw far We ordiscussed verifying todo principle some not seek of how fromthe to verification keylaw describe observation ideas or principlelaw can of or be particular from principle observation in- from observation of particular of in- particular in- abstracted fromThis implementations traditionalThis traditional view oris somewhat hypothesized view is somewhat high-levelthattemporal high-leveltemporal and M isI finite-state, operators.and and operators.δ areI and Booleanso We Weδthatare now nowI formulas.and Boolean consider considerd are formulas. SupposeBoolean two two methods methodsweformulas. Suppose want we wantformalize the generalAnother notion common of performing approach verification to1 solving the Structure by invariant easier to1 solve, in the theoretical sense, than the abstracted from implementations or hypothesized M in the firstin place. 
this Weapproach.performed do not seek bystances. reduction toWe describe haveMachine to so synthesis. far learning discussed An algorithms1 effective howstances.We verification have areMachine typically so far can discussed learning be algorithms how verification are typically can be idealized.idealized. In practice, In practice,one does one not does alwaysSuppose not start always weto verify startwant to thatto verify verifyM satisfies that that MMif a andsatisfiessatisfies temporal onlyare a computed.ifa logictemporal temporalM propertysatisfies2.1.3 logic Thelogicin detailΨ onlyproperty Reduction, andreduction point what then we abstractions to we emphasize2.1.3to Synthesis synthesis. verify Reductionverification2.1.3 are herein used,detail is Reduction to or problemwhat Synthesis how abstractions they isto based Synthesis on aresound used, and or complete howstances. theyMachineoriginal verification learning algorithms problem. are However, typically the synthe- manually.manually. Also, Also, the the specification specificationΦΦisis rarely rarely com- com- toto perform perform. such such verification. verification.. Inductionparadigmis the for processinductive, synthesisperformed of generalizing inferring in by recentWe reduction a have general times from so to (labeled) hasfar synthesis.inductive,performed discussed been examples Angeneralizing howby effective reduction toverification from to synthesis.(labeled)can be examples An effective to with modelswithS modelsand E S—and theseE — might these have mightproperty to be have Ψ to= beGψ Ψwhere= G ψψ iswhereis aa logicallogicalwhetherψ is formula aformula logicalαthat(M the) involving satisfiesinvolving formula processΨ involving noare no. of computed. 
computing no Supposein The the detail only abstraction the what point original abstractionsare we is computed.verification emphasize a are The here used,problem only is or point is: how we they emphasizeM inductive, here is generalizing from (labeled) examples to plete and sometimes inconsistent, as has often been Given the two examplesGiven above, theabstraction.Given two let examples us the now Given two step examplesabove, the original let above, us system now let step usperformed, one now seeks step bysis reduction version toof thesynthesis. problem An may effective be easier to solve in plete and sometimes inconsistent, as has often beentemporal temporaloperators.temporal operators. We now operators.consider We now two consider We methodssynthesis now two consider totask. methods perform twothat methods the processare computed.of computing Thelawthat the only or the abstraction principle pointto process combine we from of emphasize is computing induction,obtain a observationparadigm a here learned deduction, the is offor abstraction particularconcept synthesisobtain and structure isor in- ina aclassifier learned recentobtainparadigm hy- timesaconcept[17], learned for has [18]. synthesisorconcept beenclassifier inor recent[17],classifier [18]. times[17], has [18]. been abstractedabstracted from implementations from implementations or hypothesized or hypothesized back and formalizeGiven theback generala andsystem notionformalizetoback compute M and of 1 performingthe (composition formalize an general abstract the notion general transition of of system performing notion system and of performing α(M)= practice, especially if suitable additional constraints notednoted in in industrial industrial practice. practice. 
Finally, Finally, the the figure figure omits omits 2.1.12.1.1 Invariant Invariant Inference Inference2.1.3 ReductionIn other to words, Synthesissynthesis instead task.that of directly the process verifyingstances.synthesis of computingpotheses;Machine task. the anlearning abstractionDeduction approachto algorithmscombine, the is on a author induction, theparadigm are other previouslyDeduction typically deduction, hand, for, synthesisDeduction termedto on involves combine the and inother, structure the onrecent induction, usethe hand, hy- timesother involves deduction, hand,has been the involves and use structure the use hy- manually.manually. Also, the Also, specification the specificationΦ is rarelyΦsuch com-is rarelyverification.to com- performCSI JOURNALto such perform COMPUTING, verification. such VOL. verification. XX, NO. XX, XXXXXXXXXverificationenvironment by reduction models), to synthesis.(verificationIα does,δα) Msuch satisfy by that reduction αa (specificationM) satisfies to synthesis.3 Ψ ?if and only if M are imposed. We elaborate on solution techniques in certaincertain inputs inputs that that are are perhaps perhaps the the most most crucial crucial in in whether M satisfiesInΨ other, we words, seeksynthesisverification to instead synthesize task. inductive, ofbyIn reductionan directly othersciduction generalizing words,verifying to synthesis.andof instead from formalizedgeneralpotheses; (labeled) of rules directly an [12]. 
approachto and examples combine Inof verifying axioms this general the section, to induction, author toof rulespotheses; infer general we previously andconclusions deduction, an rules axioms approach termed and and to axiomsinfer structure the author conclusions to hy- infer previously conclusions termed plete andplete sometimes and sometimes inconsistent, inconsistent, as has often asConsider hasbeenConsider often been first first an an approach approachGiven to to prove prove the this two this propertyexamples propertySuppose above, the let originalThe us approach now verification step insatisfies theSuppose problemtwoΨ examples. This the is: originalapproach above verification is to computationally reduce problem this advan- is: the following section. successfully completing the verification task. For 2.1.1 Invariant Inference abstraction functionwhetherα suchM satisfies thatInα( otherMSupposeΨ),satisfies we words,obtain seekthewhether originalΨ instead to a synthesizepresentM learned verificationsatisfies of a directlyconcept brief anaboutΨ introduction, problemsciduction we verifyingor particular seekclassifier is: toand to problempotheses; synthesize some formalized[17],about of [18].instances. the an particular an keyapproach [12].aboutsciduction ideas Traditional In problem particular this theauthor section,and instances. for- problemformalized previously we Traditionalinstances. [12]. termed In Traditional this for- section, for- we successfullynoted completing in industrialnoted the in industrial practice. verification Finally, practice. task. the Finally, figure For by the omitsby figure(mathematical) (mathematical)2.1.1 omits Invariant2.1.1 induction. induction. Invariant Inferenceback In InferenceIn thisand this formalize case, case, we we the seek seekgeneral notionquestion of performing to the formtageous below: when the process of computing α(M) and checkers for Formula 2. 
2.1.1 Invariant Inference

Consider first an approach to prove this property by (mathematical) induction. In this case, we seek to prove the validity of the following two logical statements:

    I(s) ⇒ ψ(s)                                  (1)
    ψ(s) ∧ δ(s, s′) ⇒ ψ(s′)                      (2)

Usually, when one attempts a proof by induction as above, one fails to prove the validity of the second statement, Formula 2. This failure is rarely due to any limitation in the underlying validity checkers for Formula 2. Instead, it is usually because the hypothesized invariant ψ is "not strong enough." More precisely, ψ needs to be conjoined (strengthened) with another formula, known as the auxiliary inductive invariant. The problem of verifying whether the system satisfies an invariant property thus reduces to the problem of synthesizing an auxiliary invariant φ (a propositional formula) such that the following two formulas are valid:

    I(s) ⇒ φ(s) ∧ ψ(s)                           (3)
    φ(s) ∧ ψ(s) ∧ δ(s, s′) ⇒ φ(s′) ∧ ψ(s′)       (4)

If no such φ exists, then it means that the property ψ is not an invariant of M, since at a minimum a φ characterizing all reachable states of M should satisfy Formulas 3 and 4 above.
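As a concrete illustration, Formulas 1-4 can be checked by brute force on a toy finite-state system. The system below (two Boolean variables with x′ = x xor y, y′ = y), the property ψ = ¬x, and the auxiliary invariant φ = ¬y are hypothetical choices of ours, not taken from the text: ψ holds on all reachable states yet fails Formula 2 on its own, while conjoining φ makes Formulas 3 and 4 valid.

```python
from itertools import product

# Toy finite-state system M = (I, delta) over Boolean state (x, y).
# The system and both invariants are hypothetical illustrative choices.
STATES = list(product([False, True], repeat=2))

def init(s):               # I(s): the single initial state x=0, y=0
    return s == (False, False)

def delta(s, t):           # delta(s, s'): x' = x xor y, y' = y
    return t == ((s[0] != s[1]), s[1])

def psi(s):                # property to prove invariant: "x is always false"
    return not s[0]

def phi(s):                # candidate auxiliary inductive invariant: "y is false"
    return not s[1]

def formula1():            # I(s) => psi(s)
    return all(psi(s) for s in STATES if init(s))

def formula2():            # psi(s) and delta(s, s') => psi(s')
    return all(psi(t) for s in STATES if psi(s)
                      for t in STATES if delta(s, t))

def formula3():            # I(s) => phi(s) and psi(s)
    return all(phi(s) and psi(s) for s in STATES if init(s))

def formula4():            # phi, psi and delta imply phi', psi'
    return all(phi(t) and psi(t) for s in STATES if phi(s) and psi(s)
                                 for t in STATES if delta(s, t))

# psi alone is not inductive (Formula 2 fails), but phi strengthens it.
assert formula1() and not formula2()
assert formula3() and formula4()
```

In practice the four validity checks would be discharged by a SAT or SMT solver rather than by enumerating states; the structure of the checks is unchanged.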
2.1.2 Abstraction-based Model Checking

Another common approach to solving the invariant verification problem is based on sound and complete abstraction. Given the original system M, one seeks to compute an abstract transition system α(M) = (Iα, δα) such that α(M) satisfies Ψ if and only if M satisfies Ψ. This approach is computationally advantageous when the process of computing α(M) and then verifying whether it satisfies Ψ is significantly more efficient than the process of directly verifying M in the first place. We do not seek to describe in detail what abstractions are used, or how they are computed. The only point we emphasize here is that the process of computing the abstraction is a synthesis task. In other words, instead of directly verifying whether M satisfies Ψ, we seek to synthesize an abstraction function α such that α(M) satisfies Ψ if and only if M satisfies Ψ, and then we verify whether α(M) satisfies Ψ.
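One simple instance of this idea is localization abstraction: retain only the variables the property mentions and existentially abstract away the rest. The sketch below is our own toy illustration (the system, the choice to drop y, and the brute-force reachability check are all assumptions, not a construction from the text); note that this simple projection yields an over-approximation, which is sound for proving G ψ, whereas the "if and only if" guarantee discussed above generally requires refining the abstraction.

```python
from itertools import product

# Hypothetical toy system: state (x, y), x' = x, y' = not y, init (0, 0).
STATES = list(product([False, True], repeat=2))

def init(s):
    return s == (False, False)

def delta(s, t):
    return t == (s[0], not s[1])

# psi = "x is always false" mentions only x, so localize to x:
# abstract away y by existential projection.
def alpha(s):
    return s[0]

abs_init = {alpha(s) for s in STATES if init(s)}
abs_delta = {(alpha(s), alpha(t))
             for s in STATES for t in STATES if delta(s, t)}

# Model check G psi on alpha(M) by abstract reachability analysis.
reach = set(abs_init)
frontier = set(abs_init)
while frontier:
    frontier = {b for (a, b) in abs_delta if a in frontier} - reach
    reach |= frontier

abs_psi_holds = all(not x for x in reach)   # psi is checkable on abstract states
assert reach == {False} and abs_psi_holds   # property proved on the abstraction
```

The abstract system here has two states instead of four; on realistic systems with many variables, this kind of projection is what makes model checking tractable.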
2.1.3 Reduction to Synthesis

Given the two examples above, let us now step back and formalize the general notion of performing verification by reduction to synthesis. Suppose the original verification problem is:

    Given a system M (composition of system and environment models), does M satisfy a specification Ψ?

The approach in the two examples above is to reduce this question to the form below:

    Given a specification Ω and a class of formal artifacts Σ, does there exist an element of Σ that satisfies Ω?

To make things concrete, we instantiate the symbols above in the two examples.

Invariant inference. In this case, Ψ = G ψ. Σ is the set of all Boolean formulae over the (propositional) state variables of M. Ω comprises Formulas 3 and 4 above.

Abstraction-based verification. In this case, again, Ψ = G ψ. Σ is the set of all abstraction functions α corresponding to a particular abstract domain [15]; for example, all possible localization abstractions [16]. Ω is the statement "α(M) satisfies Ψ if and only if M satisfies Ψ".

Note that this is a reduction in the complexity- and computability-theoretic sense: the verification problem has a solution if and only if the synthesis problem has one. The synthesis problem is not any easier to solve, in the theoretical sense, than the original verification problem. However, the synthesis version of the problem may be easier to solve in practice, especially if suitable additional constraints are imposed. We elaborate on solution techniques in the following section.
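The invariant-inference instantiation can be turned directly into a search over Σ: enumerate candidate formulae φ and check the specification Ω, that is, Formulas 3 and 4. The sketch below is our own illustration on the same hypothetical toy system (x′ = x xor y, y′ = y, ψ = ¬x); Σ is represented concretely as all 16 Boolean functions over the two state variables, given by their truth tables.

```python
from itertools import product

# Reduction to synthesis for invariant inference (hypothetical toy system):
# Sigma = all Boolean functions over state (x, y); Omega = Formulas 3 and 4.
STATES = list(product([False, True], repeat=2))

def init(s):
    return s == (False, False)

def delta(s, t):                    # x' = x xor y, y' = y
    return t == ((s[0] != s[1]), s[1])

def psi(s):                         # property: "x is always false"
    return not s[0]

def omega(phi):                     # Omega: Formulas 3 and 4 for candidate phi
    f3 = all(phi(s) and psi(s) for s in STATES if init(s))
    f4 = all(phi(t) and psi(t) for s in STATES if phi(s) and psi(s)
                               for t in STATES if delta(s, t))
    return f3 and f4

def candidates():                   # enumerate Sigma by truth tables
    for table in product([False, True], repeat=len(STATES)):
        truth = dict(zip(STATES, table))
        yield lambda s, t=truth: t[s]

solution = next(phi for phi in candidates() if omega(phi))
# The first solution found is a phi true exactly on the initial state.
assert [s for s in STATES if solution(s)] == [(False, False)]
```

Enumerating truth tables only works for tiny state spaces; it illustrates why, as the text notes, the synthesis formulation is no easier in theory, and why additional structural constraints on Σ matter in practice.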
2.2 Integrating Induction, Deduction, and Structure

We have so far discussed how verification can be performed by reduction to synthesis. An effective paradigm for synthesis in recent times has been to combine induction, deduction, and structure hypotheses, an approach the author previously termed sciduction and formalized [12]. In this section, we present a brief introduction to some of the key ideas in this approach.

Induction is the process of inferring a general law or principle from observation of particular instances.¹ Machine learning algorithms are typically inductive, generalizing from (labeled) examples to obtain a learned concept or classifier [17], [18].

¹ The term "induction" is often used in the verification community to refer to mathematical induction, which is actually a deductive proof rule. Here we are employing "induction" in its more classic usage, arising from the field of Philosophy.

Deduction, on the other hand, involves the use of general rules and axioms to infer conclusions about particular problem instances. Traditional formal verification and synthesis techniques, such as model checking or theorem proving, are deductive.
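The notion of inductive generalization can be made concrete with a tiny sketch: learning the most specific conjunction of literals consistent with a set of positive examples, in the style of classic concept-learning algorithms such as FIND-S. The states, labels, and learned concept below are our own illustrative choices, not examples from the text.

```python
# Hypothetical sketch of inductive generalization: learn the most specific
# conjunction of literals consistent with positive examples (FIND-S style).

def learn_conjunction(positives):
    """Keep exactly the (variable index, value) literals true in ALL positives."""
    literals = {(i, v) for i, v in enumerate(positives[0])}
    for example in positives[1:]:
        literals &= {(i, v) for i, v in enumerate(example)}
    return literals

def satisfies(example, literals):
    return all(example[i] == v for i, v in literals)

# Positive examples of states (x, y, z) observed to satisfy some property.
positives = [(True, False, False), (True, False, True)]
concept = learn_conjunction(positives)

# The learner generalizes to "x and not y", dropping the literal on z.
assert concept == {(0, True), (1, False)}
assert satisfies((True, False, True), concept)
assert not satisfies((False, False, True), concept)
```

The generalization step is exactly where an inductive conclusion outruns its premises: the learned concept classifies states never observed.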

One may wonder whether inductive reasoning is out of place here, since the truth of an inductive argument's premises makes its conclusion only likely or probable, not certain. However, one observes that humans often employ a combination of inductive and deductive reasoning while performing verification or synthesis. For example, while proving a theorem, one often starts by working out examples and trying to find a pattern in the properties satisfied by those examples. The latter step is a process of inductive generalization. These patterns might take the form of lemmas or background facts that then guide a deductive process of proving the statement of the theorem from known facts and previously established theorems (rules). The process usually iterates between inductive and deductive reasoning until the final result is obtained.

Sciduction [13], [12] is a formalization of such a combination of inductive and deductive reasoning. The key in integrating induction and deduction is the use of structure hypotheses: mathematical hypotheses used to define the class of artifacts to be synthesized within the overall verification or synthesis problem.

A common sciductive approach is to formulate the overall problem as one of active learning using a query-based model. Active learning is a special case of machine learning in which the learning algorithm can control the selection of the examples that it generalizes from, and can query one or more oracles to obtain both examples and labels for those examples. (Typically the labels are positive or negative.) We refer the reader to a paper by Angluin [21] for an overview of various models for query-based active learning.

The query oracles are often implemented using deductive procedures such as model checkers or satisfiability solvers. Thus, the overall synthesis algorithm usually comprises a top-level inductive learning algorithm, constrained by the structure hypothesis, that invokes deductive procedures (query oracles). When the top-level strategy is inductive, we typically refer to the approach simply as "inductive synthesis" (even when deductive procedures are used). There are two important choices one must make to fix an inductive synthesis algorithm: (1) the search strategy: how should one search the concept class? and (2) the example selection strategy: which examples do we learn from?
if constrains and Sciduction Sciduction only if inductive the constrains constrains synthesis inductiveCounterexample-guided inductive Counterexample-guidedCounterexample-guided inductive synthesis inductive inductive synthesis synthesis (strengthened) with another formula, known as the To make things concrete, we instantiate the2.1.2 symbols Abstraction-basedproblem Model has a solution Checking if andand only deductiveproblem if theand synthesis reasoning hasand deductive one. deductive The using reasoning synthesis reasoningstructure usingproblem hypotheses, using structure structure is not hypotheses,any(CEGIS) hypotheses, [22](CEGIS) shown(CEGIS) in [22] [22]Figure shown shown 2 inis in perhaps Figure Figure 2 the is2 is perhaps perhaps the the 2.1.2 Abstraction-based Model Checking auxiliary inductive invariant. above in the two examples. AnotherAbstraction-based common approachproblem verification. to has solving one. The theIn synthesis invariant thisand case, problem activelyeasier again, to combinesisand not solve,anddeductive actively any actively in inductive the combinesreasoning combines theoretical and inductiveusing deductive inductive sense, structure and rea- than and deductive hypotheses, deductivemost the popular rea- rea- andmost approach activelymost popular popular to inductive approach approach synthesis to inductiveto inductive today. synthesis synthesis today. today. More precisely, the problem of verifying whether InvariantAnother inference. common approachIn this case, to solvingΨ=verification theG ψ invariant. 
S problem is theeasieris set based toof on solve,allsound abstraction in and the complete theoretical functionssoning:original sense, for a instance,soning: verification thansoning:combines the deductivefor for instance, problem.inductive instance, techniques deductive However, and deductive deductive generate techniques the techniques synthe- reasoning:The generate generatedefining forThe instance, aspectThe defining defining of CEGIS aspect aspect is of itsof CEGIS CEGISexample is is its its example example system satisfies an invariant property reduces to the Σ is theverification set of problem all Boolean is based formulae on soundcorrespondingabstraction. over and the complete Given to the aoriginal original particular verification system abstractM problem., one domain seeksexamples However, sis [15]; version for theexamples learning, synthe-examplesdeductive of the problem for and for techniques learning, inductive learning, may be and reasoninggenerate easierand inductive inductive to examplessolve is reasoningselectionin reasoning for is learning,strategy: isselectionselection learning and strategy: strategy: fromlearning counterexampleslearning from from counterexamples counterexamples problem of synthesizing an auxiliary invariant φ (propositional)abstraction. state Given variables the original of M. systemΩfortocomprises computeexample,M, one seeks an all abstract possiblesis version transition localization of the system problem abstractionsα(M may)=used be to practice easier [16]. guide toused, the especiallysolveusedinductive deductive toin guideto guide if reasoning theengines. suitable the deductive deductive additionalIt is is used beyond engines. engines. to constraints guide the It is It providedbeyond isthe beyond deductive the by theprovided a engines.provided verification by by a oracle averification verification. The oracle learning oracle. The. 
The learning learning scope of this paper to explain the concept of sciduc- algorithm, whichalgorithm,algorithm, is initialized which which is with is initialized initialized a particular with with a particulara particular such that the following two formulas are valid: Formulasto compute 3 and 4 above. an abstract transitionW( systemIisα ,δtheα) statementsuchα(M that)= αa((M)practiceM) satisfiessatisfies, especially Ψ if and if onlysuitableonly ifif M additional satisfiesare imposed. constraintsscopescopeIt is We of beyond ofthis elaborate this paper the paper on toscope explainto solution explain of this the techniques thepaper concept concept to ofinexplain ofsciduc- sciduc- the concept of tion in general.tiontion However, in in general. general. we However, provide However, below we we provide providean choice below below of an an conceptchoicechoice class of of concept and concept possibly class class and with and possibly anpossibly with with an an Abstraction-based(Iα,δα) such that verification.α(M) satisfiesInΨsatisfies”. thisif and case, onlyΨ. This if M approachare imposed. is computationally We elaborate advan- on solutionthe following techniquessciduction section. in in general. However, we provide below an intuitive I(s)= φ(s) ψ(s) (3) intuitive explanationintuitiveintuitive illustrated explanation explanation by illustrated counterexample- illustrated by by counterexample- counterexample-initial set ofinitial (positive)initial set set of examples, of (positive) (positive) proceeds examples, examples, by proceeds proceeds by by ⇒ ∧ again, satisfiesΨ=GΨψ. This. Σ is approach the set is of computationally all abstractiontageousNote whenthat advan- thethis processtheis a following reduction of computing section. 
in theα( Mcomplexity) and and explanation illustrated by counterexampleguided abstraction φ(s) ψ(s) δ(s, s )= φ(s ) ψ(s ) (4) tageousα when the process of computingthen verifyingα(M) and whether it satisfies Ψ is significantlyguided abstractionguidedguidedrefinement refinement abstraction abstraction (CEGAR) (CEGAR) refinement refinement [19], [19], (CEGAR)a a (CEGAR)very very effective [19],searching [19],a approach very a very thesearching space tosearching model of thecandidate the space space ofconcepts of candidate candidate for conceptsone concepts for for one one ∧ ∧ ⇒ ∧ functions corresponding to a particularcomputability-theoretic abstract sense: the verification problem2.2 Integratinghas Induction, Deduction, and domainthen [15]; verifying for example, whether all it possible satisfiesa localization Ψmoresolutionis significantly efficient if and than only the if processthe synthesis of directly problem verifying haseffective one. The approach effectiveeffectivechecking. to model approach approach checking. to tomodel model checking. checking.that is consistentthatthat is with is consistent consistent the examples with with the seen the examples soexamples far. seen seen so so far. far. If no such φ exists, then it means that the property 2.2 Integrating Induction, Deduction,Structure and abstractionsmore [16].efficientΩ is than the statementthe process “α of(synthesisM directlyM) satisfiesin the verifyingproblem first place. is not Weany doeasier not to seek solve, to in describe the theoretical There may beThereThere several may may such be be several consistent several such such concepts, consistent consistent concepts, concepts, ψ is not an invariant of M, since at a minimum a Structure Ψ if andM onlyin the if M firstsatisfies place.Ψ We”. do notsense,in seek detail than to describe what the original abstractions verification are used, problem. 
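To make the inductive-invariant check of Formulas 3 and 4 concrete, here is a minimal brute-force sketch. None of this code is from the paper, and real tools discharge these checks with a SAT/SMT solver rather than by enumeration; the toy system (a mod-256 counter) is chosen so that ψ alone is not inductive but becomes inductive once strengthened with an auxiliary invariant φ.

```python
# Brute-force check of Formulas 3 and 4 on a tiny transition system
# (illustrative sketch; real tools use a SAT/SMT solver, not enumeration).
#
# System M: a counter x over 0..255 with x' = (x + 2) mod 256, I: x = 0.
# Property psi: x != 5. psi alone is not inductive (x = 3 satisfies psi
# but steps to 5 when x is odd is allowed by psi), so we strengthen it
# with the auxiliary invariant phi: x is even.

STATES = range(256)

def init(x):      return x == 0               # I(s)
def delta(x, x2): return x2 == (x + 2) % 256  # delta(s, s')
def psi(x):       return x != 5               # property to prove invariant
def phi(x):       return x % 2 == 0           # candidate auxiliary invariant

# Formula 3: I(s) => phi(s) /\ psi(s)
base = all(phi(x) and psi(x) for x in STATES if init(x))

# Formula 4: phi(s) /\ psi(s) /\ delta(s, s') => phi(s') /\ psi(s')
step = all(phi(x2) and psi(x2)
           for x in STATES if phi(x) and psi(x)
           for x2 in STATES if delta(x, x2))

print(base and step)  # True: phi /\ psi is inductive, hence G psi holds
```

Note that replacing `phi(x)` with `True` makes the second check fail (state 3 satisfies ψ but its successor 5 does not), which is exactly the sense in which ψ is “not strong enough” on its own.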
2.2 Integrating Induction, Deduction, and Structure

We have so far discussed how verification can be performed by reduction to synthesis. An effective paradigm for synthesis in recent times has been to combine induction, deduction, and structure hypotheses; an approach the author previously termed sciduction and formalized [12]. In this section, we present a brief introduction to some of the key ideas in this approach.

Induction is the process of inferring a general law or principle from observation of particular instances.1 Machine learning algorithms are typically inductive, generalizing from (labeled) examples to obtain a learned concept or classifier [17], [18]. Deduction, on the other hand, involves the use of general rules and axioms to infer conclusions about particular problem instances. Traditional formal verification and synthesis techniques, such as model checking or theorem proving, are deductive. One may wonder whether inductive reasoning is out of place here, since typically an inductive argument only ensures that the truth of its premises makes its conclusion likely or probable. However, one observes that humans often employ a combination of inductive and deductive reasoning while performing verification or synthesis. For example, while proving a theorem, one often starts by working out examples and trying to find a pattern in the properties satisfied by those examples. The latter step is a process of inductive generalization. These patterns might take the form of lemmas or background facts that then guide a deductive process of proving the statement of the theorem from known facts and previously established theorems (rules). The process usually iterates between inductive and deductive reasoning until the final result is obtained.

Sciduction [13], [12] is a formalization of such a combination of inductive and deductive reasoning.2 The key in integrating induction and deduction is the use of structure hypotheses, mathematical hypotheses used to define the class of artifacts to be synthesized within the overall verification or synthesis problem. Sciduction constrains inductive and deductive reasoning using structure hypotheses, and actively combines inductive and deductive reasoning: for instance, deductive techniques generate examples for learning, and inductive reasoning is used to guide the deductive engines. It is beyond the scope of this paper to explain the concept of sciduction in general. However, we provide below an intuitive explanation illustrated by counterexample-guided abstraction refinement (CEGAR) [19], a very effective approach to model checking.

2.2.1 Sciduction and Counterexample-Guided Inductive Synthesis

In sciduction, one combines an inductive inference procedure I, a deductive engine D, and a structure hypothesis H to obtain a synthesis algorithm. The structure hypothesis H syntactically constrains the space of artifacts being searched. This constraint is also referred to as syntax guidance [20]. In machine learning, such a restricted space is called the concept class, and each element of that space is often called a candidate concept. A common sciductive approach is to formulate the overall problem as one of active learning using a query-based model. Active learning is a special case of machine learning in which the learning algorithm can control the selection of examples that it generalizes from, and can query one or more oracles to obtain both examples and labels for those examples. (Typically the labels are positive or negative.) We refer the reader to a paper by Angluin [21] for an overview of various models for query-based active learning.

The query oracles are often implemented using deductive procedures such as model checkers or satisfiability solvers. Thus, the overall synthesis algorithm usually comprises a top-level inductive learning algorithm, constrained by the structure hypothesis, that invokes deductive procedures (query oracles). When the top-level strategy is inductive, we typically refer to the approach simply as “inductive synthesis” (even when deductive procedures are used). There are two important choices one must make to fix an inductive synthesis algorithm: (1) search strategy: how should one search the concept class? and (2) example selection strategy: which examples do we learn from?

Counterexample-guided inductive synthesis (CEGIS) [22], shown in Figure 2, is perhaps the most popular approach to inductive synthesis today. The defining aspect of CEGIS is its example selection strategy: learning from counterexamples provided by a verification oracle. The learning algorithm, which is initialized with a particular choice of concept class and possibly with an initial set of (positive) examples, proceeds by searching the space of candidate concepts for one that is consistent with the examples seen so far. There may be several such consistent concepts, and the search strategy determines the chosen candidate. This candidate is then presented to the verification oracle, which checks it against the correctness specification. If the candidate is correct, the synthesizer terminates and outputs this candidate. Otherwise, the verification oracle returns a counterexample to the learning algorithm, which adds the counterexample to its set of examples and repeats its search. It is possible that, after some number of iterations of this loop, the learning algorithm may be unable to find a candidate concept consistent with its current set of (positive/negative) examples, in which case the learning step, and hence the overall CEGIS procedure, fails.

1. The term “induction” is often used in the verification community to refer to mathematical induction, which is actually a deductive proof rule. Here we are employing “induction” in its more classic usage arising from the field of Philosophy. 2. sciduction stands for structure-constrained induction and deduction.

CSI Journal of Computing | Vol. 2 • No. 4, 2015

Fig. 2. Counterexample-Guided Inductive Synthesis (CEGIS). [Figure: the learning algorithm, initialized with a concept class and initial examples, proposes a candidate concept to the verification oracle; if verification succeeds, the loop outputs the candidate; otherwise a counterexample is returned to the learner; if the learner finds no consistent candidate, learning fails.]

The inductive engine I is an algorithm to learn a new abstraction function from a spurious counterexample. Consider the case of localization abstraction. One approach in CEGAR is to walk the lattice of abstraction functions, from most abstract (hide all variables) to least abstract (the original system). This problem can be viewed as a form of learning based on version spaces [17], although the traditional CEGAR refinement algorithms are somewhat different from the learning algorithms proposed in the version spaces framework. Gupta, Clarke, et al. [23] have previously observed the link to inductive learning and have proposed versions of CEGAR based on alternative learning algorithms (such as induction on decision trees).
If the candidate is procedureƒƒ link,The a deductivedeductive to inductive engineengine learning, for and finite-state a struc- andhave model proposedchecking, ALGORITHM ORACLE Isystem). This problemD can be viewedcorrect, as a the form synthesizer of terminates and outputs this ture hypothesisversionscomprisesto ofobtain the CEGAR model a synthesis checker based algorithm. onand alternativea SAT solver. learning The Counterexample H candidate. Otherwise, the verification oracle returns The structurealgorithmslearningmodel hypothesis checker based syntactically (such is invoked on as version induction constrainson the spacesabstract the on decision [17],model althoughto check trees). hence the overall CEGIS procedure, fails. a counterexample to the learning algorithm, which space of artifactsthethe traditionalproperty being searched. of interest, CEGAR This while constraint refinement the SAT is solver algorithms is used to are Several search strategies are possible, and the The deductive engine , for finite-stateadds the modelcounterexample to its set of examples Learning Fails Learning Succeeds also referred• check to as syntaxif a counterexample guidance [20]. isD In spurious. machine choice depends on the application domain; see [20] checking,somewhat differentcomprises from the the model learning checkerand algorithmsrepeats and its a search. It is possible that, after Fig. 2. : Counterexample-Guided Inductive Synthesislearning, suchAs a a restricted point of contrast, space is calledthe cone-of-influence the concept approach [5] of proposed in the version spaces framework.some number Gupta, of iterations of this loop, the learning forFig. a 2. more Counterexample-Guided detailed discussion.(CEGIS) Inductive Syn-class, andcomputing eachSAT element solver. an abstract of The that spacemodel model isis oftena checker purely called deductive is invoked approach. on the Clarke, et al. 
[23] have previouslyalgorithm observed may the be unable to find a candidate concept thesis (CEGIS) It isabstract often the modelcase that to the check abstract the model property computed of via interest, cone- consistent with its current set of (positive/negative) Several search strategies are possible, and the choice2. sciductionof-influencelink stands to for is inductive stootructure-c detailedonstrained learningto provide induction the and and efficiency have proposedbenefits of 2.2.2 Counterexample-Guided Abstraction-deduction. while the SAT solver is used toexamples, check in if which a case the learning step, and depends on the application domain; see [20] for a more abstraction.versions of CEGAR based on alternative learning Refinementdetailed discussion. counterexample is spurious. hence the overall CEGIS procedure, fails. algorithms (such as induction on decision trees). Counterexample-guided abstraction refinement (CE- 2.2.3As Directions a point offor contrast,Future Work the cone-of-influence ap- 2.2.2Several Counterexample-Guided search strategies Abstraction- are possible, and the The deductive engine , for finite-state model GAR) [19] is a precursor to the CEGIS approach proachThere [5] are offour computing major differences an abstract between modeltraditional is a choiceRefinement depends on the application domain; see [20] • checking, comprises theD model checker and a described above. CEGAR solves a synthesis sub- machinepurely deductive learning algorithms approach. (and It theiris often applications) the case and that for aCounterexample-guided more detailed discussion. abstraction refinement SAT solver. The model checker is invoked on the task of generating abstract models that are sound the ones abstract used model in formal computed verification via and cone-of-influence synthesis. 
First, (CEGAR) [19] is a precursor to the CEGIS approach described in manyabstract traditional model uses to checkof machine the learning, property the of learning interest, (theyabove. containCEGAR solves all behaviors a synthesis of subtask the original of generating system) is too detailed to provide the efficiency benefits of 2.2.2 Counterexample-Guided Abstraction- algorithmwhile operates the SAT by drawing solver (labeled) is used examples to check at random if a andabstractprecise models(any that arecounterexample sound (they contain for all the behaviors abstract fromabstraction. some source with no control over the examples it Refinement counterexample is spurious. modelof the original is also system) a counterexample and precise (any counterexample for the original for draws. In contrast, the form of learning in verification and the abstract model is also a counterexample for the original system).Counterexample-guided One can view abstraction CEGAR as refinement an instance (CE- of synthesisAs a pointmore commonly of contrast, tends the tocone-of-influence be active, where theap- system). One can view CEGAR as an instance of sciduction learning2.2.3 algorithm Directions actively for Futureselects the Work examples it learns GAR)as follows: [19] is a precursor to the CEGIS approach proach [5] of computing an abstract model is a described above. CEGAR solves a synthesis sub- from.purelyThere Second, are deductive four in traditional major approach. differences machine It learning, is between often concept the traditional case classes that System Initial Abstract tend to be simpler than the ones considered in verification. 
+Property Abstraction Domain task of generatingFunction abstract models that are sound Inthemachine verification abstract learning model and algorithms synthesis, computed we(and via are their cone-of-influence usually applications) interested in (they contain all behaviors of the original system) synthesizingisand too the detailed ones fairly used to general provide in formallogical the artifacts efficiencyverification and benefitsprograms. and syn- of

Abstract Model Invoke Valid Third, in traditional machine learning, the learning and preciseGenerate(any counterexample for the abstractDone abstraction.thesis. First, in many traditional uses of machine + Property Model algorithm tends to be a special-purpose algorithm designed model isAbstraction also a counterexampleChecker for the original learning, the learning algorithm operates by drawing for the specific concept class being learned. On the other Counter- system). One can view CEGARexample as an instance of hand,(labeled) in inductive examples synthesis, at random the learning from algorithm some sourcetends New Abstraction Function 2.2.3 Directions for Future Work towith be a nogeneral control purpose over logical the reasoning examples engine, it draws.such as a In Check SATThere or SMT are foursolver. major Fourth, differences while machine between learning traditional is almost Refine Initial YES NO System SpuriousAbstract Counterexample: contrast, the form of learning in verification and Fail AbstractionAbstraction Done +Property CounterexampleDomain Spurious? entirelymachine data-driven, learning inductive algorithms learning (and in their formal applications) verification FunctionFunction synthesis more commonly tends to be active, where typicallyand the operates ones used via a in combination formal verification of data-driven and and syn- SYNTHESIS VERIFICATION model-driventhe learning algorithms. algorithm actively selects the exam- Abstract Model Invoke Valid thesis. First, in many traditional uses of machine Generate + Property Model Done plesThese it learns differences from. mean Second, that there in traditional is much more machine to be Fig. 3 : Counterexample-guidedAbstraction abstractionChecker refinement Fig. 3. Counterexample-guided(CEGAR) as inductive synthesis. 
abstraction re- donelearning, to develop theconcept learninga theory classes of algorithm inductive tend to learning operates be simpler in bythe drawingthancontext the Counter- example of(labeled) formal methods. examples Specifically, at random one needs from to understand some source the finementNew Abstraction (CEGAR) Function as inductive synthesis. ones considered in verification. In verification and ƒƒ The abstract domain, which defines the form of the impactsynthesis,with noof different control we are logical overusually reasoning the interested examples engines inon itaccuracy synthesizing draws. and In abstraction function, is the structureCheck hypothesis. For efficiency of learning, as well as the properties of a learning Refine YES NO sciduction as follows: Spurious Counterexample: contrast, the form of learning in verification and Fail Abstraction Done algorithmfairly general that might logical make artifacts it well-suited and for programs. use in an overall Third, example, in verifyingCounterexample digital circuits,Spurious? one might use Function synthesis more commonly tends to be active, where Thelocalizationabstract abstraction domain ,[16], which in which defines abstract the states form are of verificationin traditional or synthesis machine task. learning, the learning algo- • the learning algorithm actively selects the exam- thecubes abstraction overSYNTHESIS the state function, variables. is the structureVERIFICATION hypothe- rithmIn this tends regard, to be there a special-purposeare two results obtained algorithm by the de- sis. For example, in verifying digital circuits, one signedples it learnsfor the from. specific Second, concept in class traditional being machine learned. Fig.might 3. 
Counterexample-guided use localization abstraction abstraction [16], in which re- Onlearning, the other concept hand, classes in inductive tend to synthesis, be simpler the than learn- the finementCSIabstract Journal (CEGAR) states of Computing are cubesas inductive | over Vol. the 2 synthesis.• stateNo. 4, variables.2015 ingones algorithm considered tends in verification. to be a general In verification purpose logical and synthesis, we are usually interested in synthesizing sciduction as follows: fairly general logical artifacts and programs. Third, The abstract domain, which defines the form of in traditional machine learning, the learning algo- • the abstraction function, is the structure hypothe- rithm tends to be a special-purpose algorithm de- sis. For example, in verifying digital circuits, one signed for the specific concept class being learned. might use localization abstraction [16], in which On the other hand, in inductive synthesis, the learn- abstract states are cubes over the state variables. ing algorithm tends to be a general purpose logical Sanjit A. Seshia R1 : 7
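To make the CEGIS loop of Section 2.2.1 concrete, here is a minimal sketch in Python. It is illustrative only and not from the paper: the concept class (integer threshold functions), the finite input domain, and the exhaustive learner and verifier are invented assumptions chosen so the whole loop fits in a few lines.

```python
# Toy CEGIS loop (illustrative sketch, not the paper's implementation).
# Concept class (structure hypothesis): threshold functions f(x) = (x >= c)
# for an unknown integer c. The verification oracle checks a candidate
# against the specification and, if it is wrong, returns a counterexample.

SPEC_C = 7                      # hidden "correct" threshold (the spec)
DOMAIN = range(100)

def spec(x):
    return x >= SPEC_C

def verify(candidate_c):
    """Verification oracle: return a counterexample input, or None if correct."""
    for x in DOMAIN:
        if (x >= candidate_c) != spec(x):
            return x
    return None

def learn(examples):
    """Learner: return any threshold consistent with all labeled examples."""
    for c in range(101):
        if all((x >= c) == label for x, label in examples):
            return c
    return None  # concept class cannot fit the examples: CEGIS fails

def cegis():
    examples = []                      # labeled (input, spec output) pairs
    candidate = learn(examples)        # initial guess from the empty set
    while True:
        cex = verify(candidate)
        if cex is None:
            return candidate           # verified: synthesis succeeded
        examples.append((cex, spec(cex)))  # learn from the counterexample
        candidate = learn(examples)
        if candidate is None:
            raise RuntimeError("learning step failed")

print(cegis())  # prints 7
```

Each iteration either verifies the candidate or strengthens the example set, mirroring the learner/oracle interaction of Figure 2; in realistic settings both the learner and the verifier would be SAT/SMT-based rather than exhaustive.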

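The CEGAR loop of Section 2.2.2 admits a similarly small sketch, again purely illustrative: the three-variable transition system, the "bad state" property, and the refinement rule (expose one more hidden variable per iteration, i.e., one step down the abstraction lattice) are invented for this example, and exhaustive enumeration stands in for the model checker and SAT solver.

```python
# Toy CEGAR with localization abstraction (illustrative sketch only).
# Hiding variables yields an over-approximate abstract model: it has a
# transition between two abstract states iff some concretizations do.
from itertools import product

VARS = (0, 1, 2)
INIT = (0, 0, 0)

def step(s):
    """Concrete transition: a small deterministic update (x2 never changes)."""
    x0, x1, x2 = s
    return (x1, x0 ^ 1, x2)

def bad(s):
    return s == (1, 1, 1)            # unreachable here, since x2 stays 0

def concretize(abs_state, visible):
    """All concrete states agreeing with abs_state on the visible variables."""
    for s in product((0, 1), repeat=3):
        if all(s[v] == abs_state[i] for i, v in enumerate(visible)):
            yield s

def abstract(s, visible):
    return tuple(s[v] for v in visible)

def model_check(visible):
    """Search the abstract over-approximation; return an abstract cex path or None."""
    start = abstract(INIT, visible)
    frontier, paths = [start], {start: [start]}
    while frontier:
        a = frontier.pop()
        if any(bad(s) for s in concretize(a, visible)):
            return paths[a]
        for s in concretize(a, visible):
            b = abstract(step(s), visible)
            if b not in paths:
                paths[b] = paths[a] + [b]
                frontier.append(b)
    return None

def spurious(path, visible):
    """A cex is spurious if no concrete run from INIT follows it to a bad state."""
    reach = {INIT} & set(concretize(path[0], visible))
    for a in path[1:]:
        reach = {step(s) for s in reach} & set(concretize(a, visible))
    return not any(bad(s) for s in reach)

def cegar():
    visible = []                     # most abstract point: hide all variables
    while True:
        cex = model_check(visible)
        if cex is None:
            return ("property holds", visible)
        if not spurious(cex, visible):
            return ("real counterexample", cex)
        # Refine: expose one more variable (walk down the lattice).
        visible = visible + [v for v in VARS if v not in visible][:1]

print(cegar())  # prints ('property holds', [0, 1, 2])
```

Here the inductive step (refinement from a spurious counterexample) is the crude "expose the next variable" walk described in the text; learning-based refinements would instead pick variables implicated by the spurious trace.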
In this regard, there are two results obtained by the author and colleagues that provide some initial insights. First, the notion of teaching dimension, elucidated by Goldman and Kearns [24], provides a lower bound on the number of examples needed by inductive synthesis techniques (such as CEGIS) to converge to the correct concept [25]. Further, one can formally analyze CEGIS to understand the impact of the quality of counterexamples provided by the verifier on the overall convergence to the correct concept [26]. Given the centrality of counterexample-guided learning in formal methods today, these results provide an initial basis for developing deeper theoretical insights into its operation, and provide guidance for the use of other learning techniques for formal methods. An extended treatment of these results along with a theoretical framework may be found in [53].

3. Cyber-Physical Systems

Cyber-physical systems (CPS) are computational systems that are tightly integrated with the physical world [27]. Depending on the characteristics of CPS that are emphasized, they are also variously termed as embedded systems, the Internet of Things (IoT), the Internet of Everything (IoE), or the Industrial Internet. Examples of CPS include today's automobiles, fly-by-wire aircraft, medical devices, power generation and distribution systems, building control systems, robots, and many other systems. CPS have been around for a long time, but it is only recently that the area has come together as an intellectual discipline. Many CPS operate in safety-critical or mission-critical settings, and therefore it is important to gain assurance that they will operate correctly, as per specification. Thus, formal methods is an essential tool in the design of CPS. However, CPS also operate in highly dynamic and uncertain environments, and therefore they must be designed to be robust and adaptive. This imposes additional challenges on the modeling, verification, and synthesis of such systems. We examine in this section some of the main challenges in CPS design and how formal methods can help in addressing these.

3.1 Distributed Cyber-Physical Swarms

One of the biggest trends in CPS today is the large-scale networking of devices. The author, along with several colleagues, is involved in a multiyear, multi-institution effort to develop systematic scientific and engineering principles for addressing the huge potential (and associated risks) of pervasive integration of smart, networked sensors and actuators into our connected world [28]. These distributed CPS are being termed the "swarm" and are identified to have the following important characteristics [28]:

• Large-scale: the swarm comprises a vast number of nodes generating corresponding "big data;"
• Distributed: components of the swarm are typically networked, and are potentially separated physically and/or temporally;
• Cyber-physical: the swarm fuses computational processes with the physical world;
• Dynamic: the environment evolves continually;
• Adaptive: the system must adapt to its dynamic environment, and thus the distinction between "design-time" and "run-time" is blurred; and
• Heterogeneous: swarm components are of various types, requiring interfacing and interoperability across multiple platforms and models of computation.

This unique combination of characteristics necessitates advances in several topics in formal specification, verification and synthesis. In particular, there is a need for correct-by-construction synthesis techniques that allow quick adaptation and deployment of components into the swarm while ensuring that certain key design requirements are always satisfied.

In this section, we give the reader a flavor of the problems in the design of distributed cyber-physical systems through a sample problem: multi-robot motion planning for complex requirements specified in temporal logic.

Multi-Robot Motion Planning with Temporal Logic Objectives

In robotics, the traditional motion planning problem is to move a robot from Point A to Point B while avoiding obstacles. However, more recently, there is growing interest in extending this problem along two dimensions. The first extension is to impose more complex requirements on the robot, such as visiting certain locations "infinitely often." Such requirements can be conveniently specified in a formal notation such as linear temporal logic (LTL). The second extension is to handle swarms of many robots executing coordinated plans. Such problems arise in many application settings, including persistent surveillance, search and rescue, formation control, and aerial imaging. More complex requirements require more sophisticated methods to ensure that the synthesized plans are provably correct. Scaling planning algorithms to larger swarms requires more efficient algorithms and design methodologies.

Recent work [30] addresses these challenges with a two-pronged approach. First, a compositional approach is employed, where pre-characterized motion primitives, based on well-known control algorithms, are used as a component library. Each motion primitive is specified in a suitable combination of logical theories. Second, using an encoding similar to the one used for bounded model checking, a satisfiability modulo theories (SMT) solver [11] is used to find a composition of motion primitives that achieves the desired LTL objectives. Figure 4 depicts a sample result of this approach, showing the top view of four nano-quadrotor robots achieving a desired LTL specification.

Fig. 4 : Compositional SMT-Driven Multi-Robot Motion Planning: (a) Top view of sample execution and associated simulation, and (b) Nano-quadrotor platform from KMel Robotics [29] (reproduced from [30]).

These results are only a small first step. There are many more problems that remain to be solved, including inferring effective logical characterizations of motion primitives, handling dynamic and uncertain environments, dealing with non-linear encodings, incremental planning, and scaling up to an order of magnitude more robots.

3.2 Human-in-the-Loop Systems

Many cyber-physical systems are interactive, i.e., they interact with one or more human beings, and the human operators' role is central to the correct working of the system. Examples of such systems include fly-by-wire aircraft control systems (interacting with a pilot), automobiles with "self-driving" features (interacting with a driver), and medical devices (interacting with a doctor, nurse, or patient). We refer to the control in such systems as human-in-the-loop control systems. The costs of incorrect operation in the application domains served by these systems can be very severe. Human factors are often the reason for failures or "near failures", as noted by several studies (e.g., [31], [32]). Correct operation of these systems depends crucially on two design aspects: (i) interfaces between human operator(s) and autonomous components, and (ii) control strategies for such human-in-the-loop systems.

A particularly interesting domain is that of automobiles with "self-driving" features, otherwise also termed "driver assistance systems". Such systems, already capable of automating tasks such as lane keeping, navigating in stop-and-go traffic, and parallel parking, are being integrated into high-end automobiles. However, these emerging technologies also give rise to concerns over the safety of an ultimately driverless car. For these reasons, the field of semi-autonomous driving is a fertile application area for formal methods.

In this section, we give an overview of some of the challenges in the design of human-in-the-loop CPS (h-CPS), including:

• Modeling: What distinguishes a model of a h-CPS from a typical CPS?
• Specification: How does the formal specification change for a h-CPS?
• Verification: What new verification problems arise from the human aspect?
• Synthesis: What advances in controller synthesis are required for h-CPS?

For lack of space, we only give a brief preview of the main ideas. Further detail can be obtained in the papers by Li et al. [33] and Sadigh et al. [34].

3.2.1 Modeling

For a typical (fully autonomous) control system, there are three entities: the plant being controlled, the controller, and the environment in which they operate. In an h-CPS, we additionally have the human operator(s) and therefore, also need a subsystem that mediates between the human operator(s) and the autonomous controller.

Formally, we model a h-CPS as a composition of five types of entities (components) [33], as illustrated in Fig. 5. The first is the plant, the entity being controlled. In the case of automobiles, these are the sub-systems that perform the various driving maneuvers under either manual or automatic control. The second is the human operator (or operators); e.g., the driver in an automobile. For simplicity, this discussion uses a single human operator, denoting her by HUMAN. The third entity is the environment. HUMAN perceives the environment around her and takes actions based on this perception and an underlying behavior model. We denote by HP HUMAN's perception of the environment and by HA the actions by HUMAN to control the plant. In the case of a fully-autonomous system, the human operator is replaced by an autonomous controller (AUTO, for short). In practice, each specialized function of the system may be served by a separately designed autonomous controller; however, for the sake of simplicity, we model the autonomous controller as a single component. AUTO perceives the environment through sensors (denoted by ES, "environment sensors") and provides control input to the plant.

Fig. 5 : Structure of a Human Cyber-Physical System. ES denotes environment sensing, HS denotes human sensing, HP denotes human perception, HA denotes human actions, and u denotes the vector of control inputs to the plant.

3.2.1 Modeling

For a typical (fully autonomous) control system, there are three entities: the plant being controlled, the controller, and the environment in which they operate. In an h-CPS, we additionally have the human operator(s) and therefore also need a sub-system that mediates between the human operator(s) and the autonomous controller.

Formally, we model a h-CPS as a composition of five types of entities (components) [33], as illustrated in Fig. 5. The first is the plant, the entity being controlled. In the case of automobiles, these are the sub-systems that perform the various driving maneuvers under either manual or automatic control. The second is the human operator (or operators); e.g., the driver in an automobile. For simplicity, this discussion uses a single human operator, denoting her by HUMAN. The third entity is the environment. HUMAN perceives the environment around her and takes actions based on this perception and an underlying behavior model. We denote by HP HUMAN's perception of the environment and by HA the actions by HUMAN to control the plant. In the case of a fully-autonomous system, the human operator is replaced by an autonomous controller (AUTO, for short). In practice, each specialized function of the system may be served by a separately designed autonomous controller; however, for the sake of simplicity, we model the autonomous controller as a single component. AUTO perceives the environment through sensors (denoted by ES, "environment sensors") and provides control input to the plant.

The distinctive aspect of h-CPS arises from its partial autonomy. We capture this by including a fifth component, the advisory controller (ADVISOR, for short) [33]. The function of ADVISOR is to mediate between HUMAN and AUTO. This mediation can take different forms. For instance, ADVISOR may decide when to switch from full control by HUMAN to full control by AUTO, or vice-versa. ADVISOR may also decide to combine the control inputs from HUMAN and AUTO to the plant in a systematic way that achieves design requirements. This is indicated in Fig. 5 by the yellow box adjoining the plant, between it and the AUTO and HUMAN components. We note that for legal and policy reasons, it may not be possible in many applications (including driving) for ADVISOR to always take decisions that override HUMAN. Hence, we use the term ADVISOR, indicating that the form of control exercised by ADVISOR may, in some situations, only provide suggestions to HUMAN as to the best course of action.
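The five-component structure of Fig. 5 can be sketched in code. The following toy model is purely illustrative (the 1-D dynamics, the scripted human behavior, and the speed-bound switching rule are all invented for this sketch, not the formal model of Li et al. [33]); it shows one possible mediation policy in which ADVISOR passes HUMAN's command through unless it violates a bound, in which case AUTO takes over.

```python
# Toy sketch of the five-component h-CPS structure of Fig. 5.
# All dynamics and the switching rule are invented for illustration.

class Plant:
    """1-D vehicle: position updated by a speed command u."""
    def __init__(self):
        self.position = 0.0

    def step(self, u):
        self.position += u

class Auto:
    """AUTO: proportional control toward a setpoint, using
    environment sensing (ES) of the plant position."""
    def __init__(self, setpoint):
        self.setpoint = setpoint

    def command(self, es_position):
        return 0.1 * (self.setpoint - es_position)

class Human:
    """HUMAN: issues actions HA from a scripted behavior model."""
    def __init__(self, actions):
        self.actions = iter(actions)

    def command(self):
        return next(self.actions, 0.0)

class Advisor:
    """ADVISOR: mediates between HUMAN and AUTO -- here, it defers
    to AUTO only when the human command exceeds a speed bound."""
    def __init__(self, max_speed):
        self.max_speed = max_speed

    def mediate(self, u_human, u_auto):
        return u_human if abs(u_human) <= self.max_speed else u_auto

plant, auto = Plant(), Auto(setpoint=10.0)
human, advisor = Human([0.5, 0.5, 9.0]), Advisor(max_speed=1.0)
for _ in range(3):
    u = advisor.mediate(human.command(), auto.command(plant.position))
    plant.step(u)
# The third human command (9.0) violates the bound, so AUTO's
# command (0.1 * (10.0 - 1.0) = 0.9) reaches the plant instead.
```

A real ADVISOR would, of course, be synthesized against a formal specification rather than hard-coded; the sketch only illustrates where such a component sits in the composition.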

3.2.2 Specification

h-CPS have certain unique requirements which need to be formalized as formal specifications for verification and control. We illustrate these by the example of semi-autonomous driving.

Recognizing both the safety issues and the potential benefits of vehicle automation, in 2013 the U.S. National Highway Traffic Safety Administration (NHTSA) published a statement that provides descriptions and guidelines for the continual development of these technologies [35]. In particular, the statement defines five levels of automation, ranging from vehicles without any automated control systems (Level 0) to vehicles with full automation (Level 4). We focus on Level 3, which describes a mode of automation that requires only limited driver control:

"Level 3 - Limited Self-Driving Automation: Vehicles at this level of automation enable the driver to cede full control of all safety-critical functions under certain traffic or environmental conditions and in those conditions to rely heavily on the vehicle to monitor for changes in those conditions requiring transition back to driver control. The driver is expected to be available for occasional control, but with sufficiently comfortable transition time. The vehicle is designed to ensure safe operation during the automated driving mode." [35]

Essentially, this mode of automation stipulates that the human driver can act as a fail-safe mechanism and requires the driver to take over control should something go wrong. The challenge, however, lies in identifying the complete set of conditions under which the human driver has to be notified ahead of time. Based on the NHTSA statement, we have identified [33] four important criteria required for a human-in-the-loop controller to achieve this level of automation.

1. Effective Monitoring. The advisory controller should be able to monitor all information about the h-CPS and its environment needed to determine if human intervention is needed.
2. Conditional Correctness. When the autonomous controller is in control (and not the human operator), it must satisfy a given formal specification (e.g., provided in temporal logic).
3. Prescience. The advisory controller must determine if the above specification may be violated ahead of time, and issue an advisory to the human operator in such a way that she has sufficient time to respond.

Li et al. [33] show how these requirements can be made precise for the case of controller synthesis from linear temporal logic.

3.2.3 Verification and Control

h-CPS are not only an application area for formal verification and control, but they have also given rise to new classes of problems.

For verification, one such problem is the verification of quantitative models of h-CPS inferred from experimental data. An example is the work on probabilistic modeling and verification of human driver behavior by Sadigh et al. [34]. Here the aim is to infer a Markov Decision Process (MDP) model for the whole closed-loop system (including human, controller, plant, and environment) from experimental data obtained from an industrial-scale car simulator. Since the model is inferred from an empirical data set that is incomplete, the model has estimation errors, e.g., in the transition probabilities. Therefore, in this case we infer a generalization of an MDP called a Convex-MDP (CMDP) [36], where the uncertainty in the values of transition probabilities is captured as a first-class component of the model. Puggelli et al. [36] show how one can extend algorithms for model checking to the CMDP model. Sadigh et al. [34] use that model to infer desired properties about human driver behavior, such as a quantitative evaluation of distracted driving.

In the setting of control of h-CPS, given that we have new kinds of specifications (as described earlier), control algorithms have to be modified to synthesize not only the autonomous controllers but also the advisory controllers. Li et al. [33] show how to modify standard algorithms for synthesis from LTL for this purpose. It is still an open problem to adapt other kinds of control algorithms to the h-CPS setting.

4. Education

The advent of massive open online courses (MOOCs) [37] promises to bring world-class education to anyone with Internet access. Moreover, it has placed a renewed focus on the development and use of computational aids for teaching and learning. MOOCs present a range of problems to which the field of formal methods has much to contribute. These include automatic grading, automated exercise generation, and virtual laboratory environments. In automatic grading, a computer program verifies that a candidate solution provided by a student is "correct", i.e., that it meets certain instructor-specified criteria (the specification). In addition, and particularly when the solution is incorrect, the automatic grader (henceforth, autograder) should provide feedback to the student as to where he/she went wrong. Automatic
exercise generation is the process of synthesizing problems (with associated solutions) that test students' understanding of course material, often starting from instructor-provided sample problems. Finally, for courses involving laboratory assignments, a virtual laboratory (henceforth, lab) seeks to provide the remote student with an experience similar to that provided in a real, on-campus lab.

In this section we briefly describe two applications of formal methods to education. Both applications have been to an undergraduate course at UC Berkeley, Introduction to Embedded Systems [38]. In this course, students learn theoretical content on modeling, design, and analysis [39], and also perform lab assignments on programming an embedded platform interfaced to a mobile robot [40]. Thus, this course provides a suitable setting to investigate the range of problems associated with MOOCs that are mentioned in the preceding paragraph.

4.1 Exercise Generation

Consider first the task of automatic exercise generation. It is unrealistic and also somewhat undesirable to completely remove the instructor from the problem generation process, since this is a creative process that requires the instructor's input to emphasize the right concepts. However, some aspects of problem generation can be tedious for an instructor, and moreover, generating customized problems for students in a MOOC is impossible without some degree of automation. Additionally, creating many different versions of a problem can be effective at reducing cheating by blind copying of solutions.

Examining problems from all three parts of the Lee and Seshia textbook [39], Sadigh et al. [41] take a template-based approach to automatic problem generation. Specifically, several existing exercises in the book are shown to conform to a template. The template identifies common elements of these problems while representing the differentiating elements as parameters or "holes". In order to create a new problem, the template essentially must be instantiated with new parameter values. However, it is often useful to create new problems that are "similar" in difficulty to existing hand-crafted problems. To facilitate this, new problems are generated using a bounded number of mutations to an existing problem, under suitable constraints and pruning to ensure well-defined results. An instructor can then select results that look reasonable to him or her.

For brevity, we outline some of the main insights reported in [41] as they relate to the application of formal methods. The first insight relates to the structure of exercises. After investigating the exercises from certain relevant chapters (Ch. 3, 4, 9, 12, 13) of Lee and Seshia [39], we found that more than 60% of the problems fit into the model-based category, where the problem tests concepts involving relationships between models, properties and traces. Fig. 6 is an illustration of the three entities and their characteristics. At any point, given one or two of these entities, we can ask about instances of the unknown entity. Table 1 groups exercises into different classes based on what is given and what is to be found. Each group represents an interaction between models, properties and traces. The first column shows the given entity, and the second column is the unknown entity. The third column shows some of the variations of the same class of problem.

Table 2 states a solution technique for each problem category listed in Table 1. Note that major topics investigated in formal methods, such as model checking, specification mining, and synthesis, can be applied to various tasks in exercise generation. Moreover, since textbook problems are typically smaller than those arising in industrial use, their size is within the capacity of existing tools for synthesis and verification.

Fig. 6 : Models, Properties and Traces: Entities in exercises in the Lee and Seshia textbook [39] (reproduced from [41]).

Table 1 : Classification of Model-Based Problems in Lee and Seshia [39], First Edition, Version 1.06 (reproduced from [41])

Given | Find | Variations | Exercise #
φ | M ⊨ φ | (i) φ in English or LTL; (ii) use hybrid systems for M; (iii) modify a pre-existing M | 3.1, 3.2, 3.3, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8, 9.4, 9.6, 13.2, 13.3
M | ψ | (i) reachable trace; (ii) describe output | 3.3, 3.5, 4.2
M | φ | Models can be given in code or formal description | 3.2, 12.3
M & ψ | ψ | Given input trace, find output trace | 9.5
M & φ | ψ | Find counterexample or witness trace | 3.4, 4.3, 12.1

Table 2 : Techniques to Find Solutions for Model-Based Problems (reproduced from [41])

Given | Find | Solution Technique
φ | M ⊨ φ | Constrained Synthesis or Repair
M | ψ | Simulation of Model
M | φ | Specification Mining
M & ψ | ψ | Simulation with Guidance
M & φ | ψ | Model Checking

Fig. 7 : Cal Climber Laboratory Platform.

4.2 Grading and Feedback in Virtual Laboratories

Lab-based courses that are not software-only pose a particular technical challenge for MOOCs. A key component of learning in lab-based courses is working with hardware, getting "one's hands dirty." It appears to be impossible to provide that experience online. And yet, it would be useful to provide a learning experience that approximates the real lab as well as possible. Indeed, in industrial design one often prototypes a design in a simulated environment before building the real artifact.

In an ideal world, we would provide an infrastructure where students can log in remotely to a computer which has been preconfigured with all development tools and laboratory exercises; in fact, pilot projects exploring this approach have already been undertaken (e.g., see [42]). However, in the MOOC setting, the large number of students makes such a remotely-accessible physical lab expensive and impractical. A virtual lab environment, driven by simulation of real-world environments, appears to be the only solution at present.

The author and colleagues have taken initial steps towards addressing this challenge for lab-based education [43], [44]. The main contribution to date is CPSGrader, which combines virtual lab software with automatic grading and feedback for courses in the areas of cyber-physical systems and robotics [44], [45], [46]. In particular, CPSGrader has been successfully used in Introduction to Embedded Systems at UC Berkeley [38] and its online counterpart on edX [47]. In the lab component of this course, students program the Cal Climber [40] robot (see Fig. 7) to perform certain navigation tasks like obstacle avoidance and hill climbing. Students can prototype their controller to work within a simulated environment based on the LabVIEW Robotics Environment Simulator by National Instruments (see Figures 8 and 9).

Fig. 8 : Cal Climber in the LabVIEW Robotics Environment Simulator.
Fig. 9 : Simulator with auto-grading functionality used in EECS 149.1x.
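CPSGrader grades a student controller by simulating it and evaluating temporal-logic test benches over the resulting trace. The following toy sketch conveys only the flavor of such trace checking; the trace, signal names, and thresholds are invented for illustration, and real STL test benches [48] use richer operators and quantitative (robustness) semantics.

```python
# Minimal sketch of checking a simulated robot trace against simple
# temporal properties, in the spirit of (but far simpler than) the
# STL test benches used by CPSGrader.  Trace data is invented.

def always(trace, pred):
    """G pred: the predicate holds at every sample of the trace."""
    return all(pred(sample) for sample in trace)

def eventually(trace, pred):
    """F pred: the predicate holds at some sample of the trace."""
    return any(pred(sample) for sample in trace)

# Recorded per-timestep samples: (distance_to_obstacle, altitude).
trace = [(1.0, 0.0), (0.6, 0.1), (0.4, 0.3), (0.5, 0.6)]

# Test bench: never get closer than 0.3 to the obstacle (obstacle
# avoidance), and eventually climb above 0.5 (hill climbing).
safe = always(trace, lambda s: s[0] > 0.3)
climbed = eventually(trace, lambda s: s[1] > 0.5)
verdict = "pass" if (safe and climbed) else "fail"
print(verdict)  # → pass
```

A grader built this way can also report which property failed, giving the student targeted feedback rather than a bare score.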
The virtual lab dynamical model is a complex, physics-based one. CPSGrader employs simulation-based verification, the only scalable formal approach. Correctness and the presence of certain classes of mistakes are both checked using test benches formalized in Signal Temporal Logic (STL) [48]. However, coming up with these STL properties can be tedious and error-prone, even for instructors well-versed in formal methods. Therefore, Juniwal et al. [44] show how these temporal logic testers can be generated via machine learning from solutions that have the fault (positive examples) and those that do not (negative examples). An active learning framework has also been developed to ease the burden of labeling solutions as positive or negative [45]. In machine learning terminology, this can be thought of as the training phase. The resulting test bench then becomes the classifier that determines whether a student solution is correct, and, if not, which fault is present. CPSGrader was used successfully in the edX course EECS149.1x offered in May-June 2014 [47].

There are several interesting directions for future work, including developing quantitative methods for assigning partial credit, mining temporal logic testers to capture new mistakes, and online monitoring of these testers to improve responsiveness.

5. Conclusion

In summary, in this paper we have outlined some exciting trends in formal methods and its applications. These include the combination of formal methods and inductive machine learning, and new applications to cyber-physical systems and education. Apart from these trends, there are many other exciting topics, including new research in computational engines for formal methods (e.g., [49]) and emerging applications to areas such as synthetic biology [50] and computational music [51], [52].

Acknowledgments

The author thanks Prof. R. K. Shyamasundar, who encouraged him to write this article. Some of the work outlined in Sec. 2 is joint with Susmit Jha. The research described in Sec. 3.1 is joint with Indranil Saha, Rattanachai Ramaithitima, Vijay Kumar, and George Pappas; that in Sec. 3.2 is joint with Wenchao Li, Dorsa Sadigh, and S. Shankar Sastry; and the work in Sec. 4 is joint with Alexandre Donzé, Mona Gupta, Jeff Jensen, Garvit Juniwal, Edward A. Lee, and Dorsa Sadigh. The work described in this paper was supported in part by an Alfred P. Sloan Research Fellowship, NSF CAREER grant #0644436, NSF Expeditions grant #1139138, and TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
topics including new research in computational en- of synchronization skeletons using branching-time temporal that determines whether a student solution is correct, and, if outlinedIn summary, in Sec. in 2 this is joint paper with we Susmit have outlined Jha. The some research ex- not, which fault is present. CPSGrader was used successfully ginesdescribed for in formal Sec. 3.1 methods is joint with (e.g., Indranil [49]) Saha, and Rattanachai emerging logic,” in Logic of Programs, 1981, pp. 52–71. in the edX course EECS149.1x offered in May-June 2014 [47]. applicationsRamaithitima,citing trends to Vijay in areas formal Kumar, such methods and as George synthetic and Pappas, its biology applications. that in [50] Sec. [4] J.-P. Queille and J. Sifakis, “Specification and verification of These include the combination of formal methods concurrent systems in CESAR,” in Symposium on Program- based verification, the only scalable formal ap- and computational music [51], [52]. ming, ser. LNCS, no. 137, 1982, pp. 337–351. proach. Correctness and the presence of cer- and inductive machine learning, and new applica- [5] E. M. Clarke, O. Grumberg, and D. A. Peled, Model Checking. tain classes of mistakes are both checked using tions to cyber-physical systems and education. Apart MIT Press, 2000. CSI Journal of Computing | Vol. 2 • No. 4, 2015 test benches formalized in Signal Temporal Logic ACKNOWLEDGMENTS (STL) [48]. However, coming up with these STL The author thanks Prof. R. K. Shyamasundar who properties can be tedious and error-prone, even for encouraged him to write this article. Some of the instructors well-versed in formal methods. There- work outlined in Sec. 2 is joint with Susmit Jha. The fore, Juniwal et al. [44] show how these temporal research described in Sec. 
3.1 is joint with Indranil logic testers can be generated via machine learning Saha, Rattanachai Ramaithitima, Vijay Kumar, and from solutions that have the fault (positive exam- George Pappas, that in Sec. 3.2 is joint with Wen- ples) and those that do not (negative examples). An chao Li, Dorsa Sadigh, and S. Shankar Sastry, and active learning framework has also been developed the work in Sec. 4 is joint with Alexandre Donze,´ to ease the burden of labeling solutions as positive Mona Gupta, Jeff Jensen, Garvit Juniwal, Edward or negative [45]. In machine learning terminology, A. Lee, and Dorsa Sadigh. The work described in this can be thought of as the training phase. The this paper was supported in part by an Alfred P. resulting test bench then becomes the classifier that Sloan Research Fellowship, NSF CAREER grant determines whether a student solution is correct, #0644436, NSF Expeditions grant #1139138, and and, if not, which fault is present. CPSGrader was TerraSwarm, one of six centers of STARnet, a Semi- used successfully in the edX course EECS149.1x conductor Research Corporation program sponsored offered in May-June 2014 [47]. by MARCO and DARPA. There are several interesting directions for future work, including developing quantitative methods REFERENCES for assigning partial credit, mining temporal logic [1] J. M. Wing, “A specifier’s introduction to formal methods,” testers to capture new mistakes, and online moni- IEEE Computer, vol. 23, no. 9, pp. 8–24, September 1990. toring of these testers to improve responsiveness. [2] E. M. Clarke and J. M. Wing, “Formal methods: State of the art and future directions,” ACM Computing Surveys (CSUR), vol. 28, no. 4, pp. 626–643, 1996. 5CONCLUSION [3] E. M. Clarke and E. A. Emerson, “Design and synthesis of synchronization skeletons using branching-time temporal In summary, in this paper we have outlined some ex- logic,” in Logic of Programs, 1981, pp. 52–71. citing trends in formal methods and its applications. 

REFERENCES

[1] J. M. Wing, "A specifier's introduction to formal methods," IEEE Computer, vol. 23, no. 9, pp. 8–24, September 1990.
[2] E. M. Clarke and J. M. Wing, "Formal methods: State of the art and future directions," ACM Computing Surveys (CSUR), vol. 28, no. 4, pp. 626–643, 1996.
[3] E. M. Clarke and E. A. Emerson, "Design and synthesis of synchronization skeletons using branching-time temporal logic," in Logic of Programs, 1981, pp. 52–71.
[4] J.-P. Queille and J. Sifakis, "Specification and verification of concurrent systems in CESAR," in Symposium on Programming, ser. LNCS, no. 137, 1982, pp. 337–351.
[5] E. M. Clarke, O. Grumberg, and D. A. Peled, Model Checking. MIT Press, 2000.
[6] S. Owre, J. M. Rushby, and N. Shankar, "PVS: A prototype verification system," in 11th International Conference on Automated Deduction (CADE), ser. Lecture Notes in Artificial Intelligence, D. Kapur, Ed., vol. 607. Springer-Verlag, June 1992, pp. 748–752.
[7] M. Kaufmann, P. Manolios, and J. S. Moore, Computer-Aided Reasoning: An Approach. Kluwer Academic Publishers, 2000.
[8] M. J. C. Gordon and T. F. Melham, Introduction to HOL: A Theorem Proving Environment for Higher-Order Logic. Cambridge University Press, 1993.
[9] S. Malik and L. Zhang, "Boolean satisfiability: From theoretical hardness to practical success," Communications of the ACM (CACM), vol. 52, no. 8, pp. 76–82, 2009.
[10] R. E. Bryant, "Graph-based algorithms for Boolean function manipulation," IEEE Transactions on Computers, vol. C-35, no. 8, pp. 677–691, August 1986.
[11] C. Barrett, R. Sebastiani, S. A. Seshia, and C. Tinelli, "Satisfiability modulo theories," in Handbook of Satisfiability, A. Biere, H. van Maaren, and T. Walsh, Eds. IOS Press, 2009, vol. 4, ch. 8.
[12] S. A. Seshia, "Sciduction: Combining induction, deduction, and structure for verification and synthesis," in Proceedings of the Design Automation Conference (DAC), June 2012, pp. 356–365.
[13] ——, "Sciduction: Combining induction, deduction, and structure for verification and synthesis," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2011-68, May 2011.
[14] S. A. Seshia, N. Sharygina, and S. Tripakis, "Modeling for verification," in Handbook of Model Checking, E. M. Clarke, T. Henzinger, and H. Veith, Eds. Springer, 2014, ch. 3.
[15] P. Cousot and R. Cousot, "Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints," in Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. ACM, 1977, pp. 238–252.
[16] R. Kurshan, "Automata-theoretic verification of coordinating processes," in 11th International Conference on Analysis and Optimization of Systems – Discrete Event Systems, ser. LNCS, vol. 199. Springer, 1994, pp. 16–28.
[17] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[18] D. Angluin and C. H. Smith, "Inductive inference: Theory and methods," ACM Computing Surveys, vol. 15, pp. 237–269, Sept. 1983.
[19] E. M. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith, "Counterexample-guided abstraction refinement," in 12th International Conference on Computer Aided Verification (CAV), ser. Lecture Notes in Computer Science, vol. 1855. Springer, 2000, pp. 154–169.
[20] R. Alur, R. Bodik, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa, "Syntax-guided synthesis," in Proceedings of the IEEE International Conference on Formal Methods in Computer-Aided Design (FMCAD), October 2013.
[21] D. Angluin, "Queries and concept learning," Machine Learning, vol. 2, pp. 319–342, 1988.
[22] A. Solar-Lezama, L. Tancau, R. Bodík, S. A. Seshia, and V. Saraswat, "Combinatorial sketching for finite programs," in ASPLOS, 2006.
[23] A. Gupta, "Learning abstractions for model checking," Ph.D. dissertation, Computer Science Department, Carnegie Mellon University, 2006.
[24] S. A. Goldman and M. J. Kearns, "On the complexity of teaching," Journal of Computer and System Sciences, vol. 50, pp. 20–31, 1995.
[25] S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari, "Oracle-guided component-based program synthesis," in Proceedings of the 32nd International Conference on Software Engineering (ICSE), 2010, pp. 215–224.
[26] S. Jha and S. A. Seshia, "Are there good mistakes? A theoretical analysis of CEGIS," in 3rd Workshop on Synthesis (SYNT), July 2014.
[27] E. A. Lee and S. A. Seshia, Introduction to Embedded Systems: A Cyber-Physical Systems Approach, First Edition, http://leeseshia.org, 2011.
[28] E. A. Lee, B. Hartmann, J. Kubiatowicz, T. S. Rosing, J. Wawrzynek, D. Wessel, J. M. Rabaey, K. Pister, A. L. Sangiovanni-Vincentelli, S. A. Seshia, D. Blaauw, P. Dutta, K. Fu, C. Guestrin, B. Taskar, R. Jafari, D. L. Jones, V. Kumar, R. Mangharam, G. J. Pappas, R. M. Murray, and A. Rowe, "The swarm at the edge of the cloud," IEEE Design & Test, vol. 31, no. 3, pp. 8–20, 2014.
[29] "KMel Robotics." [Online]. Available: http://kmelrobotics.com/
[30] I. Saha, R. Ramaithitima, V. Kumar, G. J. Pappas, and S. A. Seshia, "Automated composition of motion primitives for multi-robot systems from safe LTL specifications," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 2014.
[31] Federal Aviation Administration (FAA), "The interfaces between flight crews and modern flight systems," http://www.faa.gov/avr/afs/interfac.pdf, 1995.
[32] L. T. Kohn, J. M. Corrigan, and M. S. Donaldson, Eds., "To err is human: Building a safer health system," A report of the Committee on Quality of Health Care in America, Institute of Medicine, Washington, DC, Tech. Rep., National Academy Press, 2000.
[33] W. Li, D. Sadigh, S. Sastry, and S. A. Seshia, "Synthesis of human-in-the-loop control systems," in Proceedings of the 20th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), April 2014.
[34] D. Sadigh, K. Driggs-Campbell, A. Puggelli, W. Li, V. Shia, R. Bajcsy, A. L. Sangiovanni-Vincentelli, S. S. Sastry, and S. A. Seshia, "Data-driven probabilistic modeling and verification of human driver behavior," in Formal Verification and Modeling in Human-Machine Systems, AAAI Spring Symposium, March 2014.
[35] National Highway Traffic Safety Administration, "Preliminary statement of policy concerning automated vehicles," May 2013.
[36] A. Puggelli, W. Li, A. Sangiovanni-Vincentelli, and S. A. Seshia, "Polynomial-time verification of PCTL properties of MDPs with convex uncertainties," in Proceedings of the 25th International Conference on Computer-Aided Verification (CAV), July 2013.
[37] "The Year of the MOOC," New York Times, November 2012.


[38] E. A. Lee and S. A. Seshia, EECS 149 course website, http://chess.eecs.berkeley.edu/eecs149, University of California, Berkeley.
[39] ——, Introduction to Embedded Systems: A Cyber-Physical Systems Approach. Berkeley, CA: LeeSeshia.org, 2011.
[40] J. C. Jensen, E. A. Lee, and S. A. Seshia, An Introductory Lab in Embedded and Cyber-Physical Systems. Berkeley, CA: LeeSeshia.org, 2012.
[41] D. Sadigh, S. A. Seshia, and M. Gupta, "Automating exercise generation: A step towards meeting the MOOC challenge for embedded systems," in Proc. Workshop on Embedded Systems Education (WESE), October 2012.
[42] Massachusetts Institute of Technology (MIT), "The iLab Project," https://wikis.mit.edu/confluence/display/ILAB2/Home, Last accessed: February 2014.
[43] J. C. Jensen, E. A. Lee, and S. A. Seshia, "Virtualizing cyber-physical systems: Bringing CPS to online education," in Proc. First Workshop on CPS Education (CPS-Ed), April 2013.
[44] G. Juniwal, A. Donzé, J. C. Jensen, and S. A. Seshia, "CPSGrader: Synthesizing temporal logic testers for auto-grading an embedded systems laboratory," in Proceedings of the 14th International Conference on Embedded Software (EMSOFT), October 2014.
[45] G. Juniwal, "CPSGrader: Auto-grading and feedback generation for cyber-physical systems education," Master's thesis, EECS Department, University of California, Berkeley, Dec 2014. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-237.html
[46] A. Donzé, G. Juniwal, J. C. Jensen, and S. A. Seshia, CPSGrader website, http://www.cpsgrader.org, UC Berkeley.
[47] E. A. Lee, S. A. Seshia, and J. C. Jensen, EECS 149.1x course website on edX, https://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-eecs149-1x-cyber-physical-1629, UC BerkeleyX.
[48] O. Maler and D. Nickovic, "Monitoring temporal properties of continuous signals," in FORMATS/FTRTFT, 2004, pp. 152–166.
[49] S. Chakraborty, D. J. Fremont, K. S. Meel, S. A. Seshia, and M. Y. Vardi, "Distribution-aware sampling and weighted model counting for SAT," in Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), July 2014.
[50] S. Srivastava, J. Kotker, S. Hamilton, P. Ruan, J. Tsui, J. C. Anderson, R. Bodik, and S. A. Seshia, "Pathway synthesis using the Act ontology," in Proceedings of the 4th International Workshop on Bio-Design Automation (IWBDA), June 2012.
[51] A. Donzé, S. Libkind, S. A. Seshia, and D. Wessel, "Control improvisation with application to music," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2013-183, Nov 2013. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-183.html
[52] A. Donzé, R. Valle, I. Akkaya, S. Libkind, S. A. Seshia, and D. Wessel, "Machine improvisation with formal specifications," in Proceedings of the 40th International Computer Music Conference (ICMC), September 2014.
[53] S. Jha and S. A. Seshia, "A Theory of Formal Synthesis via Inductive Learning," ArXiv e-prints, May 2015.

About the Author

Sanjit A. Seshia received the B.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology, Bombay, India in 1998, and the M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University, Pittsburgh, PA, USA, in 2000 and 2005 respectively.

He is currently an Associate Professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, CA, USA. His research interests are in dependable computing and computational logic, with a current focus on applying automated formal methods to problems in embedded and cyber-physical systems, electronic design automation, computer security, and synthetic biology. His Ph.D. thesis work on the UCLID verifier and decision procedure helped pioneer the area of satisfiability modulo theories (SMT) and SMT-based verification. He is co-author of a widely-used textbook on embedded systems. He led the offering of a massive open online course on cyber-physical systems for which his group developed novel virtual lab auto-grading technology based on formal methods.

Prof. Seshia has served as an Associate Editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, and as co-chair of the Program Committee of the International Conference on Computer-Aided Verification (CAV) in 2012. His awards and honors include a Presidential Early Career Award for Scientists and Engineers (PECASE) from the White House, an Alfred P. Sloan Research Fellowship, the Prof. R. Narasimhan Lecture Award, and the School of Computer Science Distinguished Dissertation Award at Carnegie Mellon University.

CSI Journal of Computing | Vol. 2 • No. 4, 2015

Static Analysis and Symbolic Code Execution

Dhiren Patel, Madhura Parikh and Reema Patel*

* National Institute of Technology, Surat, Gujarat, India E-mail: (dhiren29p, madhuraparikh and reema.mtech)@gmail.com

As software becomes increasingly pervasive and affects critical areas of application, verifying the correctness of programs can no longer be neglected. Several approaches in the past have utilized testing to prove program correctness, but this is an incomplete approach. The second alternative is to perform static analysis and model checking. While this is an exhaustive approach, it has several limitations, viz. high cost, poor scalability, and spurious warnings. In this paper, we explore symbolic execution, which takes the middle way between these two extremes and has the potential to be applicable in real-world settings. Initially proposed in the 1970s, symbolic execution has recently regained focus amongst researchers. Starting from its birth, we survey tools that have successfully implemented symbolic execution and the modifications that have been proposed to make it widely applicable. We also examine the research challenges that exist with this approach and how well-received it has been in industry.

Index Terms : Static Analysis, Symbolic Execution, Dynamic Test Generation, Concolic Execution, Compositional Testing.

1. Introduction

As software is being used in increasingly complex and critical applications, the importance of program verification cannot be stressed enough. Conventionally, two approaches to verification have been cited - static analysis and dynamic analysis. Both these approaches, however, have their own shortcomings that have curtailed their use in practice. We first introduce these conventional methods of verification, examining their pros and cons. We then look at the idea of symbolic execution, as it was originally proposed by King [1] in 1976. Symbolic execution was essentially a static analysis method. However, it has recently evolved to a new form, imbibing advantages of both static and dynamic code analysis. The possible areas of application of this novel form of symbolic execution are discussed in later sections.

1.1 Static Analysis of Programs

An excellent introduction and review of static analysis can be found in the book by Nielson [2]. Static analysis of code, as the name suggests, tries to analyze the program behavior without actually executing the code. Thus the correctness of the program is analyzed by applying formal techniques to the source code or the object code. One of the earliest examples of a static analysis tool is the well-known Lint program for C. Such bug-finding tools generally use static analysis only in a shallow way, meaning that they look out for unexpected constructs in the program. This means that they will also raise an increased number of false alarms, while missing out on subtle errors. An important property of any static analysis approach is whether it is sound and complete. A sound approach is guaranteed to find any true error in the program; a complete one will find only errors that are true. Some of the commonly used methods of formal static analysis are summarized below.

1.1.1 Type Systems

Types are most widely used in static analysis. This is because most of the major programming languages use the concept of types. A type is just a set of values: for example, the type Int would denote all the integers, whereas the type Bool in a language like C++ would represent the set {true, false}. The concept of types in fact follows from the concept of abstract interpretation, which is used in any static analysis approach. Ideally, static analysis should be able to represent all possible executions of a program. This is, however, undecidable in practice, so a compromise is to consider only abstract values rather than the infinite set of all possible concrete values. Thus Int is an abstraction for the infinite possible integers.
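As a toy illustration of computing with abstract values (our own example, not from the text), consider a sign domain in which each abstract value stands for a whole set of integers:

```python
# Toy abstract interpretation over a sign domain: each abstract value
# stands for a set of integers, just as Int abstracts all integers.
# NEG = negatives, ZERO = {0}, POS = positives, TOP = any integer.

NEG, ZERO, POS, TOP = "NEG", "ZERO", "POS", "TOP"

def abstract(n):
    """Map a concrete integer to its abstract sign value."""
    return ZERO if n == 0 else (POS if n > 0 else NEG)

def add(a, b):
    """Abstract addition: sound but imprecise, since the sign of
    POS + NEG cannot be determined without concrete values."""
    if ZERO in (a, b):
        return b if a == ZERO else a
    if a == b and a in (POS, NEG):
        return a          # pos + pos is pos, neg + neg is neg
    return TOP            # sign statically unknown

print(add(abstract(3), abstract(4)))    # POS: holds for all positive pairs
print(add(abstract(3), abstract(-4)))   # TOP: result sign is unknown
```

The loss of precision at TOP is the price paid for analyzing all possible executions at once instead of enumerating concrete inputs.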
1.1.2 Data Flow Analysis

Data flow analysis usually requires us to construct a Control Flow Graph (CFG). This is a directed graph in which each node is a statement and edges represent the flow of control. Data flow analysis tries to find facts about the program by considering all possible states of the program. It includes different analyses such as liveness analysis, reaching definitions, etc., that are used in optimizing compilers. Detailed literature on data-flow analysis may be found in any standard text on compilers, such as [3].
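As a small self-contained illustration (our own sketch, not drawn from [3]), liveness analysis can be computed as a backward fixpoint over a toy CFG:

```python
# Illustrative sketch: liveness analysis as a backward data-flow
# fixpoint over a tiny CFG. Each node records the variables it uses
# (gen) and defines (kill); a variable is live into a node if it is
# used there, or live out of the node without being redefined.

# CFG for:  1: a = 0   2: b = a + 1   3: return b
cfg = {
    1: {"gen": set(),  "kill": {"a"}, "succ": [2]},
    2: {"gen": {"a"},  "kill": {"b"}, "succ": [3]},
    3: {"gen": {"b"},  "kill": set(), "succ": []},
}

def liveness(cfg):
    """Iterate live-in sets to a fixpoint (backward analysis)."""
    live_in = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, node in cfg.items():
            live_out = set()
            for s in node["succ"]:          # union of successors' live-in
                live_out |= live_in[s]
            new_in = node["gen"] | (live_out - node["kill"])
            if new_in != live_in[n]:
                live_in[n], changed = new_in, True
    return live_in

print(liveness(cfg))  # 'a' is live entering node 2, 'b' entering node 3
```

Reaching definitions is the analogous forward analysis; both are instances of the same fixpoint scheme over the CFG.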


1.1.3 Model Checking

Model checking usually uses a formal specification language, such as temporal logic or finite automata, to represent the specifications of a program. The state space is then exhaustively searched to ensure that the specifications are met correctly. It is an approach that suffers from state-space explosion. Bounded model checking examines only a prefix of all possible executions; however, this may result in later errors being missed.
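The exhaustive-search idea, and the effect of bounding it, can be illustrated with a minimal explicit-state sketch (our own toy example, not any particular model checker):

```python
# Minimal explicit-state model checking sketch: exhaustively search
# the state space of a toy counter system and check a safety property.
# A depth bound mimics bounded model checking's prefix-only search.
from collections import deque

def successors(state):
    """Toy system: a counter mod 8 that can increment or reset to 0."""
    return [(state + 1) % 8, 0]

def check_safety(init, safe, bound=None):
    """BFS over reachable states; return a violating state or None."""
    seen, queue = {init}, deque([(init, 0)])
    while queue:
        state, depth = queue.popleft()
        if not safe(state):
            return state                 # counterexample found
        if bound is not None and depth == bound:
            continue                     # stop at the bound, as in BMC
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

print(check_safety(0, lambda s: s < 8))           # None: property holds
print(check_safety(0, lambda s: s < 5, bound=3))  # None: bug lies past the bound
print(check_safety(0, lambda s: s < 5))           # 5: full search finds it
```

The second call shows exactly the "later errors being missed" effect: the violating state 5 is reachable only at depth 5, beyond the bound of 3.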
1.2 Dynamic Analysis of Programs

Dynamic analysis is mainly done through testing and debugging. Testing runs a program on a subset of the possible inputs to observe whether a failure occurs. Since it is not possible to run the program exhaustively over all possible inputs, testing has the problem of absence of coverage. One important concern is that test case generation should be automated if it is to be applicable to commercial programs that are thousands of lines long. The way these tests are generated is important, since naively generating the test cases in a random manner will likely miss several errors.

Debugging involves pinpointing the exact location of errors and fixing them. In earlier times, debugging was usually achieved by inserting print statements at suspicious locations in the program. With the advent of IDEs and better debuggers, this situation is somewhat eased; however, debugging even today is highly arcane and tedious, requiring much manual involvement.

A good visualization of these different methods is shown in Figures 1, 2 and 3. In each of these figures, x(t) is a vector of the input state and output of the program that evolves as a function of time t. Figure 1 shows the different possible trajectories that a program may take when executed. As Figure 2 shows, model checking tries to cover each of the trajectories; however, to be practical, it must be bounded, and thus it misses any late errors. On the other hand, testing, as shown in Figure 3, will leave out some of the trajectories entirely while some are covered completely. Clearly even this is not satisfactory.

Fig. 1 : All the trajectories in a program execution space [4]
Fig. 2 : Trajectory coverage with model-checking [4]
Fig. 3 : Trajectory coverage with dynamic analysis [4]

1.3 Symbolic Execution : Old Paradigm Gets New Life

The concept of symbolic execution was first proposed by King in his seminal paper [1]. We first try to understand the concept in the form it was originally proposed. As we have already noted, program verification is conventionally done either through proving by formal techniques, or by testing with a random input set. Symbolic execution is a middle way between these two extremes. In symbolic execution, the program is executed, but the input fed to the program is symbolic rather than concrete. This means that each symbolic input stands for an entire class of inputs rather than an individual input. Thus, while testing checks the program only on a very small subset of the possible inputs and verification checks the program over all of the possible inputs, symbolic execution checks the program over classes of the input. So it may be considered a more generalized form of testing or a less stringent verification.

Whenever a program is executed concretely, the program state usually consists of the program counter and the values of the program variables. When a program is to be executed symbolically, one additional piece of information must also be maintained in the so-called symbolic program state. This is the path condition (pc). The pc is the accumulation of all the criteria that the input must satisfy for an execution to follow the associated path [1]. As we encounter conditions or branches in the program flow, such as IF statements, these conditions are conjoined with the pc along that path. Thus a pc exists corresponding to each alternative path that the program may follow. This pc is thus a Boolean formula. In order to reason about whether a particular control flow is possible for the program, we should check whether the corresponding Boolean formula that represents its pc is satisfiable. For this, powerful SMT/SAT solvers are used. Of course, many satisfiability problems are beyond the scope of these solvers, but efficient solvers such as Z3, STP and Yices do exist for theories like linear arithmetic.
to run theprogram Testing tries exhaustively to run a overpro- ods is shown in Figures 1 2 3. In each of gramall possible on a subset inputs, of testing the possiblehas the problem inputs toof these figures x(t) is a vector of the input state observeabsence if of a failurecoverage. occurs One or important not. Since concern it is not is andA goodoutput visualization of the program of these that different evolves meth- as a possiblethat the to test run case the generationprogram exhaustively should be auto- over odsfunction is shown of time in t. Figure Figures 1 shows 1 2 3. the In possible each of allmated possible if it inputs, is to be testing applicable has the to problemcommercial of thesedifferent figures trajectories x(t) is a that vector a program of the input may take state absenceprograms of coverage.that are thousands One important of lines concern long. The is andup when output executed. of the Asprogram Figure that 2 shows, evolves model as a thatway the these test tests case aregeneration generated should is important, be auto- functionchecking of tries time to t.cover Figure each 1 showsof the trajectories, the possible matedsince naivelyif it is to generating be applicable the test to cases commercial in any differenthowever totrajectories be practical, that it a must program be bounded. may take programsrandom manner that are may thousands likely miss of lines out on long. several The upThus when it misses executed. out on As any Figure late 2 errors. shows, On model the wayerrors. these tests are generated is important, checkingother hand tries testing to cover as shown each of in the Figure trajectories, 3 will since naively generating the test cases in any however to be practical, it must be bounded. random manner may likely miss out on several Thus it misses out on any late errors. On the errors. other hand testing as shown in Figure 3 will CSI JOURNAL COMPUTING, VOL. XX, NO. 
XX, XXXXXXXXX 3 leave out some of the trajectories entirely while the criteria that the input must satisfy for an some are covered completely. Clearly even this execution to follow the associated path [1]. is not satisfactory. As we encounter, conditions or branches in the program flow, such as IF statements, these conditions are conjoined with the pc along that path. Thus a pc exists, corresponding to each alternative path that the program may follow. This pc is thus a Boolean formula. In order to reason about whether a particular control flow is possible for the program, we should check whether the corresponding Boolean formula that represents its pc is satisfiable. For this powerful SMT/SAT solvers are used. Of course while many satisfiability problems are beyond the scope of these solvers, many solvers do Fig. 3. Trajectory coverage with dynamic analy- exist for linear equations, e.g Z3, STP, Yices, etc. sis [4] Thus corresponding to each program, we can R2generate : 16 the execution tree that shows all the Static Analysis and Symbolic Code Execution possible execution paths of the program. One 1.3 Symbolic Execution : Old Paradigm such symbolic execution tree and the corre- Gets New Life the scope of these solvers, many solvers do exist for linear 1.4.3 Static detection of run time errors equations,sponding e.g codeZ3, STP, snippet Yices, etc. appear in the Figure 4. The concept of symbolic execution was first Here it is seen that at each point in a particular This uses symbolic execution to detect if program states, Thus corresponding to each program, we can generate that may lead to run time violations, exist. Such analysis proposed by King in his seminal paper [1]. We path, the pc is updated with some additional the execution tree that shows all the possible execution paths is especially useful in the case of malware analysis, where first try to understand the concept in the form ofconstraints the program. ifOne necessary. 
such symbolic If the execution final pctree forand athe it may be expensive to run the code in order to observe its it was originally proposed. As we have already correspondingparticular pathcode snippet is found appear to bein the un-satisfiable, Figure 4. Here noxious behavior. noted, program verification conventionally is it thenis seen that that state at each of thepoint program in a particular is unreachable. path, the pc is done either through proving by formal tech- updatedThe various with some path additional conditions constraints accrued if necessary. may beIf the 1.4.4 Invariant inference niques, or by testing with a random input set. finalused pc tofor generatea particular test path cases. is found All to concretebe un-satisfiable, values then that state of the program is unreachable. The various path By using symbolic execution, we may predict the pre Symbolic execution is a middle way between conditionsthat satisfy accrued the may path be conditionused to generate for a particulartest cases. All or post conditions that are likely to be maintained by the these two extremes. In symbolic execution, the concretepath arevalues guaranteed that satisfy tothe followpath condition that path for a particular when program. program is executed, but the input fed to the pathinputted are guaranteed to the program.to follow that path when inputted to the 1.4.5 Parallel numerical program analysis program is symbolic rather than concrete. This program. means that each symbolic input stands for an This is a new application of symbolic execution to entire class of inputs rather than individual establish that the parallel program is indeed equivalent to its input. Thus while testing tries to check the corresponding sequential counterpart. program only on a very small subset of the 1.4.6 Differential symbolic execution possible inputs and verification checks the pro- This is again an emerging area. 
Here two programs are gram over all of the possible inputs, symbolic compared to find out the logical difference between them. It execution checks the program over classes of may be used in software development and maintenance to the input. So it may be considered to be a more ensure for instance that re-factored code is equivalent to the generalized form of testing or a less stringent original source. verification. The rest of the paper is organized as follows: In section Whenever a program is executed concretely, 2 we discuss some tools such as KLEE that are developed using symbolic execution. Some exciting new trends that are the program state usually consists of the pro- Fig. 4 : A symbolic execution tree [5] gram counter and the values of the program Fig. 4. A symbolic execution tree [5] emerging in symbolic execution are described in section 3. variables. When a program is to be executed Section 4 examines some of the major challenges that symbolic It is important to note that classical symbolic execution, execution techniques must counter. Finally, in section 5, we symbolically, one additional piece of informa- will It work is importantonly for sequential to note programs. that classical In recent sym- times, see how well accepted the various symbolic execution tools tion must also be maintained in the so-called severalbolic execution,extensions to will this work classical only approach for sequential have been have been in industry. symbolic program state. This is the path con- proposed.programs. Excellent In recent reviews times, of the current several trends extensions in symbolic dition (pc). The pc is the accumulation of all executionto this classical are available approach in [6] [7]. have These been new proposed. approaches 2. Symbolic Execution: Case Studies include ideas like generalized symbolic execution, concolic symbolic execution, compositional symbolic execution, etc. 
In this section we explore two tools that have successfully They have made it possible to extend symbolic execution used symbolic execution for program analysis. Several to multi-threaded programs and advanced programming breakthrough tools have been proposed, but we focus on only constructs, such as recursive data-structures. We shall survey two such contemporary tools. Our first case study is on KLEE a few of these ideas in a later section. - which is an automatic test case generation framework for C. Since C is a conventional statically typed language, the study 1.4 Applications of Symbolic Execution of KLEE will acquaint us with how symbolic execution can be applied in such a mainstream language. Our second case The paper by Visser et al [8] covers a depth of areas to study deals with Kudzu - which is an automated vulnerability which symbolic execution may be applied. Here we enlist analysis tool for JavaScript. A study of Kudzu will give us some major areas of focus. a flavour of how symbolic execution may be applied to solve string constraints, a major application challenge. 1.4.1 Test case generation This is one of the traditional uses of symbolic execution. 2.1 Case Study 1: KLEE Since symbolic execution deals with classes of input, it can KLEE is one of the most successful implementations of be used to automate generation of test-cases with a high symbolic execution as a means for automated generation of degree of coverage. This has recently become well known as high coverage test cases. The paper that provides detailed whitebox testing. technical overview of KLEE is [9]. KLEE has two major execution goals: 1.4.2 Proof of program properties 1) Try to hit every line of code. This will help achieve high Symbolic execution can be successfully applied for proving coverage. that certain assertions, say for instance loop invariants, are 2) At each dangerous operation that is detected, try to indeed maintained when a program is executed.
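The core idea of accumulating a path condition per path and solving it for a concrete test input can be sketched in a few lines. The toy program, its hand-listed paths, and the brute-force "solver" below are our own illustrative inventions, standing in for a real execution tree and an SMT solver such as Z3:

```python
# Toy King-style symbolic execution: each path of a tiny program
#   if x > 5:  if x < 10: A  else: B
#   else: C
# carries a path condition (pc), a conjunction of branch predicates over
# the symbolic input x. Solving each pc yields one test case per path.

def prog_paths():
    """Return (path_name, pc) pairs; each pc is a list of conjuncts."""
    return [
        ("A", [lambda x: x > 5, lambda x: x < 10]),
        ("B", [lambda x: x > 5, lambda x: not (x < 10)]),
        ("C", [lambda x: not (x > 5)]),
    ]

def solve(pc, domain=range(-100, 100)):
    """Stand-in for an SMT solver: find any concrete input satisfying
    every conjunct of the path condition, or None if unsatisfiable."""
    for x in domain:
        if all(pred(x) for pred in pc):
            return x
    return None

# One generated test case per feasible path.
tests = {name: solve(pc) for name, pc in prog_paths()}
```

Any input returned by `solve` is guaranteed to drive the program down the corresponding path, which is exactly the property the generated test suites rely on.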

CSI Journal of Computing | Vol. 2 • No. 4, 2015 Dhiren Patel, et.al. R2 : 17

The authors of the KLEE paper were previously involved in the development of a similar tool, EXE [10]. They have based KLEE on their previous experiences and have shown that KLEE-generated tests could indeed achieve high test coverage of over 90%, sometimes even beating manually developed test suites. Here we briefly explore some major features.

2.1.1 Architecture

The KLEE framework is designed for:

Simplicity: It operates on the well known LLVM byte code. This means that it can be applied to a number of languages that are supported by LLVM. LLVM has a RISC-like instruction set, and the byte code is generated from the source code, which is compiled using the well known GCC compiler. Thus no special modifications need to be made in order to have the code analyzed by KLEE. KLEE can take in the raw code directly and generate the test cases, and is therefore quick and easy to use.

Scalability: Since there can be an exponential increase in the number of states, KLEE uses a compact state representation. It borrows the well-known principle of copy on write (vfork vs. fork) from Linux to reduce memory requirements. It also implements the heap as immutable and shares the heap amongst multiple states. Finally, it uses a compact expression representation (e.g. when it has the expression x+1=10, it simply stores this as x=9).

Speed: It achieves speed by trying to minimize the calls to the SAT solver, since these are the most expensive. It also uses simplified expressions as input to the constraint solver, and uses the concept of caches to dramatically improve speed.

Fig. 5 : Different components of KLEE [11]

Figure 5 shows how the different components of KLEE integrate to produce very accurate test cases for the sample code snippet on the lower-left in the figure.

2.1.2 Solver Optimization

The major time consumed by most symbolic execution tools is in trying to solve the constraints. Thus KLEE has used some clever techniques to greatly reduce this time cost. It uses the concept of constraint independence to give only the relevant portions of the memory to the solver for a particular constraint, rather than the complete store of all the variables. It uses a counter-example cache to store the results of previous queries. These results are checked to see if they satisfy a current query, since it is much easier to check a solution than to solve from scratch. It also tries to see whether the cached solution has some subset or superset relation to the current constraint's solution.

2.1.3 Major Challenges

KLEE has tried to answer two challenges:
1) How to model the interaction of the program with the environment: Many programs have command line arguments, environment variables, file handling, etc. To provide a realistic environment in such cases, the KLEE makers have tried to provide a symbolic environment, which the program sees when it is executed symbolically. This tries to model its interaction with real-life system calls and OS resources.
2) How to address exponential paths through the code: The state space may grow exponentially, so KLEE uses two strategies to determine which state to choose next from the current state. The first strategy is Random Path Selection - this uses a binary tree to represent the active states, and the tree is randomly traversed from the root. The approach is biased towards states higher up in the tree, where simpler constraints need to be solved. The second is Coverage Optimized Search, where it uses some heuristics to compute weights for each state and then randomly selects the state that is likely to offer better coverage.
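The bias of random path selection can be seen in a small sketch. The tree and state names below are hypothetical illustrations, not KLEE's actual data structures:

```python
# Random Path Selection, sketched: active states sit at the leaves of a
# binary execution tree, and a state is picked by walking randomly from
# the root. A leaf near the root (a state with fewer branch decisions,
# hence simpler constraints) is chosen more often than a deep leaf.
import random

class Node:
    def __init__(self, state=None, left=None, right=None):
        # A node is a leaf (active state) iff `state` is set.
        self.state, self.left, self.right = state, left, right

def random_path_select(root, rng):
    node = root
    while node.state is None:            # descend until an active state
        node = rng.choice([node.left, node.right])
    return node.state

# A shallow state "s1" competes with two deeper states "s2", "s3".
tree = Node(left=Node(state="s1"),
            right=Node(left=Node(state="s2"), right=Node(state="s3")))
rng = random.Random(0)                   # fixed seed for reproducibility
picks = [random_path_select(tree, rng) for _ in range(10000)]
# "s1" is selected about half the time; "s2" and "s3" about a quarter each.
```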
Fig. 6 : The various components in the design of Kudzu [12]
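The counter-example cache described in Section 2.1.2 can be sketched as follows. The textual constraints and brute-force solver here are our own simplification, not KLEE's implementation; the point is that checking a cached assignment is far cheaper than invoking the solver, and a solution cached for a superset of the current constraints also satisfies the current query:

```python
# Sketch of a counter-example cache: before calling the (expensive)
# solver, a new query is checked against previously computed models.

def evaluate(constraint, assignment):
    """Evaluate a textual constraint like 'x > 5' under an assignment."""
    return eval(constraint, {}, dict(assignment))

def brute_force_solver(constraints):
    """Stand-in for an SMT solver: search a small domain for a model."""
    for x in range(-50, 50):
        assignment = {"x": x}
        if all(evaluate(c, assignment) for c in constraints):
            return assignment
    return None

cache = {}          # frozenset of constraints -> satisfying assignment
solver_calls = []   # record how often the real solver is invoked

def query(constraints):
    for cached_solution in cache.values():
        # Cheap check: does a previously computed model satisfy us?
        if cached_solution and all(evaluate(c, cached_solution)
                                   for c in constraints):
            return cached_solution
    solver_calls.append(list(constraints))   # expensive path
    solution = brute_force_solver(constraints)
    cache[frozenset(constraints)] = solution
    return solution

a = query(["x > 5", "x < 10"])   # miss: solver runs, finds {"x": 6}
b = query(["x > 5"])             # hit: cached {"x": 6} already satisfies it
```

Only one solver call is needed for the two queries, because the second query's constraints are a subset of the first's.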


2.1.4 Evaluation Highlights

The metric for evaluation is line coverage. KLEE generated test cases for Coreutils, Minix, Busybox, etc. It was able to outperform manually designed test suites developed over a period of 15 years. The coverage was over 90%. It found several major flaws in GNU Coreutils, which is one of the most heavily tested program suites. Thus it was shown to have great potential for real world application.

2.2 Case Study 2: Kudzu

Kudzu [13] is an automated tool for finding client side code injection vulnerabilities in JavaScript. It is the first attempt of this kind. In a language like JavaScript, the input is in the form of strings. This means that the solver should be able to solve string constraints. To enable this, Kudzu introduces an expressive constraint language and a constraint solver, Kaluza, specifically geared for handling string constraints rather than numeric ones.

2.2.1 System Design

For the rich web applications that are created using AJAX, the input space is conceptually divided into two: the event space - the various event handler code, e.g. for mouse clicks or form submissions, that may execute in any order - and secondly the value space - the values provided by the user in form fields, text areas, URL parameters, etc. Kudzu uses the concept of GUI exploration to handle vulnerabilities associated with the event space, while using symbolic execution to handle the untrusted input. Next, it incorporates the powerful string constraint solver Kaluza. For this, the original JavaScript instructions are translated to a simple language - JASIL. A block diagram of the system architecture appears in Figure 6. Kudzu is thus an automated tool for finding security vulnerabilities in JavaScript code.

2.2.2 Constraint Language

The constraint language is very expressive for string constraints. Various constraints are provided to check whether a given string matches a given regular expression, to compare two strings for equality or concatenation, to compare the lengths of two strings, and to check the length of a string against some integer. All these facilities make the constraint solver more powerful than its contemporaries. Moreover, in spite of its high expressiveness, the constraint language is simple, making the constraint solver efficient.

2.2.3 Constraint Solver

Kaluza is a SAT based SMT constraint solver. It first of all takes in as input the JavaScript code and translates it to the core constraint language using a DFA-based approach. Next, it solves for various constraints, such as length and integer constraints. Finally, it takes the input strings and translates them into bit vector notation, by concatenating the binary representations of consecutive characters in the string. It then checks whether the bit vector notation of the string contents satisfies the constraints, using the SMT solver. It also uses the concept of k-boundedness, and expects the user to input the value of k - the maximum length of the strings to be included in the search space. This is essential, as otherwise the problem would become unbounded and practically unsolvable.

2.2.4 Evaluation

The solver is very successful at detecting client side code injection attacks. It was able to detect 2 new vulnerabilities when tested on 18 popular web apps, and also 9 known vulnerabilities that had earlier been detected using manual testing. The constraint solver is also highly efficient: when a constraint is satisfiable, it can find the solution in under a second; when it is not, it may take up to 50s to report failure. The only requirement is that the user must provide an upper bound on the search space.

3. Exciting New Areas in Symbolic Execution Research

Several recent trends have emerged that attempt to carry out some form of hybrid analysis, combining symbolic execution with other techniques in an attempt to improve the overall performance. Several such techniques have been proposed for testing critical applications, such as at NASA [14] [15]. We survey some such trends in the present section.

3.1 Concolic Execution: Combining Symbolic and Concrete Analysis

A very concise tutorial on this technique is available in [16]. Concolic execution combines both random testing and symbolic execution. Whenever a concrete execution is performed using random inputs, the path conditions for that particular path are also simultaneously constructed. In the next run, the gathered constraints are used to generate new inputs, which will drive the execution along a different branch. For example, by negating one constraint at a branch point, the new input generated will then guide the program along the other branch. This is repeated until no new constraints can be generated. Thus significant coverage is attained. At the same time, if some constraints are too difficult for the solver, then the concrete input comes to the rescue, by replacing some of the symbolic variables with concrete values. A good case study of such a hybrid system is DART [17]. DART is an algorithm for dynamic test case generation using random testing and symbolic execution. It is one of the precedents in this area, originally developed at Bell Labs. The major advantages of this approach are that the test generation is fully automated, not requiring any manual intervention, thanks to the path constraints generated by symbolic execution. Unlike random testing, it is also more effective, since the program execution may be directed along specific paths by using the path constraints.
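The negate-and-resolve loop at the heart of concolic execution can be sketched compactly. The instrumented toy program and brute-force "solver" below are our own illustrations in the spirit of DART, not its actual implementation; a deterministic starting input is used instead of a random one to keep the example reproducible:

```python
# Minimal concolic-testing loop: run concretely while recording branch
# predicates, then negate a recorded predicate and solve for an input
# that drives execution down the other side of that branch.

def program(x):
    """Instrumented toy program: returns (result, trace of branches)."""
    trace = []
    if x * 2 > 10:                        # branch b0
        trace.append(("x*2>10", True))
        if x == 7:                        # branch b1
            trace.append(("x==7", True))
            return "bug", trace
        trace.append(("x==7", False))
        return "big", trace
    trace.append(("x*2>10", False))
    return "small", trace

def solve(pred_outcomes):
    """Stand-in solver: brute-force an input matching the requested
    branch outcomes, or None if the flipped path is infeasible."""
    preds = {"x*2>10": lambda x: x * 2 > 10, "x==7": lambda x: x == 7}
    for x in range(-100, 100):
        if all(preds[p](x) == want for p, want in pred_outcomes):
            return x
    return None

seen, worklist, inputs = set(), [0], []
while worklist:
    x = worklist.pop()
    result, trace = program(x)
    inputs.append((x, result))
    path = tuple(trace)
    if path in seen:
        continue
    seen.add(path)
    for i in range(len(trace)):           # flip each branch of this path
        flipped = list(trace[:i]) + [(trace[i][0], not trace[i][1])]
        nx = solve(flipped)
        if nx is not None:
            worklist.append(nx)
# All three outcomes ("small", "big", "bug") are reached automatically.
```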


It also tackles the difficulties of pure symbolic execution, where constraints may be too difficult to solve. Table 1 shows some recent tools that have used symbolic execution.

TABLE 1 : Some recent tools that have used symbolic execution

Symbolic (Java)PathFinder : It is a test generation system based on the symbolic execution model, built on the NASA Java PathFinder (JPF), which initially used model checking. It implements generalized symbolic execution, which adds multi-threading and other capabilities in extension to classical symbolic execution. It has helped discover subtle bugs in several critical NASA systems.

DART : It combines random testing with symbolic execution, to maximize coverage while being more efficient as well. It is an example of combined symbolic and concrete, or concolic, execution. It was developed for C.

CUTE and jCUTE : It extends DART to handle multi-threaded programs. It can also perform pointer analysis and solve pointer constraints. CUTE is for C whereas jCUTE is for Java.

CREST : It is an open source tool designed for C. It is also extensible and has been used as a basis for several other tools.

SAGE : It is an automated whitebox fuzz tester that uses compositional symbolic execution. It extends unit testing to whole application testing.

Pex : It is a test generation tool for .NET code. It was developed at Microsoft as a Visual Studio tool.

EXE : It is a symbolic execution tool for C, written especially to check complex code and low level systems code. It introduces several optimizations to achieve considerable speed up.

KLEE : It is the successor of EXE, based on the LLVM framework. It extends EXE by considering environmental interaction and better memory management for storing states.

3.2 Compositional Symbolic Execution

Compositional symbolic execution is an attempt to extend concolic execution to make it more scalable by intelligently avoiding state space explosion. The seminal paper introducing this concept is [18]. The major concept used here is to generate summaries for individual functions. These include information such as the function's pre and post conditions. The summaries are generated in a top down manner, in the context of the caller-callee relationships between the functions. Thus if a function f() calls a function g(), then g()'s summary is generated first and used later when the function f() is being analyzed. By using such an approach, the overall complexity is lower than if both functions were analyzed together. With this new approach, the complexity may be curbed while at the same time not sacrificing code coverage.

For this approach, the authors of [18] have proposed an algorithm called SMART. They have reported that SMART is able to achieve the same level of coverage as the earlier DART algorithm. Figure 7 shows an experimental comparison of the number of runs with SMART and DART. SMART shows only a linear growth, as compared to DART, which may show exponential scaling in the number of possible execution paths. The major challenge is to generate the summaries intelligently, rather than naively generating all possible summaries, since we want to counter path explosion.

Fig. 7 : Experimental comparison of the number of runs with SMART and DART [18]

3.3 Automated Whitebox Fuzz Testing

Fuzz testing is a heavily used random testing method for detecting security vulnerabilities. It basically involves sending input data to the application, randomly mutating the input, and then testing the results. It is a form of black box testing. Whitebox fuzz testing [19] is an attempt to combine fuzz testing with symbolic execution, in an attempt to facilitate dynamic and automated test generation. It leverages off DART and other predecessors. Whitebox fuzz testing is much more efficient than the conventional black box methods. It can find several errors that are beyond the reach of the black box fuzzers.
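The summary idea of Section 3.2 can be sketched with a toy callee. The functions, the summary representation, and the caller below are our own illustrative inventions in the spirit of [18], not SMART's actual algorithm:

```python
# Function summaries, sketched: the callee g() is summarized once as a
# list of (precondition, effect) pairs; the analysis of the caller f()
# consults the summary instead of re-exploring g()'s body on every path.

# Summary of g(y) = absolute value of y: two path summaries.
g_summary = [
    (lambda y: y >= 0, lambda y: y),    # pre: y >= 0, post: ret == y
    (lambda y: y < 0,  lambda y: -y),   # pre: y < 0,  post: ret == -y
]

def analyze_f(candidate_inputs):
    """'Explore' f(x) = g(x) + 1 over candidate inputs, picking for each
    input the summary case whose precondition the path condition implies."""
    outcomes = set()
    for x in candidate_inputs:
        for pre, post in g_summary:
            if pre(x):
                outcomes.add(post(x) + 1)
    return outcomes

results = analyze_f([-3, 5])   # both summary cases of g() are exercised
```

The payoff is that g()'s paths are explored once, when the summary is built, rather than once per path through each of its callers, which is what curbs the multiplicative path explosion.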



Whitebox fuzzing introduces several new concepts to extend symbolic testing to whole applications. It tackles path explosion and other obstacles with a novel parallel state space search algorithm called the generational search algorithm. Thanks to other technological improvements, such as more advanced SMT solvers and concise constraint representations, it can be applied to programs with millions of lines of code, such as large file parsers.

The SAGE [20] fuzzer based on this technique has become a highly popular tool and is used regularly at Microsoft Corporation. It has saved millions of dollars and found several bugs, and it represents the largest ever use of an SMT solver. We examine SAGE in more detail in section 5.

4. Symbolic Execution : Major Challenges and Proposed Solutions

Several challenges remain areas of concern for symbolic execution. Some of these are listed below:
1) How to extend symbolic execution to handle complex and recursive data structures and loops?
2) How to handle multi-threaded and nondeterministic programs with symbolic execution?
3) How to extend the constraint solver to solve string constraints and support other native code features?

A good survey of these and other challenges is presented in [8]. Here we focus on the two major issues that have become a point of focus for researchers:
1) How to curb the path space explosion?
2) How to optimize constraint solving?

We present three different approaches that attempt to overcome one or both of these hurdles.

4.1 RWSet Analysis Based Path Pruning

The idea of RWSet analysis for countering path explosion was proposed in [21].

4.1.1 Approach

The authors make symbolic execution more scalable by using Read-Write Set (RWSet) analysis to discard every path in the execution tree that will produce the same effect as a previously executed path. Two major ideas are used to prune paths. First, whenever an execution reaches a program point in exactly the same state as a previous execution, it should be pruned. Second, if an execution differs from a previous execution only in values that will never be read again, it cannot cause a different execution flow, so it too should be discarded. By truncating the suffixes of all such paths, significant gains are expected, because at each branch point these suffixes could spawn an exponential number of paths.

4.1.2 Implementation

A constraint cache records the states in which each program point was reached. If the currently executing path ever gets a cache hit, its execution can be aborted. A write set stores only the difference of each concrete state from a common initial state, since storing complete states may be too expensive. A read set maintains the values that will still be read beyond the current program point; all other values are excluded from the comparison. The cache must also be call-site sensitive, since otherwise a state stored in the cache for one function call may lead us to erroneously discard a path in some completely unrelated function call.

4.1.3 Evaluation

The tool EXE [10] was run on several benchmarks, both with and without RWSet. RWSet significantly reduced the number of paths explored to achieve the same coverage: in all cases fewer than half the paths were explored, and in a few cases the number dropped to as low as 11% of the original EXE. The overhead of enabling RWSet was also very low, approximately 4%.

4.2 Parallel Symbolic Execution

Parallelizing symbolic execution has been proposed in [22].

4.2.1 Approach

The paper proposes to reduce the time spent exploring paths of the symbolic execution tree by means of parallelization. The authors use a set of preconditions to partition the execution tree and distribute the symbolic execution among different workers. The partitioning is done such that each worker can independently execute its subtree without any interprocess communication (IPC); this is important because the overhead of IPC could otherwise completely negate the gains from parallelization. Initially, a shallow form of symbolic execution is run to generate the preconditions that partition the execution tree. The authors call this Simple Static Partitioning.

4.2.2 Implementation

The authors parallelize symbolic execution in a generic manner so that it applies to different environments such as multi-core computers, grids and clouds. Simple Static Partitioning is used to distribute work among the workers. First, an initial set of constraints is generated via shallow symbolic execution. It is important that this set of constraints be disjoint, complete and useful, which ensures that each worker does useful work that does not overlap with any other worker. The resulting partitions are stored in a queue. The user can control both the depth of the initial symbolic execution and the queue size; both are critical factors for the success of the algorithm. The size of the constraint queue determines how effective the load balancing will be, whereas the initial depth determines the quality of the generated partitions.
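As a toy illustration of this partitioning idea, the sketch below (our own simplified model, not code from [22]; the function names are ours) enumerates path-condition prefixes of a binary decision tree up to a user-chosen initial depth, queues them, and lets each worker expand its subtree independently. Because the prefixes are disjoint and complete, every full path is explored by exactly one worker and no interprocess communication is needed.

```python
from itertools import product
from queue import Queue

def shallow_partition(depth):
    """Shallow symbolic execution: enumerate all path-condition
    prefixes up to `depth`. The prefixes are disjoint and complete:
    every full path extends exactly one prefix."""
    return list(product([False, True], repeat=depth))

def explore_subtree(prefix, total_depth):
    """A worker independently explores every full path extending
    its prefix; no communication with other workers is needed."""
    rest = total_depth - len(prefix)
    return {prefix + tail for tail in product([False, True], repeat=rest)}

TOTAL_DEPTH = 5      # depth of the (toy) symbolic execution tree
INITIAL_DEPTH = 2    # user-controlled depth of the shallow run

work_queue = Queue() # user-controlled queue of constraint prefixes
for prefix in shallow_partition(INITIAL_DEPTH):
    work_queue.put(prefix)

explored = []
while not work_queue.empty():
    explored.append(explore_subtree(work_queue.get(), TOTAL_DEPTH))

# The partitions are disjoint and together cover the whole tree.
all_paths = set().union(*explored)
assert len(all_paths) == 2 ** TOTAL_DEPTH
assert sum(len(s) for s in explored) == len(all_paths)
```

In this model, increasing `INITIAL_DEPTH` produces more, finer-grained partitions in the queue, mirroring the trade-off described above between load balancing and partition quality.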

CSI Journal of Computing | Vol. 2 • No. 4, 2015 Dhiren Patel, et.al. R2 : 21

4.2.3 Evaluation

While the paper cites several detailed metrics for evaluation, the results in a nutshell show that a speed-up is possible on all systems. The results also show that the number of parallel workers (NPW) and the constraint queue size affect the speed-up: there was a 90x speed-up in analysis time with NPW = 128, and a 70x speed-up in automatic test generation with NPW = 64.

4.3 Memoized Symbolic Execution

In the final approach we present, the authors [12] use memoization to counter both the path explosion problem and the constraint solving cost.

4.3.1 Approach

The main idea is that symbolic execution is usually run several times; for instance, whenever an error is detected, the program is modified and then re-run. The authors attempt to leverage the information gained during earlier runs. To this end they maintain a trie structure that efficiently records the states of a particular execution. These computations can be re-used during the next run: for example, paths that were not useful in prior runs may be pruned, and the constraint solver may be turned off for constraints that were previously solved. At each run the trie structure must also be appropriately updated.

4.3.2 Implementation

The trie stores the information from each run. A new trie node is created whenever a branch is encountered during symbolic execution. Each node is very lightweight, storing just the method, the instruction offset and the choice taken. The trie is first initialized; during the first execution it is constructed to guide future executions. The nodes in the trie are split into bound nodes and unsatisfiable nodes, and the trie can later be searched using either BFS or DFS. During the memoized analysis phase, the trie is loaded into memory. It guides the execution, and it is appropriately modified and also compressed to remove features irrelevant in the current context. Finally, in the trie merging phase, the compressed trie may optionally be merged with the older trie to obtain the complete trie.

4.3.3 Evaluation

It was found that the savings obtained depend on how and where the program is changed, e.g. during regression analysis. In each execution, however, it was observed that memoization produced a dramatic drop in the number of calls to the solver. While this did not translate into significant time gains for simple-to-analyze programs, the authors predict that memoization may be very useful for critical and complex programs.

5. How applicable is Static Analysis in the real world?

In this section, we discuss three static analysis tools to see how useful they actually are in practice.

5.1 The FindBugs Experience

FindBugs is an open source static analysis tool that was developed at the University of Maryland. In 2009, Google held a large scale FindBugs Fixit, involving over 300 engineers reviewing thousands of warnings, as an attempt to evaluate how effective FindBugs would be in practice. Over the course of two days, it was found that static analysis does catch mistakes, but most of them are not very important: static analysis may at best catch about 5% to 10% of software quality problems. It is, however, cheaper, and if used at an earlier stage static analysis can reduce the development cost. Overall, the results of the review were disappointing, as most of the issues reported by the tool were low priority and did not cause any significant misbehavior.

5.2 Coverity: Static Analysis in the Real World

In this popular paper [23], downloaded a record number of times from the CACM website, the authors provide some interesting analyses of why static analysis tools fail to do well in practice.
•	First, to make a tool popular, some sort of trial run or demo must be given to customers. Many times the customers do not allow the tool to access some of their code for safety and privacy reasons. The tool is thus unable to find the serious or important bugs and fails to convince the potential customers.
•	Many times the compilers that the company uses to build its code are in fact buggy. They often deviate from the laid-down standards for languages like C or Java, which leads to totally unexpected behavior in some cases when the tool is used alongside such compilers. Companies are nevertheless unwilling to change their development environment, which forces some ugly workarounds.
•	Developers often do not take the bugs reported by the tool seriously. If they cannot understand a reported bug, they pass it off as a false positive rather than trying to analyze the code base.
•	The bugs reported by two or more static analysis tools often differ, and the static tools themselves are often erroneous, so developers do not have much faith in them.
•	If the tool reports more than 30% false positives, people generally tend to ignore it.
•	Programmers often write all sorts of junk code; the tool must be designed to take even the most foolish errors into consideration.

5.3 SAGE : Whitebox Fuzz Tester

In this paper [20], again from the CACM, the picture is remarkably different and rosier than in the previous two experiences. We have already described whitebox fuzzing in an earlier section. SAGE, the tool designed based on this technique,


has made a massive impact in the development scenario at Microsoft. We list some of these achievements below:
•	SAGE has been running 24/7 on 100+ machines since 2008.
•	Since SAGE has extended testing to the whole application level, it has found hundreds of bugs in apps, media players, etc. that were missed by everything else.
•	It has found 33
•	Due to SAGE, millions of dollars otherwise lost to bugs and patches have been saved.
•	It holds the record for the largest computational use of SMT solvers.
•	SAGE is today used daily in many groups at Microsoft. It is easy to deploy and fully automated.

This shows that when static analysis is combined with symbolic execution, there may be a dramatic rise in applicability.

6. Conclusion & Research Directions

In this paper, we discuss program testing and verification, focusing on symbolic execution. We explore two tools, KLEE and Kudzu, that apply symbolic execution; next we look at the upcoming paradigms in symbolic execution, concolic and compositional execution. Finally, we highlight major challenges to symbolic execution and three possible solutions to address them, namely parallelization, memoization and pruning. Research directions from here point toward improving the scalability of symbolic execution and toward heuristic searches that will help explore more interesting program states. Another interesting area would be to develop more generalized and powerful constraint solvers.

References
[1]	J. C. King, "Symbolic execution and program testing," Commun. ACM, vol. 19, no. 7, pp. 385-394, Jul. 1976.
[2]	F. Nielson, Principles of Program Analysis. Berlin/New York: Springer, 1999.
[3]	S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[4]	P. Cousot, "Abstract interpretation in a nutshell," 7 October 2012. [Online]. Available: http://www.di.ens.fr/cousot/AI/IntroAbsInt.html
[5]	F. J, "Symbolic execution," University of Maryland at College Park. [Online]. Available: http://www.cs.umd.edu/class/fall2012/cmsc498L/materials/se.pdf
[6]	E. J. Schwartz, T. Avgerinos, and D. Brumley, "All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask)," in Proceedings of the IEEE Symposium on Security and Privacy, 2010.
[7]	C. Cadar, P. Godefroid, S. Khurshid, C. S. Păsăreanu, K. Sen, N. Tillmann, and W. Visser, "Symbolic execution for software testing in practice: preliminary assessment," in Proceedings of the 33rd International Conference on Software Engineering, ser. ICSE '11. ACM, 2011, pp. 1066-1071.
[8]	C. S. Păsăreanu and W. Visser, "A survey of new trends in symbolic execution for software testing and analysis," Int. J. Softw. Tools Technol. Transf., vol. 11, no. 4, pp. 339-353, Oct. 2009.
[9]	C. Cadar, D. Dunbar, and D. Engler, "KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '08. USENIX Association, 2008, pp. 209-224.
[10]	C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler, "EXE: Automatically generating inputs of death," ACM Trans. Inf. Syst. Secur., vol. 12, no. 2, pp. 10:1-10:38, Dec. 2008.
[11]	C. Cadar, "KLEE: Effective testing of systems programs," presentation, Stanford University, 16 April 2009. [Online]. Available: www.doc.ic.ac.uk/cristic/talks/kleestanford-2009.ppsx
[12]	G. Yang, C. S. Păsăreanu, and S. Khurshid, "Memoized symbolic execution," in Proceedings of the 2012 International Symposium on Software Testing and Analysis, ser. ISSTA 2012. ACM, 2012, pp. 144-154.
[13]	P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song, "A symbolic execution framework for JavaScript," in Security and Privacy (SP), 2010 IEEE Symposium on, May 2010, pp. 513-528.
[14]	R. Majumdar, I. Saha, K. C. Shashidhar, and Z. Wang, "CLSE: closed-loop symbolic execution," in Proceedings of the 4th International Conference on NASA Formal Methods, ser. NFM '12. Springer-Verlag, 2012, pp. 356-370.
[15]	C. Păsăreanu, P. Mehlitz, D. Bushnell, K. Gundy-Burlet, M. Lowry, S. Person, and M. Pape, "Combining unit-level symbolic execution and system-level concrete execution for testing NASA software," in Proceedings of the 2008 International Symposium on Software Testing and Analysis. ACM, 2008, pp. 15-26.
[16]	K. Sen, "Concolic testing," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering. ACM, 2007, pp. 571-572.
[17]	P. Godefroid, N. Klarlund, and K. Sen, "DART: directed automated random testing," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. ACM, 2005, pp. 213-223.
[18]	P. Godefroid, "Compositional dynamic test generation," in ACM SIGPLAN Notices, vol. 42, no. 1. ACM, 2007, pp. 47-54.
[19]	P. Godefroid, M. Levin, D. Molnar et al., "Automated whitebox fuzz testing," NDSS, 2008.
[20]	P. Godefroid, M. Y. Levin, and D. Molnar, "SAGE: Whitebox fuzzing for security testing," Queue, vol. 10, no. 1, pp. 20:20-20:27, Jan. 2012.
[21]	P. Boonstoppel, C. Cadar, and D. Engler, "RWset: Attacking path explosion in constraint-based test generation," Tools and Algorithms for the Construction and Analysis of Systems, pp. 351-366, 2008.
[22]	M. Staats and C. Păsăreanu, "Parallel symbolic execution for structural test generation," in Proceedings of the 19th International Symposium on Software Testing and Analysis, ser. ISSTA '10. ACM, 2010, pp. 183-194.
[23]	A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. Engler, "A few billion lines of code later: using static analysis to find bugs in the real world," Commun. ACM, vol. 53, no. 2, pp. 66-75, Feb. 2010.
[24]	A. Chaudhuri and J. S. Foster, "Symbolic security analysis of Ruby-on-Rails web applications," in Proceedings of the 17th ACM Conference on Computer and Communications Security, ser. CCS '10. ACM, 2010, pp. 585-594.
[25]	P. Godefroid, A. Kiezun, and M. Y. Levin, "Grammar-based whitebox fuzzing," in Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '08. New York, NY, USA: ACM, 2008, pp. 206-215.
[26]	B. Johnson, "A study on improving static analysis tools: why are we not using them?" in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE 2012. IEEE Press, 2012, pp. 1607-1609.
[27]	N. Ayewah and W. Pugh, "The Google FindBugs fixit," in Proceedings of the 19th International Symposium on Software Testing and Analysis, ser. ISSTA '10. New York, NY, USA: ACM, 2010, pp. 241-252.

About the Authors

Dr. Dhiren Patel is currently a Professor of Computer Engineering at NIT Surat. He has more than 20 years of experience in academia, R&D and secure ICT infrastructure design. He authored the book "Information Security: Theory & Practice", published by Prentice Hall of India (PHI) in 2008, and chaired the IFIP Trust Management VI conference in 2012. He has been a Visiting Professor at the University of Denver, Colorado, USA (2014) and at IIT Gandhinagar, India (2009-11).

Madhura Parikh received her B.Tech in Computer Engineering from NIT Surat (2009-13) and her Master's in Computer Science from The University of Texas at Austin (2013-14). Her interests are in computer systems, machine learning and artificial intelligence. She is also an open source enthusiast.

Reema Patel is with the Computer Engineering Department, NIT Surat, where she is pursuing her Ph.D. She received her B.E. and M.Tech. degrees in Computer Engineering from NIT Surat in 2006 and 2010 respectively. Her main research interests are formal verification, modeling and analysis methods for stochastic systems, and information security.

LFTL: A multi-threaded FTL for a Parallel IO Flash Card under Linux

Srimugunthan1, K. Gopinath2 and Giridhar Appaji Nag Yasa3

1 Indian Institute of Science, Email: [email protected] 2 Indian Institute of Science, Email: [email protected] 3 NetApp, Email: [email protected]

New PCI-e flash cards and SSDs supporting over 100,000 IOPS are now available, with several use cases in the design of high performance storage systems. Large capacities are achieved by using an array of flash chips arranged in multiple banks, and such multi-banked architectures allow parallel read, write and erase operations. In a raw PCI-e flash card, this parallelism is directly available to the software layer. In addition, the devices have restrictions: for example, pages within a block can only be written sequentially, and the minimum write sizes are larger (>4KB). Current flash translation layers (FTLs) in Linux are not well suited for such devices because of the high device speeds, the architectural restrictions, and other factors such as high lock contention. We present an FTL for Linux that takes the hardware restrictions into account and exploits the parallelism to achieve high speeds. We also leverage the parallelism for garbage collection by scheduling garbage collection activities on idle banks, and we propose and evaluate an adaptive method to vary the amount of garbage collection according to the current I/O load on the device.

1. Introduction

Flash memory technologies include NAND flash, NOR flash, SLC/MLC flash memories, and their hybrids. Because of their cost, speed and power characteristics, flash memory technologies can be used at different levels of the memory hierarchy.

However, flash memory has some peculiarities, such as that a write is possible only on a fully erased block. Furthermore, a block can be erased and rewritten only a limited number of times. Currently, the granularity of an erase (in terms of blocks) is much bigger than that of a write or read (in terms of pages). As the reuse of blocks depends on the lifetimes of data, different blocks may be rewritten a different number of times; hence, the reliability of a block can vary unless special wear leveling algorithms are used.

Flash memory chips were predominantly used in embedded systems for handheld devices. With recent advances, flash has graduated from being only a low performance consumer technology to also being used in the high-performance enterprise domain. Presently, flash memory chips are used as extended system memory, as a PCI express card acting as a cache in a disk based storage system, or as an SSD drive completely substituting disks.

Multiple terabyte-sized SSD block devices made with multiple flash chips are presently available. New PCI-e flash cards and SSDs supporting high IOPS rates (e.g. greater than 100,000) are also available. Due to limitations in scaling the size of flash memory per chip, parallelism is an inherent feature in all of these large flash storage devices.

Flash memory scaling: IC fabrication processes are characterized by a feature size, the minimum size that can be marked reliably during manufacturing. The feature size determines the surface area of a transistor, and hence the transistor count per unit area of silicon. As the feature size decreases, the storage capacity of a flash memory chip increases. But as the process feature size is reduced, problems arise, such as bits not being stored reliably. Sub-32nm flash memory scaling has been a challenge in the past. Process innovations and scaling by storing multiple bits per cell have enabled bigger flash sizes per chip. As of this writing, the minimum feature size achieved is 19nm, resulting in an 8GB flash chip storing 2 bits per cell. But storing multiple bits per cell adversely affects the endurance of the flash [1], bringing the maximum erase cycles per block down into the range of 5,000 to 10,000. It has also been shown that scaling by packing more bits per cell degrades the write performance [10]. Moreover, the bus interface of a single NAND chip is usually a slow asynchronous interface operating at 40MHz. So, in order to achieve larger capacities and faster speeds, multiple flash memory chips are arrayed together and accessed simultaneously. Such architectures are presently used in SSDs [7].

SSD architecture and interface: An SSD package internally has its own memory (usually battery backed DRAM), a flash controller and firmware that implements some basic wearlevelling algorithms and exposes a block interface. Hard disks can be readily substituted by SSDs, because the internal architecture of SSDs is hidden behind the block

interface. Because of the blackbox nature of SSDs, some changes have been proposed in the ATA protocol interface, such as the new TRIM command, which marks a block as garbage as a hint from an upper layer, such as a filesystem, to the SSD [2].

Raw flash card vs SSDs: Though most flash is sold in the form of SSDs, there are also a few raw flash cards used in some scenarios. Such flash cards pack hundreds of gigabytes into a single PCI-e card with an SSD-like architecture, but without any firmware for wearlevelling or any on-board RAM.

The flash card that we studied is made of several channels or interfaces. Several flash chips make a bank, and several banks make a channel or interface. It is possible to do writes or reads simultaneously across banks. The flash card has an in-built FPGA that supports a DMA write size of 32KB and a DMA read size of 4KB. Pages within a block can only be written sequentially, one after another.

It is important to mention that we have used a raw flash card, and not an SSD, for our study.

Flash translation layer: The management of raw flash can be implemented either within a device driver or in other software layers above it. The layer of software between the filesystem and the device driver that does the flash management is called the Flash translation layer (FTL). The main functions of the FTL are address mapping, wearlevelling and garbage collection. Working in the context of Linux, we have designed a flash translation layer for a raw flash card which exploits its parallel IO capabilities. In addition to the main functionalities, our FTL also caches data before writing it to the flash.

The following are the main contributions of this work:
•	We present a design of a Flash translation layer (FTL) under Linux which can scale to higher speeds. Our FTL also copes with device limitations like larger minimum write sizes by making use of buffering inside the FTL.
•	We exploit the parallelism of the flash card with respect to block allocation, garbage collection and initial device scanning. The initial scan is time consuming for larger flash; hence exploiting parallelism here is useful.
•	We give an adaptive method for varying the amount of on-going garbage collection according to the current I/O load.

Section 2 presents the background with respect to flash memories, flash filesystems, FTLs and the flash card used. Section 3 describes the design of our FTL. Section 4 presents the results. Section 5 discusses some issues in our FTL and future directions of work, and Section 6 concludes.

2. Background

Flash memory has a limited budget of erase cycles per block. For SLC flash memories it is in the hundreds of thousands, while for MLC flash it is in the tens of thousands.

Wearlevelling is a technique that ensures that all blocks use up their erase cycle budget at about the same rate. Associated with wearlevelling is garbage collection. Wearlevelling and garbage collection have three components:
•	Out-of-place updates (referred to as dynamic wearlevelling): a rewrite of a page is written to a different page on the flash, and the old page is marked as invalid.
•	Reclaiming invalidated pages (referred to as garbage collection): to make the invalid pages writable again, the block with invalid pages has to be erased. If the block still has some valid pages, they have to be copied to another block.
•	Moving older unmodified data to other blocks of similar lifetimes (referred to as static wearlevelling).

Because of the out-of-place update nature of the flash media, there is a difference between the logical address that a filesystem uses and the physical address that is used for writing to the flash. This logical-to-physical address translation is provided by the flash translation layer.

Note that both garbage collection and static wearlevelling incur some extra writes. The metric write amplification quantifies this overhead: it is the ratio of the total number of writes that happen on the flash to the number of user writes. A good wearlevelling algorithm minimises the write amplification as much as possible.

When garbage collection and static wearlevelling are done during idle time, when there is no I/O happening on the system, the overhead is only the extra writes and erases. But under a heavy, continuous IO load, when the amount of free space drops low, it is necessary to invoke garbage collection as part of a user write to create a fresh supply of free blocks. In this case garbage collection increases the latency of every write request. A policy decides when garbage collection is to be triggered and how much space should be freed in one invocation. Based on these goals, several interesting algorithms have been proposed; reference [9] is an excellent survey of flash memory related algorithms.

2.1 FTLs and Flash file systems

The FTL exposes a block device interface so that any disk file system can be used. The FTL does out-of-place updates on the flash by maintaining a table of logical to physical address translation entries. Different FTLs exist depending on how the translation table is maintained; some well known FTLs are NFTL [3], FAST FTL [14], and DFTL [11].

Instead of using FTLs, there are popular flash specific filesystems such as yaffs2, logfs and UBIFS that implement some or all of the FTL functionalities, like wearlevelling and garbage collection, in the filesystem itself. We next discuss the flash card that we used.

2.2 Flash Card Architecture

The flash card studied is of the type used for I/O acceleration, by caching recently used user data and

metadata in a SAN/NAS storage system. The flash card is of size 512GB.

The flash card is made of 4 interfaces or channels, with 16 banks per interface and 8 data chips + 1 parity chip per bank. The chips are SLC NAND chips of 1GB size, with a maximum of 100,000 erase cycles as the endurance limit.

The card has a built-in FPGA controller that provides DMA channels for reads, writes and erases. A single DMA write of 32KB is striped equally among all 8 chips in a bank, with the parity generated in the ninth chip. The DMA read size is 4KB, and it reads the data striped across the 8 chips. A DMA erase turns 2MB of data across the 8 chips into all ones. For every 512 bytes, an 8-byte BCH checksum is generated by hardware, and an extra 8 bytes per 512 bytes is available for user spare data; so as much as 512 spare bytes per 32KB page is available for storing user metadata. The device has programmable registers accessed through a PCI interface. We have written a driver for this device that exposes a 512GB mtd device with a write size of 32KB and an erase size of 2MB.

The device offers parallel read, write and erase support across banks. There are 32 DMA read queues, 4 DMA write queues and 4 DMA erase queues. For the write and erase queues, the device driver assigns one of the four queues to each of the 4 interfaces and queues requests correspondingly. The hardware FPGA controller services these queues in round robin fashion. For reads, the device driver assigns 8 queues per interface, each queue serving 2 successive banks of an interface. Each queue can hold 256 requests. When a request is completed, the device responds by writing a completion descriptor into a completion queue of 1024 entries. When requests proceed in parallel, completion descriptors may be received out of order.

The driver also has rudimentary bad block management. Bad blocks are identified initially and stored in an on-disk file on the host system. During initialisation, the driver reads from this file and constructs its in-memory bad block table.

2.3 Parallelism in the flash device

A flash based SSD has many levels of parallelism. First, there are many channels. Each channel is made of flash packages, each of which consists of several flash chips. Within a single flash chip there are multiple dies, and finally there are several flash planes within a single die. The flash chips in our device are mostly connected using a 40MHz bus. The operating bus speed increases as we move from the inner flash chips to the outer flash packages. The flash chips can be simultaneously read into internal buffers, and transfers from these internal buffers happen at faster bus speeds. Because of this hierarchical architecture of channels, flash packages, chips, dies and planes, there are several levels of parallelism available.

At the lowest level, a feature called the multiplane command allows 2 operations of the same type (read, write or erase) on planes within a die to happen concurrently. Two dies within a chip share the IO bus in an interleaved fashion: the interleave command executes several page read, page write, block erase and multi-plane read/write/erase operations in different dies of the same chip in a pipelined fashion. At the highest level, two requests targeted to 2 different channels or flash packages can execute simultaneously and independently.

Although this parallelism is known to the NAND controller inside an SSD, the amount of parallelism that is exposed directly to software is very limited. Usually, SSD parallelism is exploited indirectly, for example by writing in larger sizes.

In the flash card that we used, there is a hierarchy of channels, banks and chips. The hardware FPGA employs DMA reads and writes that stripe the data within the chips. After the driver enqueues the read/write/erase requests in the DMA queues, the hardware FPGA services these queues in a round-robin fashion. So the exposed parallelism is at the bank and channel levels.

We first wrote a driver for the flash card that exposes the linux-mtd interface. Over the mtd driver, we made measurements writing to all the banks from multiple threads.
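To make the striping and spare-area arithmetic described above concrete, the sketch below models how one 32KB DMA write is split across the 8 data chips of a bank with a parity stripe for the ninth chip. This is our own illustrative model, not the card's firmware: a bytewise XOR stands in for the card's actual parity scheme, which we do not specify here, and all names are ours.

```python
from functools import reduce

DMA_WRITE = 32 * 1024   # bytes in one DMA write (one 32KB page)
DATA_CHIPS = 8          # data chips per bank; a ninth chip holds parity
SECTOR = 512            # bytes covered by one 8-byte BCH checksum
SPARE_PER_SECTOR = 8    # extra user-spare bytes per 512-byte sector

def stripe(buf):
    """Split one 32KB DMA write into eight equal 4KB stripes (one per
    data chip) and derive a parity stripe for the ninth chip. The
    bytewise XOR here is illustrative, not the card's real scheme."""
    assert len(buf) == DMA_WRITE
    chunk = DMA_WRITE // DATA_CHIPS              # 4096 bytes per chip
    stripes = [buf[i * chunk:(i + 1) * chunk] for i in range(DATA_CHIPS)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))
    return stripes, parity

stripes, parity = stripe(bytes(range(256)) * 128)
assert len(stripes) == 8 and all(len(s) == 4096 for s in stripes)
assert len(parity) == 4096

# Spare-area arithmetic from the text: 8 spare bytes per 512-byte
# sector gives 512 user-spare bytes per 32KB page.
assert (DMA_WRITE // SECTOR) * SPARE_PER_SECTOR == 512
```

The same arithmetic explains the mtd geometry exposed by our driver: the 32KB write size is the unit that the FPGA can stripe across one bank in a single DMA operation.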

Fig. 1: Flash card hardware architecture

Apart from the top-level architecture of channels, flash packages, chips, dies and planes, there are several levels of parallelism within a channel. At the lowest level, a feature called the multiplane command allows two operations of the same type (read, write or erase) on planes within a die to happen concurrently. Two dies within a chip share the IO bus in an interleaved fashion; the interleaving command executes several page read, page write, block erase and multi-plane read/write/erase operations in different dies of the same chip in a pipelined fashion. At the highest level, two requests targeted to two different channels or flash packages can execute simultaneously and independently.

Although this parallelism is known to the NAND controller inside an SSD, the amount of parallelism that is exposed to software directly is very limited. Usually, SSD parallelism is exploited indirectly, for example by writing in larger sizes.

In the flash card that we used, there is a hierarchy of channels, banks and chips. The hardware FPGA employs DMA read or write that stripes the data within the chips. After the driver enqueues the read/write/erase requests in the DMA queues, the hardware FPGA services these queues in a round-robin fashion. So the exposed parallelism is at the bank and channel levels.

Our measurements showed that the read and write speeds can go above 300MB/sec when the banks are utilised simultaneously. When the writes are targeted to banks across channels, we get higher read and write speeds compared to when the writes are targeted to banks within a channel. In our design of the FTL, for simplicity and generality, we assume only bank-level parallelism in the flash card.

3 Design of the Flash Translation Layer

We first discuss the performance bottlenecks in the present design of Linux FTLs. We then discuss the design we adopted, along with the write and read algorithms and the buffer management that make use of the parallelism of the flash card. We further exploit the parallelism for the garbage collection activity.

3.1 Performance Bottleneck in Linux FTLs

Current implementations of FTLs in Linux, like NFTL and mtdblock, use the mtd_blkdevs infrastructure. mtd_blkdevs registers a new block device and a corresponding request processing function. This infrastructure is used by any FTL that has its own implementation for reading and writing sectors from the device.

Fig. 2: mtdblkdevs infrastructure in Linux

It is to be noted that NFTL and mtdblock are pseudo block devices. In Linux, a block device's request processing function is not supposed to sleep. However, the threads that call the mtd APIs can sleep, and so mtd_blkdevs does the request processing from a separate kthread. As shown in Figure 2, the kernel thread receives the IO requests from the block layer's request queue and calls NFTL's or mtdblock's write/read sector code; the actual reading or writing of the IO request is done by that sector code. The kernel thread sleeps when the request queue is empty. When a new request is enqueued, the mtd_blkdevs request processing function wakes up the kernel thread, which does the actual processing.

The block layer provides one request queue for a registered device, and this request queue is used by the IO scheduling algorithms. The request queue is a doubly linked list, and it is used for coalescing and merging the bios (block IO requests) into much larger requests. The IO scheduling algorithms of the block layer are specifically targeted at hard disks, to minimise seek time; flash memories do not benefit from these optimisations. The merging and coalescing optimisations are also not useful, because our FTL uses a list of buffers to accumulate the IOs into larger sizes. In addition, using a single request queue means that we have to take a lock for dequeueing each IO request. Figure 3 shows the code snippet from mtd_blkdevs.c that acquires and releases the request queue lock. The problem of request queue lock contention as a hindrance to high-IOPS devices has been discussed in the Linux community [4]. These reasons motivated us not to use the block layer's request queue management.

Fig. 3: mtdblkdevs code showing lock contention

In the code snippet of Figure 3, a global device lock (dev->lock) is used, on the assumption that requests to the flash device can only be done serially, one after another. With the existing mtd_blkdevs infrastructure, the maximum IO bandwidth that we were able to obtain was only up to 55MB/sec. The large-capacity flash card can process several IO requests simultaneously on several banks. Since banks are the minimal unit of parallel IO, the global device-level lock should be split into per-bank locks.

We incorporated these changes in the design of our FTL, and Figure 4 shows the design that we followed. In our FTL, we maintain multiple FIFO queues of our own, and each of these queues has an associated kernel thread. Each kernel thread processes I/O requests from its corresponding queue, so these multiple 'FTL I/O kernel threads' provide parallel I/O capability.

The data structure corresponding to the block layer's I/O request is called a bio. The entry point into the Linux block I/O layer is the function bio_submit(), which receives the bio. We intercept these block I/O requests and direct them to one of our FIFO queues. Such interception of block I/O requests is also done for RAID devices, and the kernel exports an API for registering our custom make_request function.

The mtd APIs for read and write are synchronous calls. The use of multiple FIFO queues and multiple kernel threads, in addition to removing the lock contention, also allows several IOs to be in flight simultaneously, even though the mtd read and write calls are synchronous. The per-queue kernel thread calls the read/write sector functions of our FTL.

Unlike NFTL, we have exposed a 4KB block device size, but the minimum flash write size is 32KB. So the FTL uses a set of buffers, accumulates eight consecutive logical 4K sectors into a single 32K buffer, and finally writes it to the flash. The per-buffer size and the number of buffers in the FTL are module parameters. The per-buffer size should be at least the flash page size; the number of buffers is limited only by the amount of RAM in the host system. Currently we use 256 buffers, with a buffer size of 32K.


Fig. 4: FTL design

3.2 Write and Read algorithm

The main data structures that are used in our FTL are listed below.
• FreeBlksBitMap: indicates whether a given block is free or occupied. The FreeBlksBitMap data structure is protected by per-bank locks.
• BufferLookUpTable: stores the logical page number that is stored in a buffer. No lock is taken for reading or writing the buffer lookup table.
• BlkInfo: stores a bitmap indicating the valid pages in the block and the number of valid pages in the block.
• BankInfo: stores the number of free blocks and the number of valid pages in the bank. We used linux atomic bitmap operations and atomic variable updates for modifying blkinfo and bankinfo.
• MapTable: stores 4 bytes per entry for every page in the flash. For a 512GB flash and 32K page size, the map table is 64MB in size. We have used bit-locking, with a single bit protecting each entry of the maptable, implemented with the linux atomic bitmap operations. The space for the bitmap need not be allocated separately, as we can steal one bit from the map table entry.
• AllocBufBitmap: indicates that we are trying to allocate a buffer for a logical page. It is used for mutual exclusion, to prevent allocation of two buffers to the same logical page number.

3.2.1 FTL Buffer management

The buffers are either empty or allocated to some lpn. The buffers which are allocated to some lpn can be either fully dirty or partially dirty. We maintain two queues, one that stores buffer indexes for empty buffers and another that stores buffer indexes for full buffers. The buffer indexes that are in neither queue correspond to partially full buffers. We attempt first to allocate an empty buffer for an lpn. If that fails, we select a full buffer; if that fails too, we allocate a partially full buffer. We make use of the RCU-based lock-free queue implementation from the liburcu library [5] for the maintenance of the buffer index queues.

With the use of lock-free queues and by reducing the granularity of locks to the smallest units, there is no single central lock contention point in our FTL.

The write algorithm is given in Algorithm 1. Initially we search the buffer lookup table to see if the given lpn (logical page number) is present in any of the FTL cache buffers. If the lpn is available, we take the per-buffer lock and check again that the buffer corresponds to the lpn; this is similar to the double-checked locking programming idiom [13]. If the lpn is not in the FTL cache, then we have to allocate one of the buffers to this lpn. We set the bit in AllocBufBitmap for this lpn to indicate that we are trying to allocate a buffer for it. After taking the bit-lock, it is necessary to check again that there is no buffer for this lpn.

Algorithm 1 FTL write algorithm
procedure FTLWRITESECTOR(Logical sector number)
    lpn ← corresponding logical page number
    bufnum ← search bufLookUpTable for lpn
    if search was SUCCESS then
        WRITELOCK BUF[bufnum]
        confirm BUF[bufnum] corresponds to lpn
        write to BUF[bufnum]
        WRITEUNLOCK BUF[bufnum]
    else
        SET BIT IN AllocBufBitmap
        confirm lpn is not in bufLookUpTable
        bufnum ← select an Empty/Full/Partially Full buffer
        WRITELOCK BUF[bufnum]
        change bufLookUpTable[bufnum]
        if BUF[bufnum] is not an empty buffer then
            newtempbuf ← malloc(buffer size)
            swap(newtempbuf, BUF[bufnum])
            set flush = 1
        end if
        write to BUF[bufnum]
        WRITEUNLOCK BUF[bufnum]
        UNSET BIT IN AllocBufBitmap
        if flush == 1 then
            merge with flash, if BUF[bufnum] was partially full
            ppn ← get_physical_page()
            write newtempbuf to ppn
            change maptable
        end if
    end if
end procedure

The evicted full or partially full buffer has to be written to flash. For a partially full buffer, we have to do a read-modify-write, merging with the existing content from flash and writing to another page.

We also have a background thread, similar to a buffer flush daemon, that writes a buffer to the flash if it has not been read or written for more than 60 secs.

3.2.2 Physical page allocation

When we flush a buffer to the flash, a physical page to write to is selected. Depending on the number of banks, we maintain a 'current-writing block' for each bank. Within the current writing block the pages are written sequentially. get_physical_page in Algorithm 1 selects one of the banks and returns the next page of the 'current-writing block' in that bank. The selection of banks is based on the activities happening on the bank and is described below.

Finally, after the evicted buffer is written to flash, the map table is updated.

The read algorithm is similar. It proceeds by first checking whether the logical sector is in the FTL cache; if not, it reads the corresponding sector from the flash.

3.3 Garbage collection

We have implemented a greedy garbage collection algorithm that acts locally within a bank. We select dirty blocks with fewer than a minimum threshold of valid pages. The valid page threshold depends on the amount of available free blocks in a bank. There are three garbage collection levels, with different valid page thresholds for corresponding free block thresholds.
For the first level, the garbage collection algorithm selects only victim blocks with zero valid pages. For the subsequent levels, the "number of valid pages in a block" threshold is increased. For the first level of garbage collection, only the erase of victim blocks needs to be done; for the subsequent levels we need to do both copying and erase. The higher the level, the more copying needs to be done. Our garbage collection algorithm acts locally in the sense that valid pages are copied to another block within the same bank.

Garbage collection can be performed:
1. Before a write request is started: before servicing a write request, when a new block has to be allocated, we trigger the on-demand garbage collection, which creates a fresh supply of free blocks.
2. After a write request: in the case of some SSDs, garbage collection is scheduled to happen after the current write request is completed. This way, the current write request doesn't suffer the additional garbage collection latencies, but any subsequent write requests are delayed by the garbage collection overhead.
3. As a background thread that runs in idle time.

We do the garbage collection in separate kernel threads which run alongside the FTL I/O kernel threads. Garbage collecting threads try to select a bank for garbage collection which is not currently being written. Similarly, when the get_physical_page() algorithm selects a bank, it tries to select a bank that is not being garbage collected. So the garbage collection is scheduled in such a way that it does not interfere much with the I/O.

3.4 Adaptive Garbage collection

Parallelism of the flash card is exploited when the garbage collection is performed from multiple threads. Each garbage collecting thread can work on any one of the banks. With more threads active, more dirty blocks can be garbage collected. We control the amount of garbage collection performed by controlling the number of simultaneously active garbage collector threads. When the system is relatively idle, garbage collection is performed with full force, with all the garbage collector threads active. When I/O happens, we tone down the garbage collection by reducing the number of active garbage collecting threads. During heavy I/O, at most one garbage collector thread is active.

To implement the adaptive garbage collection, one of the garbage collecting threads takes on a master role. The amount of I/O happening in the system is determined from the number of FTL I/O threads that are asleep and the number that are active, so the status of the FTL I/O kthreads is monitored. Depending on the range of FTL I/O threads active, we define a range for the number of garbage collection threads that can be active. For example, when all the FTL I/O kthreads are active, the number of active garbage collector threads will be one, and only the master garbage collector thread will be running. But when the I/O in the system drops and falls into a different range, the master garbage collector thread wakes up a few of the other garbage collector threads. Every garbage collector thread performs one round of garbage collection on some available banks before it checks the status of the FTL I/O kthreads again. If, for the current number of active I/O threads, the number of GC threads is more than the permissible threshold, a non-master garbage collector thread will put itself to sleep.

The master garbage collector thread also makes sure that a bank with very few free blocks is compulsorily garbage collected in the near future. This is done by flagging the bank as exclusiveGC, which prevents the get_physical_page() algorithm from selecting the bank for writing. This way, we prioritise one of the banks for garbage collection.

3.5 Checkpointing

Similar to yaffs2, our FTL writes its in-memory data structures to the flash when the module is unloaded, and reads them back when it is loaded again. We write the main data structures, such as the maptable, the freeblks map, and the blkinfo and bankinfo data structures, to the flash. The blocks that hold this metadata are differentiated from the blocks that store normal data by a flag stored in the out-of-band area indicating the type of the block. When the module is loaded, the flash blocks have to be scanned for the checkpointed blocks. This block-level scanning can be sped up by parallelising it, reading the flash through several kernel threads during module load.

Actually we do not scan all the blocks, but only the top few and bottom few blocks of every bank. We maintain the checkpoint blocks as a linked list on the flash, with one checkpoint block pointing to the next. We impose the restriction that the first checkpoint block is stored in the top few or bottom few blocks of one of the banks in the FTL, so the parallelised scanning reads only these few blocks to identify the first checkpoint block. Since blocks in a bank are written sequentially, we are more likely to find a free block among the top or bottom few blocks. If none of the banks has a free block in its top or bottom few blocks, then we create a free block by moving the contents of an occupied block, and write the first checkpoint block there.

So during module load, after finding the first checkpoint block, the rest of the checkpoint blocks are found by following the linked chain.


4. Performance evaluation

In this section, we evaluate the resultant improvements of our parallelism-modified algorithms. The flash card is mounted in a PCI Express slot of an HP workstation machine, with a four-core Intel i5 processor running at 3.2GHz. The system has 4GB of memory. We used Linux kernel version 2.6.38.8 and the operating system is Ubuntu 10.04.

4.1 Number of Queues vs FTL Block device speed

We first give the measurements over the FTL-exposed block device without any filesystem. Table 1 shows the measurement of read and write speed for various numbers of FTL queues/threads. Having a single thread servicing a single request queue suppresses the hardware parallelism considerably. By increasing the number of FTL queues and associated threads, we were able to get write speeds upto 360MB/sec. Due to caching effects, write speeds are greater than the read speeds.

Table 1: Number of FTL Queues vs read-write speed

Number of Queues    Read speed     Write speed
1                   53 MB/sec      45 MB/sec
2                   99 MB/sec      91 MB/sec
4                   163 MB/sec     170 MB/sec
8                   210 MB/sec     282 MB/sec
16                  223 MB/sec     364 MB/sec

4.2 Filesystem measurement

Table 2 gives the filesystem read and write speed. We mounted the XFS filesystem over our FTL. The filesystem measurements are taken using tiobench. The results correspond to writing from 16 threads, with each thread writing 1GB. We see that the existing flash filesystems perform poorly for multithreaded workloads. Yaffs2 for a single thread was able to get upto 40MB/sec, but with an increasing number of threads, performance deteriorated. UBIFS was better, but still didn't achieve the speeds the device can support. We obtain good performance with our FTL used with the XFS filesystem. We have also shown results using our FTL with the existing mtd_blkdevs infrastructure to receive block IOs.

Table 2: Tiobench results

Filesystem used                                 Write speed   Read speed
yaffs2                                          5 MB/sec      4 MB/sec
UBIFS                                           60 MB/sec     55 MB/sec
XFS + FTL using mtdblkdevs framework            42 MB/sec     54 MB/sec
XFS + FTL using modified multi-queued framework 297 MB/sec    245 MB/sec

4.3 Module load measurement

Table 3: Initial scanning time

Scanning method used during module load             Time taken
parallel page level scan                            ~30 min
linear block level scan of checkpoint               ~2 min
parallel block level scan of checkpoint             ~28 sec
speed bounded parallel block level scan of checkpoint  ~2-3 sec

Table 3 gives the initialisation measurements for a 512GB flash card. We need to scan a few blocks at the beginning of each bank to find the first checkpoint block. As the checkpoint blocks are maintained as a linked list on flash, after finding the first checkpoint block we follow the chain. Our FTL load time is only a few seconds. Any mechanism that requires a linear block level scan, like UBI in the 2.6.38.8 kernel, takes 2 minutes. For Yaffs2, the time taken for module load depends upon the amount of data on flash. We observed 16 secs of module load time when the flash is more than 80% full. For failure scenarios, our FTL needs to do a page level scan of the flash. Doing a parallel scan of the pages on the banks still takes 30 minutes. A linear scan would take an inordinately long time, of a few hours.

4.4 Results for Garbage collection

The garbage collection measurements are taken for three policies:
• NPGC: Before a write is done on a bank, if the number of free blocks is low, we garbage collect that particular bank and then subsequently do the write.
• PLLGC: Garbage collection is always performed by parallel co-running garbage collector threads. The write algorithm tries, if possible, to select a bank which is not being garbage collected. Otherwise the write algorithm selects a random bank.
• PLLGC+Adaptive: As in the PLLGC policy, garbage collection is always performed by parallel co-running garbage collector threads. In addition, the number of garbage collector threads is varied according to the amount of I/O.

We use 64 FTL I/O kernel threads for the following measurements.

First, we compare the NPGC policy with the PLLGC policy. To avoid the long time to start garbage collection, we restrict the size of the flash to 8GB by reducing the number of blocks per bank to 64. Our garbage collection tests were done after modifying the FTL data structures to reflect a long flash usage period. We artificially made the blocks dirty by modifying the FTL's data structures through an IOCTL. The result of the data structure modification is shown in Figure 5 and Figure 6. Figure 5 shows the distribution of free blocks per bank, which is roughly normally distributed. Figure 6 shows the distribution of valid pages in a block.
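The multi-queue dispatch of Section 4.1 can be pictured with the following user-space sketch; the modulo striping rule and all names here are illustrative, not the authors' kernel code:

```c
#include <assert.h>

/* One request queue (serviced by one FTL thread) per flash bank.
 * Striping logical pages across banks sends consecutive requests to
 * different queues, so up to NUM_BANKS flash commands can be in
 * flight at once; a single shared queue would serialise them. */
enum { NUM_BANKS = 16 };

static int queue_len[NUM_BANKS];        /* stand-in for real queues */

static int bank_of(unsigned long logical_page)
{
    return (int)(logical_page % NUM_BANKS);
}

static void dispatch(unsigned long logical_page)
{
    queue_len[bank_of(logical_page)]++; /* enqueue to that bank's queue */
}
```

With this mapping, a burst of consecutive logical pages spreads evenly over the banks, which is what lets the per-bank threads exploit the card's hardware parallelism.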

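The checkpoint-chain walk behind the fast module load of Section 4.3 can be sketched as follows; flash is mocked with an in-memory array and the field names are illustrative:

```c
#include <assert.h>

/* Checkpoint blocks are kept as a linked list on flash: each records
 * the block id of the next.  Module load scans only the first few
 * blocks of every bank (in parallel) to find the head of the list,
 * then follows the chain instead of scanning the whole device. */
enum { NO_BLOCK = -1 };

struct block_meta {
    int next_checkpoint;    /* next checkpoint block id, or NO_BLOCK */
};

/* Follow the chain from its head; returns the number of checkpoint
 * blocks visited (a real FTL would replay their contents). */
static int load_checkpoints(const struct block_meta *flash, int head)
{
    int count = 0;
    for (int b = head; b != NO_BLOCK; b = flash[b].next_checkpoint)
        count++;
    return count;
}
```

Since the chain touches only the checkpoint blocks, load time is proportional to the number of checkpoints rather than to the device size.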
[Fig. 5: Distribution of free blocks per bank. The x-axis is the bank ids and the y-axis is the number of free blocks in the bank.]

[Fig. 6: Distribution of valid pages in occupied blocks.]

After the modifications of the data structures, the garbage collection measurements were taken. The maximum number of GC threads is set to 1 for this measurement. For this measurement, we do over-writes of 1GB of data upto 16 times from a single-threaded application. We show the latencies of the write requests from the single application thread. The application program does a fsync after every 4K to ensure that the data reaches the FTL device. Table 4 compares the NPGC and PLLGC policies, showing the scatter plot of write latencies, in microseconds, for every written 32KB. The x-axis in the scatter plot is the request ids and the y-axis is the write latency for that request.

The scatter plot shows that the number of requests that cross the 2ms boundary is much higher in the NPGC policy because of the GC overhead. When we do the garbage collection in parallel, the GC overhead is considerably removed. It is also important to notice that we were able to garbage collect roughly the same number of blocks without incurring the additional GC overhead, by scheduling the garbage collection activity on a parallel bank. The elapsed time is also shorter in the PLLGC version: the completion of the test took 546 seconds when garbage collection is done in the write path and only 400 seconds for the PLLGC policy.

Next, we provide the measurements comparing PLLGC with the PLLGC+Adaptive policy for the scenario of multithreaded applications.

Table 4: Comparison of garbage collection done before a write (NPGC) and parallel garbage collection (PLLGC)

                                     NPGC policy    PLLGC policy
Scatterplot of write latencies       (plot)         (plot)
Number of blocks garbage collected   8312 blocks    8392 blocks
Elapsed time                         546 sec        400 sec
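The PLLGC write-path rule (prefer a bank that is not being garbage collected, otherwise pick a random one) can be sketched as follows; the bitmap and names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

enum { NUM_BANKS = 16 };

/* One bit per bank, set while a co-running GC thread works on it. */
static unsigned int gc_busy;

/* PLLGC bank selection: start from a random bank and take the first
 * one that is not under garbage collection; if every bank is busy,
 * fall back to the random choice and accept the interference. */
static int pick_bank(void)
{
    int start = rand() % NUM_BANKS;
    for (int i = 0; i < NUM_BANKS; i++) {
        int b = (start + i) % NUM_BANKS;
        if (!(gc_busy & (1u << b)))
            return b;
    }
    return start;
}
```

Starting the search at a random bank keeps writes spread across banks instead of always draining the lowest-numbered free one.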

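The PLLGC+Adaptive policy varies the number of co-running garbage collector threads with the IO load. A hypothetical control rule (the thresholds below are ours, not taken from the paper) might look like:

```c
#include <assert.h>

enum { MAX_GC_THREADS = 8 };

/* Back off garbage collection as foreground IO load rises, so GC
 * reads and writes interfere less with application requests; reclaim
 * aggressively when the device is idle.  Thresholds are illustrative. */
static int gc_threads_for_load(int inflight, int max_inflight)
{
    int load_pct = 100 * inflight / max_inflight;
    if (load_pct > 75)
        return 1;
    if (load_pct > 25)
        return MAX_GC_THREADS / 2;
    return MAX_GC_THREADS;
}
```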
We set the maximum number of garbage collector threads to 8, and we did the evaluation by also reducing the number of banks on the flash to 8. We used a simulated workload with 128 application pthreads having a fixed think time. Each application pthread sleeps for 20ms after every 32K write and 10sec after every 2MB write, and each pthread writes 4MB of data 16 times. So in total, we write 8GB of data, and we measured the average write latency as observed from each pthread.

[Fig. 7: Normalised average write latency for the 128 threads. The x-axis is the thread ids. For every thread we calculate the average latency. The plot is made after normalising with respect to the minimum average latency.]

The plot in Figure 7 shows the normalised average latency as seen from the 128 application pthreads. The adaptive version performed better than the non-adaptive parallel garbage collector. Also, the total elapsed time for the test was 217 secs for the PLLGC+Adaptive policy, while the PLLGC policy took 251 secs to completion. This shows that by adaptively minimising the garbage collection specific reads and writes when the IO load is high, there is less interference, and the average latency per IO request can also be minimised.

5. Discussion

Recent works on parallelism with respect to SSDs are [15], [8], [12]. Compared to these previous papers, we have used a raw flash card instead of an SSD. While FTLs in SSDs are implemented in the flash controller, we have considered, in contrast, the implementation of a software FTL, which is a layer above the device driver. In the context of the Linux operating system, we have clearly shown the inadequacy of the present FTLs in Linux with respect to performance.

We next discuss some weaknesses of, and modifications required in, our present FTL implementation. With each written page on the flash, we store some metadata in the out-of-band area (spare bytes). We store the type of block (blocks that store checkpoint data or normal data), the corresponding logical page number, and a sequence number. The sequence number is similar to the one used in yaffs2 and is used for failure scenarios, when the FTL module is unloaded incorrectly. After an incorrect unload, the FTL has to do a page-level scan of the flash. Failure handling is important in battery operated embedded devices, but there are several usecases, other than embedded systems, for the kind of raw flash cards that we used. For failure handling, we also should implement the block layer barrier commands [6]. When the filesystem gives a barrier request, we have to empty the queues in our FTL and flush all of our FTL buffers to the flash.

The FTL cache size is a module parameter, and it is limited only by the amount of RAM in the system. So if we expect a very heavy random write workload, we can load the FTL module with a larger cache. It is also possible that, if we can detect random IO behaviour, the FTL can adaptively increase the cache size. We can also increase or decrease the per-buffer size. This kind of tunable control is not possible when using SSDs directly, but it is possible when using a raw flash card and a software FTL.

We have been liberal in the use of memory in our FTL, but it is possible to reduce the memory consumption considerably. The bits necessary for bit locking in our write algorithm can be stolen from the maptable entries. Though we have used a fully page based FTL in our implementation, it is possible to extend the purely page based FTL with the DFTL [11] algorithm and reduce the memory consumption of the mapping table.

There are many directions for future work. In addition to doing the garbage collection in parallel, it is also possible to do other activities like prefetching, static wear-levelling or de-duplication in parallel without affecting the write latencies. We used FIFO buffer replacement in our FTL, which is not ideal when there is temporal locality in the workload. One future work is to consider other locality-aware cache replacement policies. It would be interesting to consider how nonblocking buffer management schemes like [16] compare against conventional LRU schemes. In our design of the FTL, we have avoided using the block layer's request merging, coalescing and IO scheduling optimisations. One direction of future work is to look into the scheduling policies that can be used in our multi-queued design. It has been shown that by allocating blocks based on the hotness of data, we can improve garbage collection efficiency and hence the long-term endurance of the flash. We have to explore further whether the selection of the banks in which the blocks are allocated should also be based on hotness.

6. Conclusion

In conclusion, we have shown how an FTL can be designed for a multi-banked flash card. The modifications leverage the parallelism available in these devices to achieve higher performance. We show that the garbage collection overhead can be largely removed from the write path because of the available parallelism. In addition to using parallelism for greater speed, we also show that the initial scanning time required in flash can be reduced due to parallelism. We have also proposed an adaptive garbage collection policy that varies the amount of garbage collection according to the IO load, by controlling the number of active garbage collector threads.

7. Acknowledgements

Thanks to the authors of the open source flash filesystems (Yaffs2, UBIFS and logfs) for mailing list discussions. Special thanks to Joern Engel.

References

[1] http://www.anandtech.com/show/4043/micron-announces-clearnand-25nm-with-ecc.
[2] http://www.eggxpert.com/forums/thread/648112.aspx.
[3] Linux kernel source, drivers/mtd/nftlcore.c.
[4] http://lwn.net/Articles/408428.
[5] http://lttng.org/urcu.
[6] Linux kernel source, Documentation/block/barrier.txt.
[7] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design Tradeoffs for SSD Performance. USENIX Conference, 2008.
[8] F. Chen, R. Lee, and X. Zhang. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. HPCA, 2011.
[9] E. Gal and S. Toledo. Algorithms and data structures for flash memories. ACM Computing Surveys, 2005.
[10] L. M. Grupp, J. D. Davis, and S. Swanson. The Bleak Future of NAND Flash Memory. FAST, 2012.
[11] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings. ASPLOS, 2009.
[12] S. Park and K. Shen. FIOS: A Fair, Efficient Flash I/O Scheduler. FAST, 2012.
[13] D. C. Schmidt and T. Harrison. Double-Checked Locking: An Optimization Pattern for Efficiently Initializing and Accessing Thread-safe Objects. 3rd Annual Pattern Languages of Program Design conference, 1996.
[14] S. won Lee, W. kyoung Choi, and D. joo Park. FAST: An Efficient Flash Translation Layer for Flash Memory. Lecture Notes in Computer Science, 2006.
[15] S. yeong Park, E. Seo, J.-Y. Shin, S. Maeng, and J. Lee. Exploiting Internal Parallelism of Flash-based SSDs. Computer Architecture Letters, 2010.
[16] M. Yui, J. Uemura, and S. Yamana. Nb-GCLOCK: A nonblocking buffer management based on the generalized CLOCK. ICDE, 2010.

About the Authors

Srimugunthan completed his Masters from the Department of Computer Science and Automation, Indian Institute of Science, Bangalore. He holds a Bachelor's degree in Computer Science from Madras Institute of Technology, Chennai. His research interests are in the areas of Storage systems, Big Data and Machine Learning. He is currently working with Akamai Technologies.

K. Gopinath is a Professor in the Department of Computer Science and Automation, Indian Institute of Science, Bangalore. His research interests are in the areas of Operating Systems, Storage Systems, Systems Security and Verification. He is an associate editor of ACM Transactions on Storage.

Giridhar Appaji Nag Yasa holds a Bachelor's degree in Computer Science from Indian Institute of Technology, Guwahati. His research interests are in the areas of Storage systems, Scalable software systems architectures, and Network and Internet Protocols. He was part of the Advanced Technology Group at NetApp. He is also an official Debian Developer.

CSI Journal of Computing | Vol. 2 • No. 4, 2015 N. V. Narendra Kumar, et al. R4 : 35

Towards An Executable Declarative Specification of Access Control

N. V. Narendra Kumar1 and R. K. Shyamasundar2

1 Tata Institute of Fundamental Research. E-mail: [email protected] 2 Indian Institute of Technology Bombay. E-mail: [email protected]

Logic plays a vital role in the design and analysis of security policies and models. Several security languages in the literature are based either on the Horn-clause fragment of first-order logic or on some form of modal logic. SACL is a security language influenced by the SPKI/SDSI PKI, with several features designed to suit the specification and verification of authorization in large, open, distributed systems. Although several logic-based semantics exist for the naming scheme of SPKI/SDSI, none completely captures its features, such as authorization tags and delegation. In this paper, we provide a Logic Programming (henceforth referred to as LP) model for SACL which completely captures its features, including groups, the structure of authorization tags, and delegation. One of the unique features of our approach is that we provide detailed algorithms for tag intersection based on the structure of the input tags. Naturally, the approach leads not only to declarative specifications but also to executable specifications. Thus, the approach leads to a clear understanding of concepts like delegation, groups, and roles, and can be used to verify and validate properties of SACL specifications. We compare our approach with other related approaches. Using the vast body of knowledge available on LP, we can enhance the language with additional features while keeping it tractable.

Further, we show how lattice-based information flow can be specified in SACL. Such a specification allows us to specify fine-grained information flow policies, which play a vital role in a variety of applications, ranging from secure execution of third-party software in operating systems to limiting unwanted information disclosures in heterogeneous information systems where data comes from a variety of sources, each with its own policies. This model subsumes several well studied security models like Role-Based Access Control, the Bell-LaPadula model, the Chinese-Wall model and the Biba integrity model, since all these models are special cases of the lattice-based flow model. Thus, SACL provides a uniform framework for specifying and analyzing heterogeneous policies made up of several policy concepts.

I. Introduction

Access control is a very important aspect of systems security. In distributed systems the problem is referred to as trust management [7]. PolicyMaker[8], KeyNote[6] and SPKI/SDSI[14], [29], [15] are prominent trust management languages. SD3[19], Binder[13], SecPAL[4], Delegation Logic[22] and RT[25] are examples of logic-based distributed authorization languages. Abadi[2] presents an interesting survey and discussion of logical foundations for access control and their applications to programming security policies.

SPKI/SDSI has several novel features like linked local name spaces and delegation of authorization. The naming scheme of SPKI/SDSI has generated a lot of interest in the research community, and several semantics were proposed for a clear understanding. Clarke et al.[10] presented a semantics for names by considering the name certificates as rewrite rules. They extend their semantics to also consider authorization certificates and present an algorithm for discovering a certificate chain that proves the authorization of a client to access a resource.

Howell and Kotz[18] extend the ABLP logic[21], [3] with a restricted form of delegation and provide a semantics for SPKI/SDSI. Abadi[1] introduced a logic based semantics for SPKI/SDSI. Halpern and van der Meyden[16] developed the Logic of Local Name Containment and also extended the logic to deal with authorization [17]. Li and Mitchell[24] provide several semantics for the SPKI/SDSI naming scheme


based on string rewriting, set theory, LP and FOL. They members. We interpret auth tag as a set of permissions. We extend their FOL semantics to handle authorization. They say that an auth tag is more restrictive than another if the also showed the advantages of their semantics over several interpretation of the first tag is a subset of the interpretation others. of the second tag. Authorization being requested can also be In [27], we introduced the Secure Access Control encoded as a tag. We will respect the request if the requested Language (SACL) influenced heavily by SPKI/SDSI PKI, auth tag is more restrictive than the tag permitted for that formally modelled SACL specifications as Finite State principal. Transition Systems and have provided model checking The basic units of an SACL specification are name certs algorithms for verifying/validating several properties of and auth certs. SACL. We also presented an approach to enforce such ƒƒ Name certs allow a principal to bind principals/ names specifications using Security Automata [32]. to a local name in its name space. This binding can be In this paper, we give a LP model for SACL specifications extended to a binding of principals to arbitrary names in and show the advantages of our semantics over Li’s FOL a straightforward way [10], [24] semantics [24] for the SPKI/SDSI. Our model has the ƒƒ Auth certs allow a principal to authorize the members advantages that it is declarative and leads naturally to an of a name to specified operations on a resource and to execution model. Further, we can analyze the policy for possibly redelegate this authorization. required properties. We also show how the naming scheme Each certificate has a validity field specifying the of SACL can be interpreted as inducing a hierarchy over conditions under which the certificate is valid (typically range the name space. This hierarchy can then be used to model of dates). 
For the purposes of this paper, we assume that we several policy concepts and security models. In particular, are dealing with currently valid certificates, so we ignore the this allows us to model information flow policies which lead to validity filed of the certificates in this discussion. Wenow fine grained access control mechanisms in operating systems. describe the format of the certificates. This enables a uniform framework for combining mandatory A name cert in SACL is of the form (u, a, n) where, and discretionary policies. ƒƒ principal u is the issuer of this certificate 222 The rest of the paper is organized as follows: in Section 2, ƒ modelsmodelsmodelswe give and and and the policy policy policyLP model concepts concepts concepts of SACL can can can and be be be modelleddiscuss modelled modelled its inproperties. in in SACL. SACL. SACL. We We WeIn ƒAnAnAn(u auth auth, auth a) is cert cert certthein name in in SACL SACL SACL that has has has is thedefined the the form form form by((u,( u,thisu, n, n, n, d,certificate d, d, t t) t))where,where,where, and provideprovideprovideSection concluding concluding concluding3, we show remarks remarkshow remarks several in in in Section Section Sectionwell studied 4. 4. 4. security models ƒƒ this certificate states that all members of the namen are alsoprincipalprincipalprincipal membersuuuis isisof the the the issuer issuername issuer of( ofu of, thisa this this). certificate certificate certificate and policy concepts can be modelled in SACL. We provide ••• t denotes the authorization being granted by u to concluding remarks in Section 4. 
• Antt authdenotesdenotes cert thein the SACL authorization authorization has the form being being (u, n granted granted, d, t) where, by by uu toto 2 LP MODELMODEL OF OFSACL •• n 22 LP LP MODEL OF SACLSACL ƒƒ principalmembersmembersmembers u ofis of of the the the the issuer name name name ofnnn this certificate In [27], we introduced the secure access control language if d =0, it means that principals receiving this au- InIn2. [27], [27], LP we we Model introduced introduced of SACL the the secure secure access access control control language language ƒƒ• t ififdenotesdd =0=0 ,,the it it means meansauthorization that that principals principals being granted receiving receiving by this this u au- au-to SACL, with several features designed to suit authorization •• thorization are not allowed to redelegate it. If d =1, SACL,SACL,SACL,In with with with [27], severalseveral severalwe introduced features features features the designed designed designed secure access to to to suit suit suit control authorization authorization authorization language membersthorizationthorizationthorization of the are are are name not not not allowedn allowed allowed to to to redelegate redelegate redelegate it. it. it. If If Ifddd=1=1=1,,, specifications in large, open, distributed systems. In this it means that principals receiving this authorization specificationsspecificationsspecificationsSACL, with inseveral in in large, large, large, features open, open, open, designed distributed distributed distributed to suit systems. systems. systems. 
authorization In In In this this this ƒƒ if ititd means means= 0, that thatit means principals principals that receiving receivingprincipals this this receiving authorization authorization this section, we give a Horn-clause model for SACL and discuss are allowed to redelegate it section,section,section,specifications we we we give give give a a ain Horn-clause Horn-clause Horn-clause large, open, model model model distributed for for for SACL SACL SACL and and andsystems. discuss discuss discuss In this authorizationareareare allowed allowed allowed toare to to redelegate redelegate redelegatenot allowed it it it to redelegate it. If d = 1, its properties. itsitsitssection, properties. properties. properties. we give a Horn-clause model for SACL and discuss it means that principals receiving this authorization are AnAn access access control control specification specification in in SACL SACL consists consists of of a a set set its properties. Anallowed access to control redelegate specification it in SACL consists of a set ofofof name name name and and and auth auth auth certs. certs. certs. Given Given Given an an an SACL SACL SACL specification specification specification and and and 2.12.1 Overview Overview of of SACL SACL An access control specification in SACL consists of a set 2.12.1 OverviewOverview ofof SACLSACL ananan access access access request request request we we we want want want to to to decide decide decide whether whether whether this this this access access access of name and auth certs. Given an SACL specification and an InInIn this this this section, section, section, we we we provide provide provide a a a brief brief brief overview overview overview of of of the the the language language language shouldshould be be permitted. permitted. To To compute compute the the answer answer to to this, this, we we In this section, we provide a brief overview of the shouldaccess request be permitted. 
In this section, we provide a brief overview of the language SACL. SACL is derived from the SPKI/SDSI PKI, by treating a set of SPKI/SDSI certs as a policy specification for access control. The design goal of SACL was to extend the features of SPKI/SDSI for use in a flexible access control specification language (flexi-ACL [28]). For the purposes of this paper, we consider only that fragment of SACL which includes the name and auth certs as defined in SPKI/SDSI.
The basic components of SACL are:
• set of principals U
• set of name identifiers A, used to construct names
• set of names N = UA*, constructed from principals using zero or more name identifiers
• set of authorization tags T, used to represent sets of authorizations
Every principal in SACL has its own local name space. These name spaces can be linked using extended names. The interpretation of a name is the group of principals who are its members. We interpret an auth tag as a set of permissions. We say that an auth tag is more restrictive than another if the interpretation of the first tag is a subset of the interpretation of the second tag. The authorization being requested can also be encoded as a tag. We will respect the request if the requested auth tag is more restrictive than the tag permitted for that principal.
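Under this set-of-permissions reading, the restrictiveness test is a subset check and tag intersection is set intersection. The sketch below uses a discretised (amount, item) encoding of the purchase tags appearing later in this section; the encoding itself is our illustrative assumption, not SACL's S-expression tag syntax.

```python
# Tags interpreted as sets of permissions; here a permission is an
# (amount, item) pair, an illustrative encoding of tags such as
# (tag (purchase (* range le 200) (* set item1 item2 item3))).
def interp(limit, items):
    return {(amt, it) for amt in range(201) for it in items if amt <= limit}

t1 = interp(200, {"item1", "item2", "item3"})
t2 = interp(150, {"item1", "item2"})

def more_restrictive(ta, tb):
    """Subset test on already-interpreted tags."""
    return ta <= tb

def TI(ta, tb):
    """Tag intersection: the permissions authorized by both tags."""
    return ta & tb

assert more_restrictive(t2, t1)   # t2's permissions are a subset of t1's
assert TI(t1, t2) == t2           # hence their intersection is t2 itself
```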
The basic units of an SACL specification are name certs and auth certs.
• Name certs allow a principal to bind principals/names to a local name in its name space. This binding can be extended to a binding of principals to arbitrary names in a straightforward way [10], [24].
• Auth certs allow a principal to authorize the members of a name to perform specified operations on a resource and to possibly redelegate this authorization.

Each certificate has a validity field specifying the conditions under which the certificate is valid (typically a range of dates). For the purposes of this paper, we assume that we are dealing with currently valid certificates, so we ignore the validity field of the certificates in this discussion. We now describe the format of the certificates.
A name cert in SACL is of the form (u, a, n) where,
• principal u is the issuer of this certificate
• (u, a) is the name that is defined by this certificate and
• this certificate states that all members of the name n are also members of the name (u, a).

An auth cert in SACL has the form (u, n, d, t) where,
• principal u is the issuer of this certificate
• t denotes the authorization being granted by u to members of the name n
• if d = 0, it means that principals receiving this authorization are not allowed to redelegate it. If d = 1, it means that principals receiving this authorization are allowed to redelegate it.
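For concreteness, the two certificate forms can be written down as tagged tuples. The encoding and field names below are our own, chosen for illustration; they are not SACL syntax.

```python
from collections import namedtuple

# A name is a principal followed by zero or more identifiers, written
# here as a tuple, e.g. ("u1", "a") for the name (u1, a).
NameCert = namedtuple("NameCert", "issuer ident subject")         # (u, a, n)
AuthCert = namedtuple("AuthCert", "issuer subject delegate tag")  # (u, n, d, t)

# C1 = (u1, a, u2): the principal u2 is bound to the local name (u1, a).
C1 = NameCert("u1", "a", ("u2",))

# C6 = (self, (u1, a), 1, t1): self grants t1 to members of (u1, a);
# d = 1, so the recipients may redelegate the authorization.
C6 = AuthCert("self", ("u1", "a"), 1, "t1")
```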
An access control specification in SACL consists of a set of name and auth certs. Given an SACL specification and an access request, we want to decide whether this access should be permitted. To compute the answer, we define the notion of certificate composition, which is the only rule of inference in SACL.

1) Given name certs (u, a, (u1, b1, b2, ..., bm)) and (u1, b1, (u2, c1, c2, ..., cl)) we can compose them to obtain the name cert (u, a, (u2, c1, c2, ..., cl, b2, ..., bm)). If we can derive the name cert (u, a, u1) from the given set of certificates using composition, then we say that u1 is a member of (u, a).
2) Given an auth cert (u, (u1, b1, b2, ..., bm), d, t) and a name cert (u1, b1, (u2, c1, c2, ..., cl)) we can compose them to obtain the auth cert (u, (u2, c1, c2, ..., cl, b2, ..., bm), d, t).
3) Given auth certs (u, u1, 1, t1) and (u1, n, d, t2) we can compose them to obtain the auth cert (u, n, d, TI(t1, t2)), where TI is the tag intersection operator, i.e., the set of permissions authorized by both tags.

If we can derive the auth cert (u, u1, d, t) from the given set of certificates, where u is the owner of the resource t, then we say that u1 can access t.
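The three composition rules share one shape: the head (u1, b1) of a name is rewritten using a cert issued by u1. Assuming a tuple encoding of certs (our own, for illustration only), each rule can be sketched as a single-step function:

```python
# Certs as tagged tuples (an illustrative encoding of ours):
#   ("name", u, a, n)    for a name cert (u, a, n)
#   ("auth", u, n, d, t) for an auth cert (u, n, d, t)
# where a name n is a tuple (principal, id1, ..., idk).

def compose_name(nc1, nc2):
    """Rule 1: (u, a, (u1, b1, ..., bm)) + (u1, b1, (u2, c1, ..., cl))
    -> (u, a, (u2, c1, ..., cl, b2, ..., bm)); None if heads don't match."""
    _, u, a, n = nc1
    _, u1, b1, n2 = nc2
    if len(n) >= 2 and n[:2] == (u1, b1):
        return ("name", u, a, n2 + n[2:])
    return None

def compose_auth_name(ac, nc):
    """Rule 2: rewrite the head of the auth cert's subject name."""
    _, u, n, d, t = ac
    _, u1, b1, n2 = nc
    if len(n) >= 2 and n[:2] == (u1, b1):
        return ("auth", u, n2 + n[2:], d, t)
    return None

def compose_auth_auth(ac1, ac2, TI=lambda ta, tb: ta & tb):
    """Rule 3: (u, u1, 1, t1) + (u1, n, d, t2) -> (u, n, d, TI(t1, t2)).
    Applies only when the first cert permits redelegation (d = 1).
    Tags are modelled here as plain permission sets, so TI defaults
    to set intersection."""
    _, u, n1, d1, t1 = ac1
    _, u1, n, d, t2 = ac2
    if n1 == (u1,) and d1 == 1:
        return ("auth", u, n, d, TI(t1, t2))
    return None

# Composing C3 = (u1, a, (u1, a, b)) with C1 = (u1, a, u2) gives
# C8 = (u1, a, (u2, b)), as in the running example below.
C3 = ("name", "u1", "a", ("u1", "a", "b"))
C1 = ("name", "u1", "a", ("u2",))
assert compose_name(C3, C1) == ("name", "u1", "a", ("u2", "b"))
```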
We now present an example SACL specification which will be used as a running example for the rest of this section. Consider an SACL specification C, which consists of the following set of currently valid certificates: C1 = (u1, a, u2), C2 = (u2, b, u3), C3 = (u1, a, (u1, a, b)), C4 = (u1, a, u4), C5 = (u3, c, u5), C6 = (self, (u1, a), 1, t1), C7 = (u3, (u3, c), 0, t2), where t1 = (tag (purchase (* range le 200) (* set item1 item2 item3))) and t2 = (tag (purchase (* range le 150) (* set item1 item2))).

Given this access specification, should we permit u5 access to t3 = (tag (purchase 100 item2))?

1) We compose C3 with C1 to obtain C8 = (u1, a, (u2, b)) and compose C8 with C2 to obtain C9 = (u1, a, u3).
2) We compose C6 with C1 to obtain C10 = (self, u2, 1, t1). Composing C6 with C4 yields C11 = (self, u4, 1, t1).
3) Composing C6 successively with C3, C1 and C2 yields C12 = (self, u3, 1, t1).
4) We compose C7 with C5 to obtain C13 = (u3, u5, 0, t2).
5) Finally, we compose C12 with C13 to obtain C14 = (self, u5, 0, t2). Notice that the intersection of tags t1 and t2 yields t2, since t2 is strictly a subset of t1.
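The derivation steps above can be reproduced mechanically by saturating the cert set under composition. The sketch below is our own illustration under two simplifying assumptions: tags are flattened to finite permission sets (so TI is plain set intersection), and composed names are capped at the longest name occurring in C so that the naive fixpoint loop terminates (certs like C3 can otherwise generate ever-longer names; complete SPKI/SDSI name resolution requires a more careful algorithm in general).

```python
from itertools import product

# Certs as tuples ("name", u, a, n) / ("auth", u, n, d, t); a name n is a
# tuple (principal, id1, ..., idk). Tags are frozensets of (amount, item)
# permissions -- our simplification of SACL's S-expression tags.
def tag(limit, items):
    return frozenset((amt, it) for amt in range(201) for it in items
                     if amt <= limit)

t1 = tag(200, ("item1", "item2", "item3"))
t2 = tag(150, ("item1", "item2"))

CAP = 3  # longest name in C; keeps the naive fixpoint finite (see note above)

def compose(x, y):
    """One step of composition rules 1-3 on (x, y), or None if none applies."""
    if x[0] == "name" and y[0] == "name":        # rule 1
        _, u, a, n = x
        _, u1, b1, n2 = y
        if len(n) >= 2 and n[:2] == (u1, b1) and len(n2 + n[2:]) <= CAP:
            return ("name", u, a, n2 + n[2:])
    elif x[0] == "auth" and y[0] == "name":      # rule 2
        _, u, n, d, t = x
        _, u1, b1, n2 = y
        if len(n) >= 2 and n[:2] == (u1, b1) and len(n2 + n[2:]) <= CAP:
            return ("auth", u, n2 + n[2:], d, t)
    elif x[0] == "auth" and y[0] == "auth":      # rule 3, TI = intersection
        _, u, n1, d1, ta = x
        _, u1, n, d, tb = y
        if n1 == (u1,) and d1 == 1:
            return ("auth", u, n, d, ta & tb)
    return None

C = {("name", "u1", "a", ("u2",)),               # C1
     ("name", "u2", "b", ("u3",)),               # C2
     ("name", "u1", "a", ("u1", "a", "b")),      # C3
     ("name", "u1", "a", ("u4",)),               # C4
     ("name", "u3", "c", ("u5",)),               # C5
     ("auth", "self", ("u1", "a"), 1, t1),       # C6
     ("auth", "u3", ("u3", "c"), 0, t2)}         # C7

closure = set(C)
while True:                                      # saturate under composition
    new = {c for x, y in product(closure, repeat=2)
           if (c := compose(x, y)) is not None} - closure
    if not new:
        break
    closure |= new

# u5 may access t3 iff some auth cert (self, u5, d, t) with t3 a subset
# of t is derivable, self being the owner of the resource.
t3 = frozenset({(100, "item2")})
granted = [c[4] for c in closure
           if c[0] == "auth" and c[1] == "self" and c[2] == ("u5",)]
print(any(t3 <= t for t in granted))             # prints True, via C14
```

The saturation derives C14 = (self, u5, 0, t2), and t3 is more restrictive than t2, so the request is honoured.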
We An auth cert in SACL has the form (u, n, d, t) where, models and policy conceptsprovide concluding can be modelled remarks in in SACL. Section We 4. An auth cert in SACL has the form (u, n, d, t) where, principal u is the issuer of this certificate provide concluding remarks in Section 4. • principal u is the issuert denotes of this the certificate authorization being granted by u to • • 2 LP MODEL OF SACL t denotes the authorization being grantedn by u to • members of the name 2 LP MODEL OFInSACL [27], we introduced the secure access controlmembers language of the nameif dn=0, it means that principals receiving this au- • In [27], we introduced the secure access control language if d =0, it means that2 principals2 receiving this au- d =1 SACL, with several features designed to suit authorization• thorization are not allowed to redelegate it. If , models andmodels policy conceptsand policy canSACL, concepts be modelled with can several be in modelled SACL.specifications features We in designed SACL. inAn large, We to auth suit open, cert authorizationAn in distributedauth SACL cert has in the systems.SACL form hasthorization(u, In the n, this d, form t) arewhere,(u, not n, d, allowedit t means) where, to that redelegate principals it. If receivingd =1, this authorization provide concludingprovide concludingremarks inspecifications Sectionremarks 4. in Section in large,section, 4. open, we distributed give a Horn-clause systems. model In this for SACL andit means discuss that principalsare allowed receiving to redelegate this authorization it section, we give a Horn-clause model forprincipal SACL andu principalis discuss the issueru is of the this issuerare certificate allowed of this certificateto redelegate it its properties. • • its properties. 
t denotes thet denotes authorization the authorization being granted being byAn grantedu accessto by controlu to specification in SACL consists of a set • • 2 LP MODEL2 LP OFMODELSACL OF SACL members ofmembers the name ofn theAn name accessn controlof specification name and auth in SACL certs. consists Given an of aSACL set specification and2 2.1 Overview of SACL In [27], weIn introduced [27], we introduced the secure access the secure control access language control languageif d =0, itif meansd =0 that, itof means principals name that and receiving principals auth certs.an thisaccess receiving Given au- request an this SACL au- we wantspecification to decide and whether this access 2.1 Overview of SACLmodels and policy• concepts• can be modelled in SACL. We An auth cert in SACL has the form (u, n, d, t) where, SACL, withSACL, several with features several designed features to designed suit authorizationIn to this suit section, authorization we providethorization a briefthorization are overview not allowed arean ofaccess not the to redelegate allowedlanguage request to it. redelegateweshould If wantd =1 be, to it. permitted. decide If d =1 whether, To compute this access the answer to this, we In this section, we provideprovide a concluding brief overview remarks of the in language Section 4. specificationsspecifications in large, open, in large, distributed open, distributed systems.SACL. In systems. this SACL is In derived thisit means from thethatit SPKI/SDSI means principals thatshould PKI, receiving principals by be treatingpermitted. this receiving authorizationdefine To this compute theprincipal authorization notion theu ofis answercertificate the issuer to composition of this, this we certificatewhich is the only 2 SACL. 
SACL is derived from the SPKI/SDSI PKI, by treating • section, wesection, give a Horn-clause we give a Horn-clause model for SACL model and fora set discuss SACL of SPKI/SDSI and discussare certs allowed as a policyare to redelegate allowed specificationdefine to it redelegate thefor accessnotion it ofrulecertificate of inferencet denotes composition in the SACL. authorizationwhich is the only being granted by u to a set of SPKI/SDSI certsmodels as a policy and policy specification concepts for can access be modelled in SACL. We • An auth cert in SACL has the form (u, n, d, t) where, its properties.its properties. control.2 LP TheMODEL design OF goalSACL of SACL was to extendrule of the inference features in SACL. members of the name n control. The design goalprovide of SACL concluding was to extend remarks the featuresin Section 4. 1) Given name certs (u, a, (u ,b ,b , ..., b )) ofIn SPKI/SDSI [27], we introduced forAn use access in a the flexible controlAn secure access access specification access control control control specification specificationin SACL language consists in SACL of a consists setif dprincipal=0 of, a it set meansu is the that issuer principals of this certificate receiving1 1 2 thism au- of SPKI/SDSI for use in a flexible access control specification 1) Given name• • certsand (u,(u a,,b(u,1(,bu1,c,b2,c, ...,, b...,m)) c )) we can languageSACL, with (flexi-ACLof several name features [28]). andof auth name For designed thecerts. and purposes Givenauth to suit certs. an of authorization SACL this Given paper, specification an SACL specification andthorizationt denotes and are1 the1 not authorization allowed2 1 2 to redelegatel being granted it. If d by=1u, to 2.1 Overview2.1 of Overview SACL of SACLlanguage (flexi-ACL [28]). 
For the purposes of this paper, and (u1,b1, (u•2compose,c1,c2, ..., c theml)) towe obtain can the name cert 2 wespecifications2 consider LP MODELan only in access that large, OF fragment requestanSACL open, access distributedwe of SACL request want which to we systems. decide want includes whether toIn decide thisthe this whether accessit meansmembers this access that of principals the name receivingn this authorization In this section,In this we section, provide we a brief providewe consider overview a brief only of overview the thatname language fragment of and the auth languageof SACL certs as which defined includes in SPKI/SDSI. the compose them( tou, a, obtain(u2,c1,c the2, ..., c namel,b2, ..., b certm)). If we can derive section,Inmodels [27], we we giveshould and introduced a policy Horn-clause be permitted.should concepts the securemodel be can Topermitted. be for accesscompute modelled SACL control To and the compute in answerdiscuss language SACL. the to We this, answer weareAn toif allowed authd this,=0 cert we, to it in redelegatemeans SACL that has it principalsthe form (u, receiving n, d, t) where, this au- SACL. SACLSACL. is derived SACL from is derived thename SPKI/SDSI from and the auth SPKI/SDSIPKI, certs by as treatingThe defined PKI, basic by in componentstreating SPKI/SDSI. of SACL are: (u, a, (u2,c1,c2, ...,• cthel,b2, name ..., bm)) cert. If( weu, a, can u1) derivefrom the given set of itsSACL, properties.provide withdefine concluding several the notionfeaturesdefine remarks of the designedcertificate in notion Section toof composition 4. suitcertificate authorization compositionwhich is thewhich only thorization is the only are not allowed to redelegate it. If d =1, a set of SPKI/SDSIa set of SPKI/SDSI certs as a policy certsThe specificationas basic a policy components specification for access of SACL forrule access are: of inferencerule of in inference SACL. in SACL. 
the name cert (u,certificates a, u1) from using the composition given set of then we say that u1 specificationsset of principals in large, open, distributed systems. In this An accessitprincipal means control thatu specificationis principals the issuer receivingin of SACL this certificate consists this authorization of a set • certificates using composition• then we say that u control. Thecontrol. design Thegoal design of SACL goal was of to SACL extend was the to features extend the features U of nameis and at memberdenotes auth certs. of the(u, Given authorizationa) an SACL1 being specification granted and by u to set of principals section,set of wename give identifiers a Horn-clause, used model to construct for SACL names and discuss • are allowed to redelegate it of SPKI/SDSIof SPKI/SDSIfor use in a flexible for use in access• a flexible control access specification2.1 controlU• 2 Overview specificationLP MODEL1) of SACLGiven OF SACL1)A nameGiven certs name(u,is certs a, a( memberu1,b1(,bu,2an a,, of ...,(2) access(uu, b1m,b a))Given)1,b request2, ..., an bm auth we)) certwant(u, to(u decide1,b1,b2 whether, ..., bm), d,this t) accessand a set of name identifiersits properties.set, of usednames to construct= ∗ names, constructed from principals members of the name n • • and (uand,b , (u ,c(u,c,b2),, ...,(uGiven c,c)) ,c an, ...,we auth c )) cert can(u,we(u ,b ,b can, ..., b ), d, t) and a language (flexi-ACLlanguage [28]).(flexi-ACL For the [28]). 
purposes For the of purposes thisIn this paper,In section, of [27],A this we paper, we introduced provideN UA a the brief1 secure overview1 2 access11 of21 the control language2l 1 language2 shouldl Anname be access permitted.if 1 certd =01 control(u2,1,b it To1 means, specification(mu compute2,c1 that,c2, principals ...,the in cl answerSACL)) we receivingcan consists to compose this, of this we a set au- set of names = using∗, constructed zero or more from name principals identifiers • we considerwe only consider that fragment only that of fragment• SACL which of SACL includesNSACL. whichUA the SACL includes is derived thecompose from thecompose them SPKI/SDSI to them PKI, obtain byname to treating the obtain cert name(u1,b the1, ( certuthem2 name,c1,c to2, obtain..., cert cl)) we the can auth compose cert (u, (u2,c1,c2, ..., cl,b2 using zero or moreSACL, nameset of with identifiersauthorization several features tags , designed used to represent to suit authorization sets of defineof name the notionthorization and auth of certificate certs. are not Given composition allowed an SACL towhich redelegate specification is the it. only If d and=1, name and authname certs and as auth defined certs in as SPKI/SDSI. defined in SPKI/SDSI.a set2.1• of SPKI/SDSIOverview of certs(u, SACL a, as(u2 a,c policy1(,cu,2 a,, ..., specification(u c2l,c,b12,c, ...,2, ..., bm forthem c))l,b. access If2, to ..., we obtain bm can)). derive If the we, auth ..., can bm cert) derive, d,( tu,) (u2,c1,c2, ..., cl,b2 set of authorizationspecifications tagsauthorizations, used in to large, represent open,T sets distributed of systems. In thisrulean of access inferenceitrequest means in SACL. 
that we principalswant to decide receiving whether this authorization this access The basic componentsThe basic components of SACL• are: of SACL are:control.In this The section, design we goalthe provide of name SACL athe briefcert was name overview(u, to a, extend u cert1) offrom the(u, the, ...,features a, language the u bm1)), givenfrom d, t) the3) set of givenGiven set auth of certs (u, u1, 1,t1) and (u1,n,d,t2) authorizations section,T we give a Horn-clause model for SACL and discuss should beare permitted. allowed to To redelegate compute it the answer to this, we of SPKI/SDSISACL.Every principalSACL for is use derived incertificates in SACL a flexible from has usingcertificatesthe itsaccess SPKI/SDSI own composition control local using3) namePKI, specification compositionGiven then by space. treating we auth say then certsthat1) weuweGiven1(u, say ucan1 that, 1,t compose name1u)1 and themcerts(u1,n,d,t to(u, obtain2) a, (u1,b the1,b auth2, ..., b certm)) set of principalsset of principals its properties. define the notion of certificate composition which is the only • • U EveryU principalThese inlanguage SACLa set name of has (flexi-ACL SPKI/SDSI spaces its own canis [28]). local a becerts member linkedname For as ais the using policy of space. a( member purposesu, extended a specification) of of( names.u, thiswe a) for paper, can The access compose( themAnandu, n, access d, toTI( obtainu control(t11,b,t12,))( theu specification,2 where,c auth1,c2TI, ..., certis c inl2 the)) SACL tag intersectionwe consists can of a set set of name identifiersset of name identifiers, used to construct, used to names construct names rule of inference in SACL. • • AThese nameA spaces caninterpretationwecontrol. be consider linked The onlyusing of design2) names that extendedGiven fragment goal is a an2) of group names. 
SACLauthGiven of SACL of cert The was principals an(u, whichto auth( extendu1,b cert includes who1,b( the(u,2u,, are ...,n, features(u d,the b1 itsm,bTI)1,,b d,(t21 t,,t) ...,and2of)) b,m nameoperator wherecompose a), d, t) andTIand i.e.is auth thema the the certs. tag set to intersection of Given permissions obtain an SACL the authorized specification name cert by and 2 set of namesset of=names∗,models constructed= and∗, constructed frompolicy principals concepts2.1 from Overview can principals be modelled of SACL in SACL. We An auth cert in SACL has the form (u, n, d, t) where, • • N UA interpretationN UA of namesmembers.nameof is SPKI/SDSI and a group We auth interpret of certs for principalsname use as auth defined in cert a flexibletag who(uname in as1 are,b SPKI/SDSI. a1 access set, its cert(u2 of,c( controlu permissions.11,c,b21,, ...,( specificationuoperator c2l,c))1we,c We2, can i.e...., composec thel))anwe set1)both access( canu, ofGiven a, composepermissions the(u request2 tags,c1,c name2, we ..., authorized cwantl,b certs2, ..., to b decidem by())u,. a, If whether( weu1,b can1,b2 derivethis, ..., baccessm)) using zero orusing more zero name orprovide more identifiers name concluding identifiers remarks in Sectionmodels 4. and policy concepts can be modelled in SACL. We An auth cert in SACL has the form (u, n, d, t) where, members. We interpretsaylanguageThe auththatIn this basic an tag section, auth as (flexi-ACL components a tagset wethem of is providepermissions. more [28]). toofSACL obtain restrictive For athem brief the are: the We to overviewpurposes auththan obtain anothercert the of of(bothu, the auththis( ifu language the2the paper,,c cert1 tags,c(2u,, ...,(ushould c2l,c,b1the2,cand2, be name ..., permitted. 
cl,b cert2(u1,b(1u,, To( a,u2u compute,c1)1,cfrom2, ..., the c thel)) answer givenwe to set this, of can we set of authorizationset of authorization tags , used tags to represent, used to sets represent of setsprovide of concluding remarks in Sectionprincipal 4. u is the issuer of this certificate • • , ..., b ), d, t,) ..., b ), d, t) • If we can derive the auth cert (u, u1, d, t) from the given sayT that an authT taginterpretation iswe moreSACL. consider restrictive SACL of only the is derived first thatthan tagfragmentm another from is a subset the if of them SPKI/SDSI SACL of the which interpretation PKI, includest denotes by treating the the authorizationdefinecertificatescompose the notion being using themof grantedcertificate compositionprincipal to by composition obtainuuisto then the issuer wethewhich say of name2that this is the certificateu1 cert only 2 authorizationsauthorizations set of principals If• we can deriveset the of auth certificates, cert (u, u where, d,• t)ufromis the the owner given of the resource t then interpretation2 LP MODEL of the OFof first theSACLname•a tag second set isand of a SPKI/SDSI subset tag.auth3) Authorization certs ofGiven theU as certs interpretation defined3) auth asGiven being a certspolicy in SPKI/SDSI. requested auth( specificationu, u1 certs, 1 can,t1)( alsou,and for u1 be, access1(,tu1),n,d,tandrule2)is( ofu a(1 inferenceu, member,n,d,t a, (u122),c of in1,c( SACL.u,2, a ...,) cl,b2, ..., bm)). If we can derive models and policyset ofmodels conceptsname identifiers and can policy be modelled,concepts used to construct in can SACL. be modelled names Wemembers inAn ofSACL. auth the name certWe inn SACLAn auth has the certt formdenotes in SACL(u, n, the has d, t authorization) thewhere, form (u, n, being d, t) where, granted by u to • we can composewe can them composeset of to certificates, obtain them the to where obtainwe auth sayu certis the that the authu owner1 can cert accessof the• resourcet. 
t then Every principalEvery in principal SACL hasof inIn the its SACL [27], secondown we has local introducedtag. its name ownAuthorizationencoded localspace.control. theThe nameas secure basic a beingThe tag. space. components access designWe requested2 will LP control goal respectMODELA canofSACL language the also requestOF be wasare:SACL to if extend the requestedif thed =0 features, it means2) thatGiventhe principals an name auth certcert receivingmembers(u, ( a,u1 thisu,b1) of1,b au-from the2, ..., name bthem),n givend, t) and set a of provide concludingset ofprovidenames remarks concluding in= Section∗, constructedremarks 4. inwe Section from say• principals that 4. u can access t. These nameThese spaces name can be spaces linkedencoded can using be linked extended as a tag. using We names.auth extended will•of tag respectThe SPKI/SDSI is names. more the request Therestrictive forN(u, use n, ifUA ind, the thanTI a flexiblerequested((tu,1 the,t n,2)) tag d,access,TI where permitted(t control1,tTI2)),is wherespecificationfor the that1 tagTI intersectionis theWe tag now1)name intersectioncertificates presentGiven cert (u1 an,b usingname1 example, (u2 composition,c1 certs,c SACL2, ..., c specification(l))u, thenwe a, (u can we1,b compose say1,b which2 that, ..., bum1)) SACL, with several featuresusingset designed zeroof principals orIn to more [27], suit name we authorization introduced identifiers the securethorization access control areprincipal not language allowedu is the to issuerprincipal redelegate ofif thisdu it.is=0 certificate theIf ,d it issuer=1 means, of thatthis certificate principals receiving this au- • operator i.e.operator the set i.e. 
Every principal in SACL has its own local name space, and these name spaces can be linked using extended names. The interpretation of a name is the group of principals who are its members. We interpret an auth tag as a set of permissions, and we say that one auth tag is more restrictive than another if the interpretation of the first tag is a subset of the interpretation of the second tag. The authorization being requested can also be encoded as a tag: we will respect the request if the requested auth tag is more restrictive than the tag permitted for that principal.

Name certs allow a principal to bind principals/names to a local name in its name space; this binding can be extended to a binding of principals to arbitrary names in a straightforward way [10], [24]. Auth certs allow a principal to authorize the members of a name to perform specified operations on a resource, and possibly to redelegate this authorization.

Each certificate has a validity field specifying the conditions under which the certificate is valid (typically a range of dates). For the purposes of this paper, we assume that we are dealing with currently valid certificates, so we ignore the validity field of the certificates in this discussion. We now describe the format of the certificates.
A name cert in SACL is of the form (u, a, n) where:
• principal u is the issuer of this certificate
• (u, a) is the name that is defined by this certificate
• the certificate states that all members of the name n are also members of the name (u, a)

An auth cert in SACL has the form (u, n, d, t) where:
• principal u is the issuer of this certificate and the owner of the resource
• t denotes the authorization being granted by u to the members of the name n
• if d = 0, principals receiving this authorization are not allowed to redelegate it; if d = 1, they are allowed to redelegate it
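Both certificate forms can be written down directly as ordinary tuples. The following minimal sketch is illustrative only; the tuple encoding and the "name"/"auth" discriminator strings are ours, not part of SACL:

```python
# A name is a tuple: a principal followed by zero or more identifiers,
# e.g. ("u1", "a", "b"); a bare principal u2 is encoded as ("u2",).

# Name cert (u, a, n): all members of n are also members of the name (u, a).
name_cert = ("name", "u1", "a", ("u1", "a", "b"))

# Auth cert (u, n, d, t): issuer/owner u grants tag t to the members of n;
# d = 1 permits redelegation, d = 0 forbids it.
auth_cert = ("auth", "u1", ("u1", "b"), 0, ("tag", ("purchase", 100, "item2")))

# The issuer is the second component of either kind of certificate.
assert name_cert[1] == auth_cert[1] == "u1"
```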
An access control specification in SACL consists of a set of name certs and auth certs. Given an SACL specification and an access request, we want to decide whether the access should be permitted. To compute the answer, we define the notion of certificate composition, which is the only rule of inference in SACL:

1) Given name certs (u, a, (u1, b1, b2, ..., bm)) and (u1, b1, (u2, c1, c2, ..., cl)), we can compose them to obtain the name cert (u, a, (u2, c1, c2, ..., cl, b2, ..., bm)).
2) Given an auth cert (u, (u1, b1, b2, ..., bm), d, t) and a name cert (u1, b1, (u2, c1, c2, ..., cl)), we can compose them to obtain the auth cert (u, (u2, c1, c2, ..., cl, b2, ..., bm), d, t).
3) Given auth certs (u, u1, 1, t1) and (u1, n, d, t2), we can compose them to obtain the auth cert (u, n, d, TI(t1, t2)), where TI is the tag intersection operator, i.e., the set of permissions authorized by both the tags.

If we can derive the auth cert (u, u1, d, t) from the given set of certificates, where u is the owner of the resource t, then we say that u1 can access t.
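As a concrete reading of the three composition rules, here is a small sketch (the tuple encoding and function names are ours; TI is passed in as a parameter rather than implemented in full):

```python
# Certs as plain tuples: name cert (u, a, n) -> ("name", u, a, n);
# auth cert (u, n, d, t) -> ("auth", u, n, d, t). A name is a tuple of a
# principal followed by zero or more identifiers, e.g. ("u1", "a", "b").

def compose_names(nc1, nc2):
    """Rule 1: (u, a, (u1, b1, ..., bm)) + (u1, b1, (u2, c1, ..., cl))
    yields (u, a, (u2, c1, ..., cl, b2, ..., bm))."""
    _, u, a, n1 = nc1
    _, u1, b1, n2 = nc2
    if len(n1) >= 2 and n1[0] == u1 and n1[1] == b1:
        return ("name", u, a, n2 + n1[2:])
    return None  # the rule does not apply

def compose_auth_name(ac, nc):
    """Rule 2: rewrite the subject name of an auth cert in the same way."""
    _, u, n1, d, t = ac
    _, u1, b1, n2 = nc
    if len(n1) >= 2 and n1[0] == u1 and n1[1] == b1:
        return ("auth", u, n2 + n1[2:], d, t)
    return None

def compose_auths(ac1, ac2, TI):
    """Rule 3: (u, u1, 1, t1) + (u1, n, d, t2) yields (u, n, d, TI(t1, t2))."""
    _, u, n1, d1, t1 = ac1
    _, u1, n, d, t2 = ac2
    if d1 == 1 and n1 == (u1,):
        return ("auth", u, n, d, TI(t1, t2))
    return None

# For instance, composing (u1, a, (u1, a, b)) with (u1, a, u2)
# by rule 1 yields (u1, a, (u2, b)):
assert compose_names(("name", "u1", "a", ("u1", "a", "b")),
                     ("name", "u1", "a", ("u2",))) == ("name", "u1", "a", ("u2", "b"))
```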
allowedauth SACL [28]).interpretationEach to tag isredelegateEvery certificate derivedFor isSACL. more the4 principalfrom purposesof restrictiveit SACL hasof the1 SPKI/SDSI the a firstC in validityis4 SPKI/SDSI4derivedSACL of=( tag than, thisu5 is1 forfield, thehas a a, paper, from subsetuse u tagPKI,4 its3 specifying), in the ownC permittedby ofa55 SPKI/SDSIflexible=( thetreating, localu6 interpretation the3,c,uand name accessfor condi-5 PKI,thatdefine), space.C control6 by1(=(u the1 treating,bselfWe specification1 notion, (If,1u now( we2u,,c1,a of1define canwe,c present)certificate,21, derive,t ...,can1 the) c,l an)) compose1) notion the composition example authGivenS. weA of byte-string certthemcertificate SACLwhich name can to specification1 compositionobtain isrepresents the certsfrom only the theatomic which auth(u, given a, (permissions certisu1 the,b1,b only2, ..., that bm ))can pals/namespals/names to a local name to a local in its name name in space. its name This space. This 1) We compose C3 (uwith, a, (uC1,b))toand obtain composeC8 C= with C to obtain its properties. weEach consider certificatea only set ofthat hastionsprincipal. SPKI/SDSIof afragmentThese validityunder the second namea whichC ofcerts setfield7 =( SACL tag.spaces of asspecifyingthelanguageu SPKI/SDSI3 Authorization a, which( certificate policyu canC37,c=( be), includes(flexi-ACL the0 specification linkedu,t certs3 is2, condi-)(,u validbeing where3 as using,c the) a, [28]). (typically0 requestedpolicy,tt extendedfor12=() For, access where specificationtag the rangecompose can( names.purchaset purposes1 alsorule=( of beThetag of forwill inference( them( of accesspurchasesetrangebe this of certificates,used to1 paper, inrule( SACL.u, obtain asrange n, of2 ad, inference whererunningTI the(t1,tandunot2 in nameis)) example be, theSACL. 
where further owner(u cert18TI for,b broken1 of,is the( theu the2,c resourcerestdown2 tag1,c2 intersection of ,and ..., thist calsothenl)) form wethe basis can from binding canbinding be extended can be to extended a bindingAn access to of a principals binding control of specification to principals to in SACL consists of a set (u , a, (u ,b))∗andC compose=(∗u ,C a, uwith) C to obtain tionsname under and auth whichcontrol. certs thedates). The asencoded certificateTheinterpretation defined design For basic thecontrol. aslegoalin is units a200) purposes validSPKI/SDSI. tag. ofwe of(of The We SACL (typically namesset consideranle will design of item1SACL200) was thisrespect is a( goalrangeonly to item2 specificationpaper, groupset extend the of that item1 of SACLitem3requestwe of fragment the principals assume item2))) featureswas are ifand thename to item3 of that( extendtu,requested1 who2 SACL=( a, certs))) we( aretagu2and thewhich,c its(section.1purchase features,ct2we2=( includes, ..., saytag c Consider9l,b that(purchase2operator the, ...,u11 bcanm8 an))3 i.e.access. SACL If the wecomposewhich2t.set can specification complex of derive permissions them tags ,are to which authorizedconstructed obtain con- the by name cert arbitrary namesarbitrary in a straightforwardnamesof in a name straightforward wayand auth[10], [24] certs. way Given[10], [24] an∗ SACL specification∗ and C =(u , a, u ) C The basic components of SACL( range are: le (150)range ( set le item1150) ( item2set item1))). the9 item2 name11)))).sists certGiven32) of(Weu,We the a, now name uboth compose1 following)1) presentfrom theGiven certs tagsƒ theC anƒ set6 ( examplesimple-expr givenu,with of( nameu, a, currently( a,uC set(u,c1 SACL ,bis certs,cofto a,b ,validbyte-string obtainspecification...,, ..., c(,bu, bcertificates: a,,C)) ..., 10(followedu b,b which=)),b. Ifby, ..., wezero b can or)) more derive tag- 2.1 Overview of SACL dates). For theof purposesSPKI/SDSIareandauth dealingauthmembers. 
of for tag this certs useof withis paper,∗. SPKI/SDSI morein We a currently interpretflexiblename we restrictive assume∗ and for access valid auth use auth∗ than that incontrol tag certificates, certs a the weflexible as aas tagspecification∗ set defined permitted access of so permissions. we in control ignore SPKI/SDSI. for thatspecification We 2 11 12 2 l m2 1m 1 2 m Auth certs allowAuth a certs principalallow to aan principal authorize access requestto the authorize members we want the members to decide whether this access 2) We compose C6 (withself ,u2C,11,tto1). Composing obtain C10C6=with C4 yields C11 = • • language (flexi-ACLsay thatlanguage an [28]).Given auth For (flexi-ACL tagThe this the is basicGiven accessGiven more purposes components [28]). restrictivethis this specification, Foraccess of access this the thanof specification,specification, paper, purposesSACL should anothercertificates are:we of if this shouldthe permitshouldC usingand1 paper,will=( we u compositionu be51we,a,u permit( usedu 1,bpermit2) 1u,, as5(Cu uand52 then2 a,c=( running1,c weu2theexpr2,, ...,( b,sayu.1 uname c,bsimple-expr example3l)) that)1,,C(u3 certu2,c1=(we1 for,c is(uu,21 used,, the a,..., a, u can( c u1 resttol)))1 ,represent a,from of b)) this,we the permissions given can set over of a In this section, we provide a brief overview of the languageare dealingset of withprincipals currentlytheprincipal. validity valid filed certificates, of the certificates so we ignore in this discussion. We of a name toof specified a name to operations specified•should onoperations be a permitted. resource on and aName To resource compute certs andallow theaccess answer a to principal t to = ( this,tag (purchase we to bind 100(self princi- item2,u2, 1))?,t1). 
Composing(Ifself we,u can4C, 16 derive,twith1) C the4 yieldsparticular authC cert11 resource=(u, u1, d, t) from the given the validity filedwe consider ofnow the•Uinterpretation describe certificates onlyThe basicthatweaccess the consider fragment units in format ofto this thet3 ofaccess=( only discussion.of firstof an thetag SACL SACL that tag to certificates.(purchase3t is fragment3 whichspecification a=( We subsettag includes100( ofpurchase of SACLitem2 the are the interpretationis))? whichname100 a memberitem2 certs includesC))?compose4 ofsection.=((u, theu a1), a, Consider u them4), C5compose=( to anu3 obtain SACL,c,ucertificates them5), specificationC the6 =( using to nameself obtain composition, (u,1 cert,a which), the1,t1 then) con-, name we say cert that u1 SACL. SACL is derived from the SPKI/SDSIto possibly PKI, redelegateto possibly by treating this redelegate authorization.defineset this ofthename authorization. notion identifiers of certificatepals/names, used composition to to construct a localsetwhich of name namesprincipals is thein its only name space.(self ,u This4, 1,t1) 3)setComposing of certificates,C where6 successivelyu is the owner with ofC3 the, CC1 resourceand C2t then • name and auth certsname as defined and auth• 1) in certs SPKI/SDSI.We compose as defined C3 with in SPKI/SDSI.2) C1 toGiven obtain an C8 auth C=( 7(u,usists=(1 cert,a, au(, u(3 ofu(2,u,2,c(,u theb(13u)),c,c 1and2,b) following,1 ...,0,b ,t(u, c2l,),b, ...,a,ƒƒ where2(, bu ...,mis setset-expr2,c) a, bm d,1t member of1,c)) t=()2. currently,hasand If...,tag we cthe ofla,b(purchase can(2structureu,, ..., valid a derive) bm))( certificates:. setrange If wefollowed can deriveby zero or more a set of SPKI/SDSI certs as a policy specification for accessnow describeset of the formatandA=of of name theauth theA second certificates.cert, certs constructed in. tag. 
SACL Authorization is from of the principals form being(u,U a, requested n) where, can also be ∗ rule of inferencenames in SACL.binding∗ can1) beWe extended composeset1) of Wename to aC binding compose identifiers3 with of3) principalsC,13 usedComposingtowith to obtain to constructC1 CtoC6 8wesuccessively obtain names=yields say thatCC812u1 with=(=canself accessC3,u, C3,11t.,tand1) C2 • The basic componentsThe basicof SACL• componentscompose are: C of8 with SACL C2 to are: obtainname C9 cert= (u(1le,u athe1200),,bC u131), name=(( (u2set,cu11,a,u cert,c item12,2 ...,()u,the, item2 cC a,l2)))2 u namewe=(1) item3Giventag-expr canfromu2, cert composeb,))) an u. the 3andset-expr() authu,, C givena,t32 certu=( 1is) usedtag(u setfromu,1,( a,purchaseuto of1( group,b theu11,,b a, given2 permissions/ b,)) ...,, bm set), d, of t )resourcesand a control. The design goal of SACLEach was certificate to extendEach has certificatethe a features validity has fieldA a name validityusing specifying cert zero field in SACL theor specifyingN more condi-encoded isUA of name the the as formidentifiers condi- a tag.(u, We a, n will) where, respect the requestA if the requested ∗ principalarbitraryNameu names certsis( theu1, inallow issuer a,set a(u straightforward2 of,b of a(names))u this1 principal,and a, certificate(u compose2,b=)) way toand∗ [10],,bindCyields constructed8 composethemwith [24] princi-C to12C obtain2C=((to8 fromselfwithrangeC4) obtain the principals,u=(WeWeC3 leauth2,u1150)to now,t, compose a,1 certobtain) u ( present),(setu,C ( item1Cu=(7 an,ctogetheruwith example,c,c,u item2, ...,C )for))) c,5 C,b SACL .succinctto=( obtainself specification presentation, (uC,a13), 1= which,t ), tions undertions which under the certificate which the is certificate valid (typically is valid range (typically• auth• of tag range is more of restrictive• 2) We compose than the C tag6 with permitted C1 to obtain for thatC10 certificates= (self4 , u12, using1, t14). 
compositioncertificates5 2 name13 2 using then cert5 (l composition weu612,b say1, (u that2,c1u then,c11 2, ..., we cl say1)) we that canu compose1 of SPKI/SDSI for use in a flexible access control specification 1)set ofGivenauthorization name tags certs, used( tou, a, represent(u1,b1,b sets2, ...,N of bm))UA ∗ ∗ • principal u isset the of issuerprincipals(Authu,pals/names a of) certsis this theallow certificateset nameC9 of to a=(principals principal ausing thatu localComposing1, a, isC zero uname defined93 to=() authorizeor uC in more1, with byits a, u namethis name3 4)C) the yields certificate membersWe space. identifiers, ..., C compose b This)=, (self d, t) ,C uCwill,7 =(1,( withut beu3),u, ( used5u,C0,c5,t2) as,to)0,t aƒ obtain) runningC13t example=(=tag ( forpurchase the rest( range of this dates). Fordates). the purposes For the of purposes this paper,• of wethis assume paper,• we that••principal. assume we T that•U we U 6 4 11m isGiven a74 member1 this3 of3 access(u,is a) a specification, member2ƒ , wherethemprefix-expr of to1( obtainu, should has a) the the westructure auth permit cert (prefixuu,5 (u2 followed,c1,c2, ...,by cal ,bbyte-2 language (flexi-ACL [28]). For the purposes of this paper, authorizationsand set(u of1,bname1, (u identifiers2,c1,c2set, ..., of c,lname used))set of toidentifiersauthorizationwe construct can, names used tags to construct, used to names represent sets of ∗ (u, a) is the• name thatandof abinding is name defined•2) to canWe specified by•be extendedcompose this2) certificateoperationsWe toC compose a6 bindingwith on a resourceC3)C of16 principals(utoGivenwith3,u obtain and5, 0C auth,t1 to2)toCle105) certssection. 
obtain200)=Finally( (u,setC u Consider10 we, item11,t compose=) item2and anstring SACLC( item3u12,n,d,t.with prefix-expr specification))) Cand2)13 tto represents=( obtaintag, which(Cpurchase14 the= set con- of all words which are dealingare with dealing currently with valid currently• certificates, valid certificates, so we ignore soThe we basic ignore unitsA3) of anComposing SACL specification CA6 successivelyT are withname C2) certs3, Caccess1Given and toC an2 tyields3 auth=(12) tagC cert12 (Given1purchase(u, (u an1,,b ..., auth1100,b b1m2),item2, cert..., d, bt)m())?u,),( d,u21 t),band1,b2 a, ..., bm), d, t) and a we consider only that fragment of SACL which includes the composeset of themnames to= obtainset(self of∗,names,u constructedauthorizations the, 1,t name)= from cert∗, principals constructedC fromC principalsC =C ∗ C C = C Everyand principal• in SACLthisto possiblyarbitrary has certificate its• own redelegate names states local in2 that= name a this(self straightforward( all1self authorization.,. space.u members Composing,u, 1,2 ,t1),t1) of5). wayComposing the6Finallywith namewe [10], can [24]4n weCyields6 compose composewithname( sists11Crange4 cert(yieldsself them of12( le,uu thewith150)C5,b to,110 following,,t obtain(=name (u213).3)set,c Noticeto,c cert item1 obtain theGivenhave, set ...,( thatu auth the item2cof,b)) theauth14 currentlybyte-string,we( certu intersection))) can,c certs. ,c compose validas,( ...,u, their of u c )) tagscertificates:, 1 prefix.we,tt)1 canand composeThis( ucan,n,d,t be used2) to the validitythe filed validity of the filed certificates of the certificatesin this(u, discussion. a, in(u this,c• and,c discussion. We,auth ..., cN certs,b , We...,. UA b )). If weN can3UA derive1 1 1 2 1 2 1 l 1 2 1 2 l1 1 1 name and auth certs as defined in SPKI/SDSI. 
this certificateusing2 states1 zeroare2 thatAuth also or alll more memberscerts members2 using nameallow(selfm zeroof,u of aidentifiers4 the principal the, 1 or,t name( name more1self) ,u to(u, namen4 authorize, a1),t. 1 identifiers) the(self members,u5, 0,t2). Notice∗ and thatt yields the intersection∗t sinceC t ofis tags strictlyC t1 a subset ofCt = These• name spaces can be linked using extended4) We names. compose The C with C to( obtainu, n, d, CTI =(t them(1u,t1),C 2u))1Given to,,We =(0, whereobtain t u)2 compose1 this,a,uTI the access2themis), auth the2C2 totag3specification,=( certwesuccinctly obtainwith intersection2u( can2u,, b,(u the u compose 213represent,c), auth1 shouldtoC,c32 obtain,=( cert ..., themaccess wecul(1,bu,, a,2 permit( to u18(tou2,c1 obtain,hierarchically1 a,,cu b25)), ...,, the cl,b auth2 arranged cert now describenow the describe format of the the format certificates. of the certificates.the nameEach cert• certificate(u, a, u ) hasfromEvery a validity the principal given field set specifyingin SACL7 of has5 the its condi- own13 local name3 5 space.2 The basic components of SACL are: are also membersset of ofauthorization theofName name a name3)1(u,set certs tagsComposing toa of). 
specifiedauthorizationallow,3) usedComposing aC operations to6 principal representsuccessively tags onC, used6 sets to asuccessively resource withandbind of to representt2Cyields princi-3, andC with1 tand sets2 sinceC3C of,(2uCt21,is a,and strictly(u ,bC2)) aand subsetresources compose of t1 likeC directorywith C structuresto obtain interpretation of• names is• a group• ofThese principals5) name Finally who spaces arewe cancompose its be linked C with usingoperator C extended to obtain i.e., ..., theaccessC names.C b4m set=( )=, to d,1(self ofut Thet13) permissions, =( a,, u2 u,tag4 ,), ...,C(purchase5 bm=() authorized(,u, d,u n,3 t,c,u) d,100TI5)item2,( bytC816,t=(2))?)),self where2 , (u1TI,a)is, 1 the,t1) tag, intersection A name certA in name SACL cert is ofin theSACL form is of(u, the a,certificates n form) where,tions(u, a, using under n) where, composition which the certificate thenT wesay is valid that (typicallyuT1 12 range13 of 14 5 authorizationstopals/names possiblyauthorizationsyields redelegate toC a12 local=(yields this nameself authorization.C,u12 in3,=(1 its,t1self name) ,u3 space., 1,t1) This C9 =(u1, a, u3ƒ) set of principals members. We interpret auth tag as a set of permissions.0, t ). Notice We that the intersectionboth the 3)of tags tagsGiven tC and7 =( autht yieldsu3, (3) certsu 3t,c Given),(0u,,tƒ u2)1operator,range-expr auth, where1,t1) certstand1 i.e. =(has( theu,(tag uthe1 u,n,d,t set1( ,purchase1structure,t of12)) permissionsand ( rangerange(u1,n,d,t authorizedfollowed2) byby the3 • is a memberdates). 
of For(u, athe) purposesinterpretation of this2 paper, of names we is assume a group that of principalswe 1 who are2 its 2 ∗ U principal u principalis the issueru is of the this issuer certificate of this certificate binding4) canWe be compose extended4) WeC tocompose7 awith bindingCC5 of7 principalstowith obtainC5 totoC132) obtain1)=WeWeC compose13 compose= C CwithwithC Ctoto obtain obtainC C = = set of name identifiers ,• used to construct• names say that an authEvery tag is principal more restrictive inEvery SACLmembers. principalthan has its anothersince We own in interprett SACL is local if strictly the namehas auth a its subset space. tag own as of locala t set name of permissions.we space.le can200) compose ( Weset item1we them can item26 bothordering to3 compose obtainitem3 the, tags1the))) the them 1andlower-limit autht2 to=( obtaincerttag and10(purchase the8 the upper-limit auth cert. ordering • 2) Given anare auth dealingEach cert certificate( withu, (u1 currently,b1 has,b2, a..., validityvalid bm),2 d,certificates, tfield) and specifying a so we the ignore1 condi- ∗ A (u, a) is the(u, name a) is that the is name defined that by is defined this certificate by this certificatearbitrary( namesu3,u5, in0,tSince a2() straightforwardu3,u the5, tag0,t2t)3 isIf more way we can restricted[10], derive [24] than the( autht2range,( weself cert(u can,u1 le,(2 a,u,150) con-, 1( uu,t121,,b () d,.)) Composing tset)andNow,fromspecifies item1 compose thewe item2present givenCthe6 with typeC)))8 some. Cwithof4 data;yields examplesC 2possibilitiestoC11 obtain= for better arealpha, understand- numeric, set of names = ∗,• constructed• from principalsinterpretationThese of the namethe firsttions validity spaces tag under isThese a can filed subset which be name ofsay linked of the the that the spaces certificates certificate using interpretation anSincecan auth extended the be is tagtag in linked valid this ist names.is more (typicallymoreusing discussion. 
restrictive Therestricted extended range We than than names. of t( anotheru,, we n, The can d, TI conclude if( thet1,t2)) (,u, where n, d, TITI(tis1,t the2)) tag, where intersectionTI is the tag intersection • N UA and and name cert (u1,b1, (uAuth2,c5)1 certs,c2,Finally ...,allow cl))clude5) a wewe principal composecanFinally from composeC to14 we3C authorizethat composesetuwith5 ofshould certificates,C theC members betowith allowed obtain whereC2 C accessto∗u is obtain= the to t owner3 Ci.e. u=5 of∗ theing resource ofbinary, the tags. timet then and date. using zero or more name identifiers of the second tag. Authorization• being requestedfrom C can that also u beshould12 be allowed1312 access tooperator 13t i.e.14 uGiven(self shouldC i.e.9,u=( the this4 14,be1u,t set1operator access,1 a,)If of u we3 permissions) specification, can i.e. derive the set authorized the of should auth permissions cert we by(u, permit u authorized1, d, tu)5from by the given theminterpretation tonow obtaindates). describe of theFor namesinterpretation auth the the isformat cert purposesinterpretation a group(u, of( ofu the of names,c14 principalsthiscertificates.,c of, paper,theis ...,5 a firstc group,b who we tagassume of are is principals a its subset that of we who the interpretation are3 its5 this certificatethis states certificate that all states members that all of members the name ofn theof name a namen(self to,u specifiedshould52, 0,t1(2self) be2. operations Notice allowed,u5,l0,t that2we2 to) on.purchasethe Noticesay a intersection resourcethat thatuitem1 thecan and2 intersectionworth ofaccess tags3) 1002)ttComposing.1. We of tags composet1C successivelyC with C withtouC obtain, C andC C = t set of authorization tags • , used to• represent sets ofencoded as a tag. We will respect the requestallowed if the requestedto purchase item2 worth 100. 
bothaccess the tags to t3 =(setbothtag of6 (purchase thecertificates,Now,6 tagsTag we(100tag present where(item2ftp1 xyzsome))?is3.com the examples1 ownerabc)) 10says for of2 better the that resource subject understanding canthen do • ,members. ..., b ), d,areA Wet) name dealing interpretmembers. cert with in auth SACLof currently the Wetag second is interpretas of a the set valid tag. formof auth permissions.Authorization certificates,(u, tag a, as n) awhere, set so We being ofwe permissions. ignore requested can We also be • T are also membersare also of members the name of(u, the a) name. m(u, a). to possiblyand redelegatet2 yieldsForandt thisa2 moretsince2 authorization.yields detailedt2 ist2 strictlyWesince exposition nowt a2 subsetis present strictly of of SACL, ant a1 subset exampleyields we(self ofrefertC SACL1,u the=(, 1,t specificationself).,u Composing, 1,t which) C with C yields C = authorizations auth tag is moresay that restrictive an auth than tagsay isthe that more tagencoded an permitted restrictive auth asFor tag a tag.a is for thanmore more We that another willdetailed restrictive respect ifexposition the the than request another of SACL, if the if the requestedwe refer the122 we sayof1 the that 3tags.ftpu1to1can the access host6 xyzt. .com4 as user11abc 3) Given auththecerts validity(u, filed u1, 1 of,t1 the) interestedand certificates(u1,n,d,t reader in2) this to [27]. discussion. We principal. 
principal u is the issuer of this certificatewill be used asIf we a running can4) derive1)We( exampleselfWe compose theIf,u we auth compose4, 1 forcan,tWe certƒ1Cƒ) derivethe7 now(Tagu,withC restTag3 u (1present thetag,with d,( oftagC auth( tftp5) thisfrom( toxyz.comhttpC an cert1 example obtain the httpto(u, givenabc u://obtain1,))abcC d, SACL 13says t.)comfromC =that8 specification/xyz the= subject)) givengives can whichsubject do ftp weinterpretation can composenow• ofEach describe theinterpretation them certificate first the tagauth to format isobtaininterested hastag a of subset aof is the validity the themore first of reader certificates. auth the restrictive tag field interpretation isto cert a[27]. specifying subset than of the the tag interpretation condi- permitted for that • Every principal in SACL has its own local name space. (u, a) is the name that is defined bysection. this certificate Considerset of certificates, an SACL3)(u Composing where,uset specification(u1, of,0 a,,t certificates,u(is)u2 the,bC)) owner,andsuccessively wherewhich composeof theu con-is resource the withC owner8 withCt then, ofC 2 theandto resource obtainC t then The basicof units the second of• antions tag.SACL under Authorizationof specification the which secondprincipal. the tag. being are certificate Authorizationname requested certs is valid can being (typically also requested be range can of also be 3 5 2will be6 to used the host permission as a xyz.com running to as access 3user example 1 abc the web for2 page the rest at the of given this These name spaces can be linked using extended names. The (u, n, d, TI (t1A,t2 name)), where cert inTI SACLis the is tag of intersectionthe form (u, a, n) where, C and auth certs. and 2.2 2.2Modelling Modelling basic basicsists elements elements of the inwe following inLP say LP that5)u set1 Finallycan ofyieldsweC access currently9 say we=(Csection.12 thatut compose.1,=( a,ƒ validuƒ u1self3canTag) Consider,uCURI. 
certificates:12 access 3(,tagwith1 To,t 1( givethttp) an.C13 accessSACLhttp://abc.com/xyzto obtain to specification theC entire14 = tree)) gives, we which can subject use con- the the interpretation of names is a group of principals who are its operatorencoded i.e.asdates). a the tag. set Weencoded For of will the permissions respect purposes asThe a tag. the basic We requestofauthorized willunits this respect paper, if of the an by requested wethe SACL requestassume specification if that the we requested are name certs C thisprincipal certificateu is states the issuer that all of members this certificateC1 of=( theu name1,a,un2We), C now2 =( present4)u2)(2self, b,WeWe u,u3 anWe)5 compose,, 0 composeCexample,tsists now3 2=(). Notice present ofu1C SACLpermission, theC7 a,tag6 that(withu followinganwith(1 specificationtag, the a, example( bCtohttp intersection))C5 ,access1 to( setto SACLprefix obtain which of obtainthecurrently of http specificationweb tagsC://C 13 page10tabc1 = valid. com= at whichthe/ certificates:xyz given/))) URI. bothauth the tag tags is•are more• dealing restrictiveauth withtagand is than currently moreauthIn theIn certs restrictive this this tag. valid section, section, permitted certificates, than wewe theshow forshow tagthat sohow how permitted we the the ignore basic basic forelements elements that of of SACL SACL ∗ members. We interpret auth tag as a set of permissions. We Name certs alloware a( alsou, principal a) membersis the name to of bindthe that name is princi- defined(u, a).C by4 =( thisu1 certificate, a, u4), C5 =(u3and,c,u(ut(5self),u,yieldsC,u6, 0=(2,t,t1)self,tsince1),.( ComposinguTagt1,ais)(, strictlytag1,t1()pkpfsC, 6 awith subset//Cftp4 of.yieldsabct .netC/11xyz=/spki.txt ( set • principal. • principal. 
We now present an example SACL specification which will be used as a running example for the rest of this section. Consider an SACL specification which consists of the following set of currently valid certificates:
C1 = (u1, a, u2), C2 = (u2, b, u3), C3 = (u1, a, (u1, a, b)), C4 = (u1, a, u4), C5 = (u3, c, u5), C6 = (self, (u1, a), 1, t1), C7 = (u3, (u3, c), 0, t2), where
t1 = (tag (purchase (* range le 200) (* set item1 item2 item3))) and
t2 = (tag (purchase (* range le 150) (* set item1 item2))).

Given this access specification, should we permit u5 access to t3 = (tag (purchase 100 item2))?

1) We compose C3 with C1 to obtain C8 = (u1, a, (u2, b)), and compose C8 with C2 to obtain C9 = (u1, a, u3)
2) We compose C6 with C1 to obtain C10 = (self, u2, 1, t1). Composing C6 with C4 yields C11 = (self, u4, 1, t1)
3) Composing C6 successively with C3, C1 and C2 yields C12 = (self, u3, 1, t1)
4) We compose C7 with C5 to obtain C13 = (u3, u5, 0, t2)
5) Finally, we compose C12 with C13 to obtain C14 = (self, u5, 0, t2). Notice that the intersection of the tags t1 and t2 yields t2, since t2 is strictly a subset of t1

Since the tag t3 is more restrictive than t2, we can conclude from C14 that u5 should be allowed access to t3, i.e., u5 should be allowed to purchase item2 worth 100.

For a more detailed exposition of SACL, we refer the interested reader to [27].
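As an illustration of step 5, the tag intersection TI(t1, t2) for the purchase-shaped tags of the running example can be sketched in Python (our own encoding as nested tuples, not part of SACL): the range part intersects by taking the smaller bound, and the set parts intersect set-wise.

```python
def tag_intersect(t1, t2):
    """Intersect two tags of the shape
    ('tag', (op, ('*','range','le',N), ('*','set', item, ...))).
    A sketch covering only the tag shape of the running example."""
    _, (op1, r1, s1) = t1
    _, (op2, r2, s2) = t2
    if op1 != op2:
        return None                       # different operations do not combine
    limit = min(r1[3], r2[3])             # the tighter upper bound wins
    items = tuple(sorted(set(s1[2:]) & set(s2[2:])))
    return ('tag', (op1, ('*', 'range', 'le', limit), ('*', 'set') + items))

t1 = ('tag', ('purchase', ('*', 'range', 'le', 200),
              ('*', 'set', 'item1', 'item2', 'item3')))
t2 = ('tag', ('purchase', ('*', 'range', 'le', 150),
              ('*', 'set', 'item1', 'item2')))

# t2 is strictly a subset of t1, so the intersection is t2 itself
assert tag_intersect(t1, t2) == t2
```

This mirrors why composing C12 with C13 carries the tag t2 into C14.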
2.2 Modelling basic elements in LP

In this section, we show how the basic elements of SACL are modelled using Horn clauses.
• A principal u is encoded using the unary predicate isPrincipal(u)
• A name identifier a is encoded using the unary predicate isNameId(a)
• Names are encoded using lists: [u] corresponds to the principal u, [u, a] denotes the local name (u, a), and [u, a1, a2, ..., an] corresponds to the extended name (u, a1, a2, ..., an)
• Authorization tags are encoded as lists: [tag, [ftp, ftp://abc.edu, [*, set, r, w]]] is an example of an authorization tag that says that the holder can read and write from the site ftp://abc.edu
• A name cert (u, a, n) is encoded using the binary predicate name_def([u, a, n], s), where s denotes the cert number and can be used as a succinct reference to the cert
• An acl entry (self, n, d, t) is encoded using the binary predicate acl_entry([self, n, d, t], s), where s denotes the acl entry number
• An auth cert (u, n, d, t) is encoded using the binary predicate auth([u, n, d, t], s), where s denotes the auth cert number

For example, the certificates presented in the running example correspond to the following facts:
name_def([u1, a, [u2]], [1]), name_def([u2, b, [u3]], [2]), name_def([u1, a, [u1, a, b]], [3]), name_def([u1, a, [u4]], [4]), name_def([u3, c, [u5]], [5]), acl_entry([self, [u1, a], 1, t1], [6]) and auth([u3, [u3, c], 0, t2], [7]), where
t1 = [tag, [purchase, [*, range, le, 200], [*, set, item1, item2, item3]]] and
t2 = [tag, [purchase, [*, range, le, 150], [*, set, item1, item2]]].

The above predicates form the database used for making access decisions.
2.3 Name closure computation

In this section, we describe rules for computing the name closure. We have a binary predicate name_closure([u, a, u1], pf) which represents that the name cert (u, a, u1) is in the closure; pf denotes the sequence of certificates whose composition leads to (u, a, u1).

name_closure([K,A,K1],Pf) :-
    name_def([K,A,[K1]],S),Pf=S.
name_closure([K,A,K1],Pf) :-
    name_def([K,A,N],S),length(N,L),L>1,
    nth1(1,N,K2),tail(N,A2),
    name_closure([K2,A2,K1],Pf1),
    append(S,Pf1,Pf).
name_closure([K,[A|T],K1],Pf) :-
    length([A|T],L),
    (L==1->(name_closure([K,A,K1],Pf1),Pf=Pf1);
     L>1->(name_closure([K,A,K2],Pf1),
           name_closure([K2,T,K1],Pf2),
           append(Pf1,Pf2,Pf))).

The first rule above states that if (u, a, u1) is a name cert then we can conclude that (u, a, u1) is in the closure. The second rule states that if (u, a, u1 n) is a name cert and if (u1, n, u′) is in the closure then (u, a, u′) is in the closure. The third rule tells us how to compute the members of an extended name by successively computing the members of each local name appearing in it.

Given the database in the previous section and the rules above, we can make the following conclusions:
• name_closure([u1, a, u2], [1]) by rule 1 using name_def([u1, a, [u2]], [1])
• name_closure([u2, b, u3], [2]) by rule 1 using name_def([u2, b, [u3]], [2])
• name_closure([u1, [a, b], u3], [1, 2]) by rule 3 using the previous two conclusions
• name_closure([u1, a, u3], [3, 1, 2]) by rule 2 using name_def([u1, a, [u1, a, b]], [3]) and the previous conclusion
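The closure computed by these rules can also be obtained iteratively. The following Python sketch (our own encoding of the name_def facts; it computes memberships rather than proof sequences) runs the running example's database to a fixpoint, expanding an extended name one local name at a time, as in rule 3.

```python
# name_def facts of the running example: (issuer, identifier, name-as-list)
NAME_DEFS = [
    ('u1', 'a', ['u2']),
    ('u2', 'b', ['u3']),
    ('u1', 'a', ['u1', 'a', 'b']),   # extended name (u1, a, b)
    ('u1', 'a', ['u4']),
    ('u3', 'c', ['u5']),
]

def name_closure(defs):
    """Fixpoint computation of the members of every local name."""
    members = {}                      # (u, a) -> set of member principals
    changed = True
    while changed:
        changed = False
        for issuer, ident, n in defs:
            cur = members.setdefault((issuer, ident), set())
            group = {n[0]}            # a bare principal, or the head of an
            for a in n[1:]:           # extended name expanded stepwise
                group = set().union(
                    *[members.get((p, a), set()) for p in group]
                ) if group else set()
            if not group <= cur:
                cur |= group
                changed = True
    return members

closure = name_closure(NAME_DEFS)
# matches the conclusions above: u2, u3 and u4 are members of (u1, a)
assert closure[('u1', 'a')] == {'u2', 'u3', 'u4'}
```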

The combination of tags can be summarized in a table in which the title of a row gives the structure of the first tag and the title of a column gives the structure of the second tag. We have a ternary predicate tag_combines(t1, t2, t) which denotes that the tags t1 and t2 combine to yield the tag t. We now present the rules used for computing tag_combines.

tag_combines([*],X,X).
tag_combines(X,[*],X).

The above rules state that the result of combining any tag X with (tag (*)) is X.

tag_combines(X,X,X) :- atom(X).
tag_combines(X,[*,set|Y],X) :- atom(X),member(X,Y).
tag_combines([*,set|Y],X,X) :- atom(X),member(X,Y).
tag_combines(X,[*,prefix,Y],X) :- atom(X),
    string_concat(Y,Z,X).
tag_combines([*,prefix,Y],X,X) :- atom(X),
    string_concat(Y,Z,X).
tag_combines(X,[*,range,Y|Z],X) :- atom(X),
    (length(Z,0);range_check(Y,X,Z)).
tag_combines([*,range,Y|Z],X,X) :- atom(X),
    (length(Z,0);range_check(Y,X,Z)).

The above set of rules tells us how to combine tags when one of the tags is a byte-string. The first rule tells us that two byte-strings can be combined iff they are equal. The second and third rules state that a byte-string X intersects with a set Y iff X is a member of Y. The fourth and fifth rules express that a byte-string X combines with a prefix form Y iff Y is a prefix of X. The last two rules say that a byte-string X combines with a range form Z considered under the ordering Y iff X is in the range Z under the ordering Y, using the predicate range_check. For lack of space we do not give the rules corresponding to the predicate range_check.

tag_combines(X,Y,R) :-
    simple_expr(X),simple_expr(Y),
    lists_combines(X,Y,R).
tag_combines(X,[*,set|Y],Z1) :-
    simple_expr(X),set_combines(Y,X,Z),
    append([*,set],Z,Z1).
tag_combines([*,set|Y],X,Z1) :-
    simple_expr(X),set_combines(Y,X,Z),
    append([*,set],Z,Z1).

These rules tell us how to combine tags when one of the tags is a simple-expr. The first rule tells us that for combining two simple expressions X and Y we combine their elements position-wise; the predicate lists_combines does this computation. For lack of space we do not give the rules for lists_combines. The remaining rules state that for combining a simple-expr X with a set Y, we combine X with each element of the set Y and collect the results in a set; we have a predicate set_combines that combines each element of a set with a given element. Note that a simple-expr does not combine with a prefix-expr or a range-expr.
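To illustrate, the byte-string and position-wise cases above can be mimicked in Python over the list encoding of tags. This is a sketch under our own naming: range_check is simplified to the numeric le bound used in the running example, lists_combines becomes a position-wise zip, and the set-vs-set and range-vs-range cases are omitted, as in the excerpt.

```python
def tag_combine(x, y):
    """Combine two tag expressions; None means they do not combine."""
    if x == ['*']:
        return y                          # (*) absorbs into the other tag
    if y == ['*']:
        return x
    if isinstance(x, str) and isinstance(y, str):
        return x if x == y else None      # byte-strings combine iff equal
    for a, b in ((x, y), (y, x)):         # byte-string vs *-forms, both orders
        if isinstance(a, str) and isinstance(b, list) and b[:1] == ['*']:
            if b[1] == 'set':
                return a if a in b[2:] else None
            if b[1] == 'prefix':
                return a if a.startswith(b[2]) else None
            if b[1] == 'range' and b[2] == 'le':    # simplified range_check
                return a if float(a) <= float(b[3]) else None
            return None
    if (isinstance(x, list) and isinstance(y, list)
            and x[:1] != ['*'] and y[:1] != ['*'] and len(x) == len(y)):
        parts = [tag_combine(a, b) for a, b in zip(x, y)]   # position-wise
        return None if None in parts else parts
    return None

# the request t3 = (purchase 100 item2) against the permitted tag t2
t2 = ['purchase', ['*', 'range', 'le', '150'], ['*', 'set', 'item1', 'item2']]
t3 = ['purchase', '100', 'item2']
assert tag_combine(t3, t2) == t3          # the request is within t2
```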
The last two rules say that the byte-string In this section, we describe rules for computing the name pha prefix of X. The last two rules say that the byte-string cert number or transitivelyor transitively obtained obtained (by (by a chain a chain of delegations); of delegations);ph X combines with a range form Z considered under the closure. We have a binary predicate name closure([u, a, u1], X combines with a range form Z considered under the guardsguards us against us against going going around around in loops in loops by using by using the thesame same ordering Y iff X is in the range Z under ordering Y using Authorization tags in SACL are carefully designed to R4pf) : which38 represents that the name cert (u, a, u1) isTowards in the An Executable Declarative Specificationpf of Access Controorderingl Y iff X is in the range Z under ordering Y using authauth certs certs repetitively repetitively and andpf givesgives us theus the proof proof of the of the the predicate range check. For lack of space we do not give support a variety of applications. The structure of auth tags closure and pf denotes the sequence of certificates whose the predicate range check. For lack of space we do not give authorizationauthorization as a as sequence a sequence of certs of certs used used in the in thederivation. derivation. rules corresponding to the predicate range check. is as follows: composition leads to (u, a, u1). rules corresponding to the predicate range check. composition leads to (u, a, u ). grants([K,K1,D,T],Ph,Pf) :- tag_combines(X,Y,R) :- 1 grants([K,K1,D,T],Ph,Pf) :- tag_combines(X,Y,R) :- name_closure([K,A,K1],Pf) :- auth_closure([K,K1,D,T],Pf),nth1(1,Pf,C), simple_expr(X),simple_expr(Y), A tag can either be (tag ( )) or (tag tag-expr). Auth auth_closure([K,K1,D,T],Pf),nth1(1,Pf,C), simple_expr(X),simple_expr(Y), • name_def([K,A,[K1]],S),Pf=S. not(member(C,Ph)). lists_combines(X,Y,R). ∗ not(member(C,Ph)). lists_combines(X,Y,R). 
tag (tag ( )) denotes the set of all permissions and name_closure([K,A,K1],Pf) :- grants([K,K1,D,T],Ph,Pf) :- tag_combines(X,[ ,set|Y],Z1) :- ∗ grants([K,K1,D,T],Ph,Pf) :- tag_combines(X,[ ,set|Y],Z1)* :- must be used with a lot of care in delegations name_def([K,A,N],S),length(N,L),L>1, auth_closure([K,K2,1,T1],Pf1),nth1(1,Pf1,C), simple_expr(X),set_combines(Y,X,Z),* auth_closure([K,K2,1,T1],Pf1),nth1(1,Pf1,C), simple_expr(X),set_combines(Y,X,Z), nth1(1,N,K2),tail(N,A2), not(member(C,Ph)),append(Ph,[C],Ph1), append([ ,set],Z,Z1). A tag-expr can be one of byte-string, simple-expr, not(member(C,Ph)),append(Ph,[C],Ph1), append([ ,set],Z,Z1).* • name_closure([K2,A2,K1],Pf1), grants([K2,K1,D,T2],Ph1,Pf2), tag_combines([* ,set|Y],X,Z1) :- grants([K2,K1,D,T2],Ph1,Pf2), tag_combines([ ,set|Y],X,Z1)* :- set-expr, prefix-expr or range-expr append(S,Pf1,Pf). append(Pf1,Pf2,Pf),tag_combines(T1,T2,T). simple_expr(X),set_combines(Y,X,Z),* append(Pf1,Pf2,Pf),tag_combines(T1,T2,T). simple_expr(X),set_combines(Y,X,Z), A byte-string is any word over an underlying alpha- name_closure([K,[A|T],K1],Pf) :- append([ ,set],Z,Z1). • append([ ,set],Z,Z1).* bet Σ.Abyte-string represents atomic permissions length([A|T],L), TheThe firstThe first rule first rule above rule above looks above looks for looks afor direct a direct for authorization. authorization. a direct Theauthorization. The * (L==1->(name_closure([K,A,K1],Pf1),Pf=Pf1); Thesecond second rule rule says says that that if u ifgives u gives to u 2tothe u authorizationthe authorizationt1 These rules tell us how to combine tags when one of the that can not be further broken down and also form L>1->(name_closure([K,A,K2],Pf1), second rule says that if u gives to u2 the authorization2 t1 These rules tell us how to combine tags when one of the with permission to further delegate it, and if u2 grants t2 tags is a simple-expr. 
The first rule tells that for combining the basis from which complex tags are constructed name_closure([K2,T,K1],Pf2), witht1 permissionwith permission to further to further delegate delegate it, and it, if u and2 grants if u2 tgrants2 tags is a simple-expr. The first rule tells that for combining append(Pf1,Pf2,Pf))). to u1, and if tags t1 and t2 combine to result in t, then two simple expressions X and Y we need to combine their simple-expr is a byte-string followed by zero or more to ut2 ,to and u1, if and tags if t tagsand t1 tandcombine t2 combine to result to result in t ,in then t, then • 1 1 2 two simple expressions X and Y we need to combine their we say that u grants t to u1. Second rule captures the elements position-wise. The predicate lists combines does tag-expr. simple-expr is used to represent permis- wewe say say that thatu ugrants grantst tto tou u1.1. SecondSecond rule captures captures the wayelements position-wise. The predicate lists combines does TheThe first first rule rule above above states states that ifthat(u, a,if (u, u 1a,) is u1 a) name is a name cert way authorization flows. The predicate grants computes the sions over a particular resource wayauthorization authorization flows. flows. The The predicate predicategrants grantscomputes computes the thethis this computation. computation. For For lack lack of space of space we we do donot not give give rules rules certthen then we canwe can conclude conclude that that(u, a, (u, u1 a,) is u in) is the in closure. the closure. The set-expr has the structure set followed by zero 1 authorizationauthorization closure closure of a of given a givengiven set ofset certificates. of of certificates.certificates. for lists combines. Remaining rules state that for combining • second rule states that if (u, a, u n) is a name cert and if for lists combines. 
Remaining rules state that for combining ∗ The second rule states that if (u, 1a, u1 n) is a name cert and From the above rules we can conclude: the simple-expr X with a set Y , we combine X with each or more tag-expr. set-expr is used to group permis- (u , n, u )′is in the closure then (u, a, u )′ is in the closure. FromFrom the above the above rules werules can we conclude: can conclude: the simple-expr X with a set Y , we combine X with each if (1u1 , n, ′u ) is in the closure then (u, a, u′ ) is in the closure. element of the set Y and collect the results in a set. We have sions/resources together for succinct presentation grants([u3,u5, 0,t2], [], [7, 5]) from rule 1 usingelement of the set Y and collect the results in a set. We have TheThe third rulerule tellstells us us how how to to compute compute thethe members members of of an an •grants([u3,u5, 0,t2], [], [7, 5]) from rule 1 using a predicate set combines that combines each element of a prefix-expr has the structure prefix followed by • auth closure([u3,u5, 0,t2], [7, 5]) a predicate set combines that combines each element of a • ∗ extendedextended name name by successively computing the members members of of auth closure([u3,u5, 0,t2], [7, 5]) set with a given element. Note that a simple-expr does not a byte-string. prefix-expr represents the set of all grants([self ,u5, 0,t2], [], [6, 3, 1, 2, 7, 5]) from rule 2set with a given element. Note that a simple-expr does not eacheach local name name appearingappearing inin it. it. •grants([self ,u5, 0,t2], [], [6, 3, 1, 2, 7, 5]) from rule 2 combine with a prefix-expr or a range-expr. words which have the byte-string as their prefix. This • using auth closure([self ,u3, 1,t1], [6, 3, 1, 2]) andcombine with a prefix-expr or a range-expr. 
Given the database database in in the the previous previous section section and and the rulesthe using auth closure([self ,u3, 1,t1], [6, 3, 1, 2]) and can be used to succinctly represent access to hierar- tag_combines([*,set|X],[*,set|Y],Z1) :- above, we can make the following conclusions: the previous conclusion since, tags t1 and t2 combinetag_combines([*,set|X],[*,set|Y],Z1) :- rules above, we can make the following conclusions: the previous conclusion since, tags t1 and t2 combine sets_combines(X,Y,Z),append([*,set],Z,Z1). chically arranged resources like directory structures to yield t2 sets_combines(X,Y,Z),append([*,set],Z,Z1). to yield t2 tag_combines([*,set|X],[*,prefix,Y],Z1) :- range-expr has the structure range followed by ƒ name closure([u1, a, u2], [1]) by rule 1 using tag_combines([*,set|X],[*,prefix,Y],Z1) :- • ƒ •name_closure([u1, a, u2], [1]) by rule 1 using name_def set_combines(X,[*,prefix,Y],Z), ∗ set_combines(X,[*,prefix,Y],Z), the ordering, the lower-limit and the upper-limit. ([uname, a, [u def]], [1])([u1, a, [u2]], [1]) append([*,set],Z,Z1). 1 2 2.5 Tag intersection algorithm 4 append([*,set],Z,Z1). 2.52.5 Tag Tag intersection intersection algorithm algorithm 4 tag_combines([*,prefix,Y],[*,set|X],Z1) :- ordering specifies the type of data; possibilities are ƒƒ namename closure([closureu ([, ub,2 ,u b,], u 3 [2])], [2]) by ruleby 1rule using 1name_def using tag_combines([*,prefix,Y],[*,set|X],Z1) :- • 2 3 Algorithm for combining tags depends on the structure of set_combines(X,[*,prefix,Y],Z), alpha, numeric, binary, time and date. 
namename closuredef ([u2([,u b,1[,u[a,3]] b,][2]),u3], [1, 2]) by rule 3 using row gives the structure of the first tag and title of column set_combines(X,[*,prefix,Y],Z), •name([u , b,closure [u ]], [2])([u , [a, b],u ], [1, 2]) by rule 3 using rowAlgorithm gives theAlgorithm for structure combining for of combining the tags first depends tag tags and depends on title the of structureon column the structure of append([*,set],Z,Z1). • the2 previous3 two1 conclusions3 givesthe the tags structure being combined, of the second and istag. defined inductively over the append([*,set],Z,Z1). ƒ the previous two conclusions givesthe the tags structure being combined, of the second and is tag. defined inductively over the tag_combines([*,set|X],[*,range|Y],Z1) :- ƒ name_closure([u1, [a, b], u3], [1, 2]) by rule 3 using the ofstructure the tags of the being input combined, tags. Table and 1 describes is defined the structure inductively oftag_combines([ *,set|X],[*,range|Y],Z1) :- name closure([u1, a, u3], [3, 1, 2]) by rule 2 using We have a ternary predicate tag combines(t1,t2,t) set_combines(X,[*,range|Y],Z), •name closure([u , a, u ], [3, 1, 2]) by rule 2 using structureWe have of a the ternary input tags. predicate Table 1tag describescombines the( structuret ,t ,t) of 4 set_combines(X,[*,range|Y],Z), • previous two conclusions1 3 overthe tagthe resulting structure due of tothe intersection input tags. of Table tags, where1 describes2 title ofthe append([*,set],Z,Z1). name def ([u1, a, [u1, a, b]], [3]) and the previous thewhich tag denotesresulting that due to tags intersectiont1 and t of2 tags,combine where to title yield of append([*,set],Z,Z1). ƒƒ namename_closuredef ([u1,([ a,u [,u a1,, u a,], b ]][3,, [3]) 1, 2])and by rule the 2 previoususing name_whichstructure denotes that of the tags tag tresulting1 and t 2 duecombine to intersection to yield of tags, conclusionname closure1 ([u31, [a, b],u3], [1, 2]) by rule 3 usingtag rowt. 
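The three closure rules can also be read as a forward-chaining fixpoint over the name certs. The following Python sketch is our own illustration (the paper's implementation is in Prolog); it encodes the example database of name_def facts and derives the same membership facts together with their proofs.

```python
# Name certs from the running example: (issuer, name, definition, cert_no).
# A definition is a single key [k1] or an extended name [k2, n1, ...].
NAME_DEFS = [
    ("u1", "a", ["u2"], 1),
    ("u2", "b", ["u3"], 2),
    ("u1", "a", ["u1", "a", "b"], 3),
]

def name_closure(defs):
    """Fixpoint computation of (issuer, name) -> member facts, each paired
    with one proof: the sequence of cert numbers used in the derivation."""
    facts = {}  # (issuer, name, member) -> proof

    def ext_members(key, names):
        # Members of the extended name (key, names), using current facts
        # (rule 3: resolve each local name in turn).
        results = {key: []}
        for n in names:
            nxt = {}
            for k, pf in results.items():
                for (i, nm, m), pf2 in list(facts.items()):
                    if (i, nm) == (k, n):
                        nxt.setdefault(m, pf + pf2)
            results = nxt
        return results

    changed = True
    while changed:
        changed = False
        for issuer, name, definition, cert in defs:
            if len(definition) == 1:  # rule 1: direct definition
                cand = {definition[0]: [cert]}
            else:                     # rule 2: definition by extended name
                k2, rest = definition[0], definition[1:]
                cand = {m: [cert] + pf for m, pf in ext_members(k2, rest).items()}
            for m, pf in cand.items():
                if (issuer, name, m) not in facts:
                    facts[(issuer, name, m)] = pf
                    changed = True
    return facts

facts = name_closure(NAME_DEFS)
print(facts[("u1", "a", "u2")])  # [1]
print(facts[("u1", "a", "u3")])  # [3, 1, 2]
```

The fixpoint formulation sidesteps the non-termination that a naive depth-first reading of the recursive name_def([u1, a, [u1, a, b]]) cert would cause.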
2.4 Modeling delegation and authorization closure

We have a binary predicate auth_closure([u, u1, d, t], pf) to replace the name appearing in ACL entries and auth certs by its members.

    auth_closure([K,K1,D,T],Pf) :-
        (acl_entry([K,N,D,T],S);auth([K,N,D,T],S)),
        length(N,L),nth1(1,N,K2),(L==1->(K1=K2,Pf=S);
         L>1->(tail(N,A),name_closure([K2,A,K1],[],Pf1),
               append(S,Pf1,Pf))).

What the above rule says is that if there is an acl entry or an auth cert (u, n, d, t) and if u1 is a member of n, then we can conclude that (u, u1, d, t) is a valid authorization. By applying the above rule we can conclude:

• auth_closure([self, u3, 1, t1], [6, 3, 1, 2]) using acl_entry([self, [u1, a], 1, t1], [6]) and name_closure([u1, a, u3], [3, 1, 2])
• auth_closure([u3, u5, 0, t2], [7, 5]) using auth([u3, [u3, c], 0, t2], [7]) and name_closure([u3, c, u5], [5])
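The name-replacement step can be sketched imperatively as well. The Python fragment below is our illustration, not the paper's code; it takes name-closure facts as computed above, resolves the single local name in each ACL entry or auth cert (extended names would simply chain lookups), and reproduces the two conclusions.

```python
# Name-closure facts assumed already computed for the example database:
# (issuer, name, member) -> proof, as a list of cert numbers.
NAME_FACTS = {
    ("u1", "a", "u3"): [3, 1, 2],
    ("u3", "c", "u5"): [5],
}

# ACL entries and auth certs: (issuer, subject-name, delegation, tag, cert_no).
# A one-element subject names a key directly; two elements name a local name.
CERTS = [
    ("self", ["u1", "a"], 1, "t1", 6),
    ("u3",   ["u3", "c"], 0, "t2", 7),
]

def auth_closure(certs, name_facts):
    """Replace the name in each ACL entry / auth cert by its members, giving
    valid authorizations (issuer, subject-key, delegation, tag) -> proof."""
    out = {}
    for issuer, subject, d, tag, cert in certs:
        if len(subject) == 1:            # subject is already a key
            out[(issuer, subject[0], d, tag)] = [cert]
        else:                            # resolve a single local name; an
            k2, name = subject           # extended name would chain lookups
            for (i, nm, member), pf in name_facts.items():
                if (i, nm) == (k2, name):
                    out[(issuer, member, d, tag)] = [cert] + pf
    return out

closure = auth_closure(CERTS, NAME_FACTS)
print(closure[("self", "u3", 1, "t1")])  # [6, 3, 1, 2]
print(closure[("u3", "u5", 0, "t2")])    # [7, 5]
```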
We have a ternary predicate grants([u, u1, d, t], ph, pf) to compute the authorizations of principals, either direct or transitively obtained (by a chain of delegations); ph guards us against going around in loops by using the same auth certs repeatedly, and pf gives us the proof of the authorization as the sequence of certs used in the derivation.

    grants([K,K1,D,T],Ph,Pf) :-
        auth_closure([K,K1,D,T],Pf),nth1(1,Pf,C),
        not(member(C,Ph)).
    grants([K,K1,D,T],Ph,Pf) :-
        auth_closure([K,K2,1,T1],Pf1),nth1(1,Pf1,C),
        not(member(C,Ph)),append(Ph,[C],Ph1),
        grants([K2,K1,D,T2],Ph1,Pf2),
        append(Pf1,Pf2,Pf),tag_combines(T1,T2,T).

The first rule above looks for a direct authorization. The second rule says that if u gives to u2 the authorization t1 with permission to delegate it further, and if u2 grants t2 to u1, and if tags t1 and t2 combine to result in t, then we say that u grants t to u1. The second rule captures the way authorization flows. The predicate grants computes the authorization closure of a given set of certificates.

From the above rules we can conclude:

• grants([u3, u5, 0, t2], [], [7, 5]) from rule 1 using auth_closure([u3, u5, 0, t2], [7, 5])
• grants([self, u5, 0, t2], [], [6, 3, 1, 2, 7, 5]) from rule 2 using auth_closure([self, u3, 1, t1], [6, 3, 1, 2]) and the previous conclusion, since tags t1 and t2 combine to yield t2
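The delegation chain above can be traced with a small Python sketch (again our illustration, not the paper's Prolog). Tag combination is stubbed out to return the second tag, which is exactly what happens in the running example where t1 subsumes t2; the full intersection algorithm is the subject of the next subsection.

```python
# Authorization facts as produced by the closure step:
# (issuer, subject, delegation, tag) -> proof (cert numbers).
AUTHS = {
    ("self", "u3", 1, "t1"): [6, 3, 1, 2],
    ("u3",   "u5", 0, "t2"): [7, 5],
}

def tag_combines(t1, t2):
    """Stand-in for the tag-intersection algorithm: in the running example
    t1 subsumes t2, so the combination yields t2."""
    return t2

def grants(auths, issuer, ph):
    """All (subject, delegation, tag, proof) granted by `issuer`, directly
    or through delegation chains; `ph` blocks reusing a leading cert, which
    guards against going around in loops."""
    results = []
    for (u, u1, d, t), pf in auths.items():
        if u == issuer and pf[0] not in ph:
            results.append((u1, d, t, pf))           # rule 1: direct grant
            if d == 1:                               # rule 2: may delegate on
                for (u2, d2, t2, pf2) in grants(auths, u1, ph + [pf[0]]):
                    results.append((u2, d2, tag_combines(t, t2), pf + pf2))
    return results

for subject, d, tag, proof in grants(AUTHS, "self", []):
    print(subject, d, tag, proof)
```

Running this prints the direct grant to u3 and the transitive grant ("u5", 0, "t2", [6, 3, 1, 2, 7, 5]), matching the conclusions above.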
2.5 Tag intersection algorithm

The algorithm for combining tags depends on the structure of the tags being combined, and is defined inductively over the structure of the input tags. Table 1 describes the structure of the tag resulting from the intersection of two tags, where the title of a row gives the structure of the first tag and the title of a column gives the structure of the second tag.

              string        simple        set           prefix        range
    string    string/FAIL   FAIL          string/FAIL   string/FAIL   string/FAIL
    simple    FAIL          simple/FAIL   set           FAIL          FAIL
    set       string/FAIL   set           set           set           set
    prefix    string/FAIL   FAIL          set           prefix        FAIL
    range     string/FAIL   FAIL          set           FAIL          range

    TABLE 1: Table showing the structure of the tag resulting from the intersection of two tags

We have a ternary predicate tag_combines(t1, t2, t), which denotes that tags t1 and t2 combine to yield tag t. We now present the rules used for computing tag_combines.

    tag_combines([*],X,X).
    tag_combines(X,[*],X).

The above rules state that the result of combining any tag X with (tag (*)) is X.

    tag_combines(X,X,X) :- atom(X).
    tag_combines(X,[*,set|Y],X) :- atom(X),member(X,Y).
    tag_combines([*,set|Y],X,X) :- atom(X),member(X,Y).
    tag_combines(X,[*,prefix,Y],X) :- atom(X),
        string_concat(Y,Z,X).
    tag_combines([*,prefix,Y],X,X) :- atom(X),
        string_concat(Y,Z,X).
    tag_combines(X,[*,range,Y|Z],X) :- atom(X),
        (length(Z,0);
         range_check(Y,X,Z)).
    tag_combines([*,range,Y|Z],X,X) :- atom(X),
        (length(Z,0);
         range_check(Y,X,Z)).

The above set of rules tells us how to combine tags when one of the tags is a byte-string. The first rule says that two byte-strings can be combined iff they are equal. The second and third rules state that a byte-string X intersects with a set Y iff X is a member of Y. The fourth and fifth rules express that a byte-string X combines with a prefix form Y iff Y is a prefix of X. The last two rules say that the byte-string X combines with a range form Z considered under the ordering Y iff X is in the range Z under ordering Y, using the predicate range_check. For lack of space we do not give the rules corresponding to the predicate range_check.

    tag_combines(X,Y,R) :-
        simple_expr(X),simple_expr(Y),
        lists_combines(X,Y,R).
    tag_combines(X,[*,set|Y],Z1) :-
        simple_expr(X),set_combines(Y,X,Z),
        append([*,set],Z,Z1).
    tag_combines([*,set|Y],X,Z1) :-
        simple_expr(X),set_combines(Y,X,Z),
        append([*,set],Z,Z1).

These rules tell us how to combine tags when one of the tags is a simple-expr. The first rule says that for combining two simple expressions X and Y we need to combine their elements position-wise. The predicate lists_combines does this computation. For lack of space we do not give the rules for lists_combines. The remaining rules state that for combining a simple-expr X with a set Y, we combine X with each element of the set Y and collect the results in a set. We have a predicate set_combines that combines each element of a set with a given element. Note that a simple-expr does not combine with a prefix-expr or a range-expr.

    tag_combines([*,set|X],[*,set|Y],Z1) :-
        sets_combines(X,Y,Z),append([*,set],Z,Z1).
    tag_combines([*,set|X],[*,prefix,Y],Z1) :-
        set_combines(X,[*,prefix,Y],Z),
        append([*,set],Z,Z1).
    tag_combines([*,prefix,Y],[*,set|X],Z1) :-
        set_combines(X,[*,prefix,Y],Z),
        append([*,set],Z,Z1).
    tag_combines([*,set|X],[*,range|Y],Z1) :-
        set_combines(X,[*,range|Y],Z),
        append([*,set],Z,Z1).
    tag_combines([*,range|Y],[*,set|X],Z1) :-
        set_combines(X,[*,range|Y],Z),
        append([*,set],Z,Z1).

These rules tell us how to combine tags when one of the tags is a set-expr. The first rule says that for combining sets X and Y, we should combine each element of X with every element of Y and collect the result in a set. The predicate sets_combines does this computation. The remaining rules express that for combining a prefix-expr or a range-expr with a set X, we need to combine the expression with each element of the set and collect the result in a set.
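A fragment of the intersection algorithm can be rendered in Python for readers who do not follow Prolog. This sketch is our own; it covers only the (tag (*)), byte-string, set and prefix cases (range and simple-expr are omitted), and the example permission strings are illustrative, not from the paper.

```python
def tag_combines(t1, t2):
    """Intersection of two tags for the (*), byte-string, set and prefix
    cases; returns None where the corresponding Prolog rule would fail.
    Cases not sketched here (range, simple-expr) also return None."""
    star = ["*"]
    if t1 == star:
        return t2          # (tag (*)) combines with any tag to give that tag
    if t2 == star:
        return t1
    if isinstance(t1, str) and isinstance(t2, str):
        return t1 if t1 == t2 else None      # byte-strings combine iff equal
    for a, b in ((t1, t2), (t2, t1)):        # byte-string vs set / prefix
        if isinstance(a, str) and isinstance(b, list):
            if b[:2] == ["*", "set"]:
                return a if a in b[2:] else None
            if b[:2] == ["*", "prefix"]:
                return a if a.startswith(b[2]) else None
    # set vs set: combine pairwise, collect the results in a set-expr
    if t1[:2] == ["*", "set"] and t2[:2] == ["*", "set"]:
        combined = [r for x in t1[2:] for y in t2[2:]
                    if (r := tag_combines(x, y)) is not None]
        return ["*", "set"] + combined if combined else None
    return None

print(tag_combines("read", ["*", "set", "read", "write"]))   # read
print(tag_combines(["*", "set", "read", "write"],
                   ["*", "set", "write", "delete"]))         # ['*', 'set', 'write']
```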
    tag_combines([*,prefix,X],[*,prefix,Y],[*,prefix,Z]) :-
        string_length(X,L),string_length(Y,M),
        (L>M->(string_concat(Y,A,X),Z=X);
         M>L->(string_concat(X,B,Y),Z=Y);
         X==Y->Z=X).
    tag_combines([*,range,X|A],[*,range,X|B],Z1) :-
        range_combines(X,A,B,C),
        append([*,range,X],C,Z1).

The first rule above says that the result of combining two prefix expressions X and Y is the longer of them, provided that the shorter expression is a prefix of the longer expression. The second rule expresses that we can combine two range forms iff they are considered under the same ordering X. The predicate range_combines(X, A, B, C) computes the range C which is the result of combining the ranges A and B when considered under the ordering X. For lack of space we do not give the rules for range_combines.

In the running example, the query access_request(u5, t3, X) returns yes with X = [6, 3, 1, 2, 7, 5]. Therefore we conclude that u5 must be granted access to t3.

2.7 Prototype implementation

We developed a prototype tool for specifying and verifying authorizations in SACL based on the ideas presented above. We used SWI Prolog for our implementation. We hard-coded the name and auth certs into the program. We implemented an initialization phase which checks currently valid certs and adds them to the database together with a cert number. We also implemented our algorithm for tag intersection.

Our tool can be used for checking various properties, including:

• Is k′ a member of (k, a)? This can be answered by querying name_closure([k, a, k′], [], Pf); if it is true, then our tool returns a proof of this fact in Pf. In the second component we can specify a list of certificates not to be included in the derivation.
• What are the permissions authorized by both the tags t1 and t2? This can be answered by querying tag_combines(t1, t2, T); it returns fail if there are no common permissions, or returns the common permissions in T.
• What permissions does principal k have? We implemented a predicate
For lack of of space space we we do do not not give give rules rules for theThe runningThe processprocess example, ofof tagtag is combination,combination, illustrated in forfor Figure thethe tagstags 1. t 1t1and andt 2t2 in • authorization closure of a given set of certificates. lists_combinesfor lists combines . Remaining. Remaining rules rules state state that that for for combining combining inthe theWe running running do not example, example, see any is is illustrated meaningful illustrated in in Figuresituations Fig. 1. 1. in practice implementedpermissions(k a) predicatewhich computes the From the above rules we can conclude: the simple-exprsimple-expr X with aa set setY Y,, we we combine combineX X withwith each each whereWeWe it dodo will not be see see necessary any any meaningful meaningful to intersect situations situations a prefix in in expressionpractice practice permissionsset of permissions(k) which that k computeshas as the element of the set Y and collect the results in a set. We have unionset of ofpermissionsdelegable thatpermissionsk has(k) asand the grants([u3,u5, 0,t2], [], [7, 5]) from rule 1 using element of the set Y and collect the results in a set. We wherewith a it it range will be beexpression. necessary necessary So to to intersect we intersect make a thea prefix prefix simplifying expression expression as- • 3 5 2 a predicate set combines that combines each element of a auth closure([u ,u , 0,t ], [7, 5]) havea predicate a predicateset combines set_combinesthat combines that combines each elementeach element of a withsumption a a rangerange that expression.expression. a range expression SoSo wewe make make and athethe prefix simplifying simplifying expression as- unionuse only permissions of delegable(k) permissions(k) and auth closure([u3,u5, 0,t2], [7, 5]) set with a given element. Note that a simple-expr does not grants([self ,u , 0,t ], [], [6, 3, 1, 2, 7, 5]) from rule 2 ofset a with set witha given a given element. element. 
Note Note that athat simple-expr a simple-expr does notdoes assumptionfail to combine. that Oura range tag expression combination and algorithm a prefix isexpression sound. 5 Isuse principalonly permissionsk allowed( tok) access t? We simply query • grants([self ,u5, 0,t2], [], [6, 3, 1, 2, 7, 5]) from rule 2 sumption that a range expression and a prefix expression • • combine with a prefix-expr or a range-expr. fail to combine. Our tag combination algorithm is sound. accessIs principalrequestk allowed(k, t, Pf ) toand access wet get? We the simply proof ofquery au- using auth closure([self ,u3, 1,t1], [6, 3, 1, 2]) and not combine with a prefix-expr or* a range-expr.string simple fail to combine.set Ourprefix tag combinationrange algorithm is sound. • the previous conclusion since, tags t and t combine tag_combines([*,set|X],[* *,set|Y],Z1)* string :- simple set prefix range accessthorizationrequest in Pf(k,if t, thisPf ) requestand we is get permitted. the proof of au- the previous conclusion since, tags t1 and t2 combine * * 2.6 Access decisions sets_combines(X,Y,Z),append([string string string/FAIL*,set],Z,Z1). FAIL 2.6 string/FAILAccess decisionsstring/FAIL string/FAIL to yield t2 thorization in Pf if this request is permitted. tag_combines([*,set|X],[simple*,prefix,Y],Z1)simple FAIL :- simple/FAILWe2.6 have Accessset a ternary decisionsFAIL predicate accessFAIL request(u, t, pf ) which We hard coded the certs corresponding to running ex- set_combines(X,[ ,prefix,Y],Z), set_combines(X,[set *,prefix,Y],Z),set string/FAIL set Weset have a ternaryset predicateset access_request (u, t, pf) ample in the program and queried the properties mentioned append([*,set],Z,Z1). 
expressesWe have a that ternary principal predicateu isaccess authorizedrequest to( accessu, t, pf )t whichand a We hard coded the certs corresponding to running ex- 2.5 Tag intersection algorithm * prefix prefix string/FAIL FAIL which expset resses thatprefix principal uFAIL is authorized to access t 2.5 Tag intersection algorithm tag_combines([*,prefix,Y],[*,set|X],Z1) :- proof is given by the sequence of certs pf . above.ample in We the present program snapshots and queried of the the results properties in Figure mentioned 2. * range *range string/FAIL FAIL andexpresses a psetroof that is given principalFAIL by theu sequenceis authorizedrange of certs to accesspf . t and a Algorithm for combining tags depends on the structure of set_combines(X,[*,prefix,Y],Z), above.As a We possible present future snapshots extension of the we results plan in to Figure create 2. a front append([*,set],Z,Z1). access_request(K,T,Pf)proof is given by the sequence :- of certs pf . the tags being combined, and is defined inductively over the append([*,set],Z,Z1). TABLE 1 endAs where a possible user can future input extension certs which we planwill be to parsed create a before front tag_combines([*,set|X],[*,range|Y],Z1) :- access_request(K,T,Pf)grants([self,K,D,T1],[],Pf), :- structure of the input tags. Table 1 describes the structure of Table showing the structure of the tag resulting from intersection of two tags set_combines(X,[*,range|Y],Z), tag_combines(T,T1,T2),T==T2.grants([self,K,D,T1],[],Pf), theyend where get added user can to the input database. certs which Another will be possibility parsed before is to the tag resulting due to intersection of tags, where title of append([ ,set],Z,Z1). the tag resulting due to intersection of tags, where title of append([*,set],Z,Z1). The ruletag_combines(T,T1,T2),T==T2. above says that principal u is authorized to specifythey get a fileadded in which to the the database. certs are Another coded. 
possibility is to The rule above says that principal u is authorized to tag_combines([*,range|Y],[*,set|X],Z1) :- the query access request(u5,t3,X) returns yes with access t with proof pf if, (self , u, d, t1 ) is in the authorization specify a file in which the certs are coded. set_combines(X,[ ,range|Y],Z), access t with proof pf if, (self , u, d, t1) is inu the authorization * closureX The= [6 with, rule3, 1 proof, above2, 7, pf5] . saysand Therefore the that tags principal we t concludeand t combineis that authorizedu to5 must yield to be append([*,set],Z,Z1). closure with proof pf and the tags t and 1t combine to yield 2.8 Related work taccess.granted The t tagwith access combination proof to tpf3. if, means(self , that u, d, the t1) is requested in1 the authorization access is t TheseThese rules rules telltell us us how how to to combine combine tags tags when when one one of ofthe the aclosure. subset The tag withof combinationthe proof availablepf and meansset the of tagspermissions. thatt theand requestedt1 combine access to yield is a Li2.8 et al. Related [24], have work presented a spectrum of semantics for the subsett. The tag of the combination available set means of permissions. that the requested access is a SPKI/SDSI’sLi et al. [24], have naming presented scheme a based spectrum on set of semanticstheory, LP, for string the tags is a set-expr. The The first first rulerule says thatthat forfor combiningcombining sets sets 2.7The Prototype request in implementation the example will be encoded as access_ X andand YY , we should combine combine each each element element of of XX withwith every every subsetThe of therequest available in set the of permissions. example will be encoded rewritingSPKI/SDSI’s and naming FOL. They scheme have based shown on relationshipsset theory, LP, among string request (u5, t3, X ). We can conclude access_request (u5, t3 , [6, element of of YY andand collect collect the the result result in in a set. a set. 
The The predicate predicate asWeThe developedaccess requestrequest a prototype in(u5 the,t3,X toolexample). for We specifying will can be and conclude encoded verify- therewriting various and semantics FOL. They and have have shown argued relationships their semantics among 3, 1, 2, 7, 5]) from the above rule using grants ([self, u5 , 0, t2 ] , sets_combinessets combines doesdoes this this computation. computation. The The remaining remaining rules rules asaccessing authorizationsaccessrequestrequest(u5,t3 in(,u[65 SACL,,t3,31,X, 2 based),.7, 5]) We onfrom the can ideas the presentedconclude above providethe various advantages semantics over and the semantics have argued due to their Abadi semantics [1] and [], [6, 3, 1, 2, 7, 5]) and the fact that1 tags t2 and t3 combine to ruleaccessabove. usingrequest We usedgrants(u5 SWI,t3([, [6self Prolog, 3,,u1,52,,0for7,t, 25])] our, [], [6from implementation., 3, 1, 2, the7, 5]) aboveand We express that for for combining combining a prefixprefix or or a range-expr a range-expr with with a set a yield t3 . Thus the query access request (u5, t3, X) returns yes provide advantages over the semantics due to Abadi [1] and rulethehard factcoded using that thegrants tags namet([2 selfand ,u autht53, 0combine,t certs2], [], into[6, to3, the1 yield, 2 program., 7, 5])t3. Thusand We 1. http://www.swi-prolog.org/ setX weX we need need to to combine combine the the expressions expressions withwith eacheach elementelement of with X = [6, 3, 1, 2, 7, 5]. Therefore we conclude that u5 must ofthe the set set and and collect collect the the result result in in a set. a set. betheimplemented granted fact that access tagsan to initialization tt.2 and t3 phasecombine which to checks yield t currently3. Thus 1. 
http://www.swi-prolog.org/ valid certs and adds3 them to the database together with a tag_combines([*,prefix,X],[*,prefix,Y],[*,prefix,Z]) :- string_length(X,L),string_length(Y,M),Table 1 : Table showing the structure of the tagcert resulting number. from We intersection also implemented of two our tags algorithm for tag (L>M->(string_concat(Y,A,X),Z=X); intersection. M>L->(string_concat(X,B,Y),Z=Y);X==Y->Z=X).* string simple Our toolset can be used forprefix checking variousrange properties tag_combines([*,range,X|A],[*,range,X|B],Z1) including: *:- range_combines(X,A,B,C),* string simple set prefix range append([ ,range,X],C,Z1). string *string string/FAIL FAIL Isstring/FAILk′ a member of (string/FAILk, a)? Can be answeredstring/FAIL by query- • Thesimple first rule abovesimple says that the resultFAIL of combiningsimple/FAIL two ing nameset closure([k,FAIL a, k′], [], Pf ), andFAIL if it is true then our tool returns a proof of this fact in Pf . In the prefix expressionsset X andsetY is the longerstring/FAIL of them providedset set set set that the shorter expression is a prefix of the longer expres- second component we can specify a list of certificates sion. Theprefix second ruleprefix expresses thatstring/FAIL we can combine twoFAIL not toset be included inprefix the derivation FAIL What are the permissions authorized by both the rangerange forms Y and Z iffrange they are consideredstring/FAIL under the sameFAIL • set FAIL range ordering X. The predicate range combines(X,A,B,C) tags t1 and t2? Can be answered by querying computes the range C which is the result of combining the tag combines(t1,t2,T), it returns fail if there are no ranges A and B when considered under the ordering X. For common permissions or returns the common permis- lack of space we do not give the rules for range combines. sions in T What permissions does principal k have? We The process of tag combination, for the tags t1 and t2 in • the running example, is illustrated in Figure 1. 
CSIimplemented Journal of Computing a predicate | Vol. 2 • No. 4, 2015 We do not see any meaningful situations in practice permissions(k) which computes the where it will be necessary to intersect a prefix expression set of permissions that k has as the with a range expression. So we make the simplifying as- union of delegable permissions(k) and sumption that a range expression and a prefix expression use only permissions(k) Is principal k allowed to access t? We simply query fail to combine. Our tag combination algorithm is sound. • access request(k, t, Pf ) and we get the proof of au- thorization in Pf if this request is permitted. 2.6 Access decisions We have a ternary predicate access request(u, t, pf ) which We hard coded the certs corresponding to running ex- expresses that principal u is authorized to access t and a ample in the program and queried the properties mentioned proof is given by the sequence of certs pf . above. We present snapshots of the results in Figure 2. As a possible future extension we plan to create a front access_request(K,T,Pf) :- grants([self,K,D,T1],[],Pf), end where user can input certs which will be parsed before tag_combines(T,T1,T2),T==T2. they get added to the database. Another possibility is to specify a file in which the certs are coded. The rule above says that principal u is authorized to access t with proof pf if, (self , u, d, t1) is in the authorization closure with proof pf and the tags t and t1 combine to yield 2.8 Related work t. The tag combination means that the requested access is a Li et al. [24], have presented a spectrum of semantics for the subset of the available set of permissions. SPKI/SDSI’s naming scheme based on set theory, LP, string The request in the example will be encoded rewriting and FOL. They have shown relationships among as access request(u5,t3,X). 
We can conclude the various semantics and have argued their semantics access request(u5,t3, [6, 3, 1, 2, 7, 5]) from the above provide advantages over the semantics due to Abadi [1] and rule using grants([self ,u5, 0,t2], [], [6, 3, 1, 2, 7, 5]) and the fact that tags t2 and t3 combine to yield t3. Thus 1. http://www.swi-prolog.org/ R4 : 40 Towards An Executable Declarative Specification of Access Control

2.7 Prototype implementation

We developed a prototype tool for specifying and verifying authorizations in SACL based on the ideas presented above. We used SWI Prolog¹ for our implementation. We hard coded the name and auth certs into the program. We implemented an initialization phase which checks the currently valid certs and adds them to the database together with a cert number. We also implemented our algorithm for tag intersection.

Our tool can be used for checking various properties including:
• Is k′ a member of (k, a)? This can be answered by querying name_closure([k,a,k′], [], Pf); if it is true, our tool returns a proof of this fact in Pf. In the second component we can specify a list of certificates not to be included in the derivation.
• What are the permissions authorized by both the tags t1 and t2? This can be answered by querying tag_combines(t1,t2,T); it fails if there are no common permissions, and otherwise returns the common permissions in T.
• What permissions does principal k have? We implemented a predicate permissions(k) which computes the set of permissions that k has as the union of delegable_permissions(k) and use_only_permissions(k).
• Is principal k allowed to access t? We simply query access_request(k,t,Pf) and obtain the proof of authorization in Pf if the request is permitted.

We hard coded the certs corresponding to the running example in the program and queried the properties mentioned above. We present snapshots of the results in Fig. 2.

As a possible future extension we plan to create a front end where the user can input certs which will be parsed before they get added to the database. Another possibility is to specify a file in which the certs are coded.

2.8 Related work

Li et al. [24] have presented a spectrum of semantics for SPKI/SDSI's naming scheme based on set theory, LP, string rewriting and FOL. They have shown relationships among the various semantics and have argued that their semantics provides advantages over the semantics due to Abadi [1] and van der Mayden [16]. They have also extended their FOL semantics to include SPKI/SDSI auth certs. In the following, we provide a detailed comparison of our approach to the approach of Li et al.

[Fig. 1 shows the intersection of the tags [tag, [purchase, [*,range,le,200], [*,set,item1,item2,item3]]] and [tag, [purchase, [*,range,le,150], [*,set,item1,item2]]], which yields [tag, [purchase, [*,range,le,150], [*,set,item1,item2]]]; positions 1,2 use the algorithm for combining byte-strings, position 3 the algorithm for combining ranges, and position 4 the algorithm for combining sets.]

Fig. 1. Figure illustrating how tags are combined position-wise
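The kind of query the prototype answers — who is transitively authorized for what, together with a proof sequence of cert numbers — can be sketched as follows. This is an illustrative Python sketch only: the cert data, the set-based intersect stand-in, and the function names are ours, not the paper's Prolog implementation.

```python
def intersect(t, t2):
    """Stand-in for the tag-intersection algorithm: set-based tags only."""
    common = [p for p in t if p in t2]
    return common or None

def grants(certs, issuer, subject, used=()):
    """Yield (tag, proof) pairs authorizing subject, starting from issuer.
    certs maps a cert number to (k, k1, d, t); used prevents reusing a cert,
    mirroring the Ph argument of the Prolog predicate."""
    for cid, (k, k1, d, t) in certs.items():
        if cid in used or k != issuer:
            continue
        if k1 == subject:                    # rule 1: direct authorization
            yield t, [cid]
        if d == 1:                           # rule 2: chain through delegatee k1
            for t2, pf in grants(certs, k1, subject, used + (cid,)):
                t3 = intersect(t, t2)        # the two tags must combine
                if t3 is not None:
                    yield t3, [cid] + pf

# Hypothetical certs in the spirit of the running example:
certs = {
    6: ("self", "u3", 1, ["read", "write"]),  # self lets u3 delegate read/write
    7: ("u3", "u5", 0, ["read"]),             # u3 grants read to u5, no delegation
}
print(list(grants(certs, "self", "u5")))      # [(['read'], [6, 7])]
```

The proof [6, 7] plays the role of pf: the sequence of certs used in the derivation.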


1. http://www.swi-prolog.org/


Fig. 2. Figure illustrating (a) name closure, (b) tag intersection, (c) permissions of a user and (d) decision of an access request

In the approach of Li et al., the predicate m(K, A, K′) is used to denote that K′ is a member of K A. They use the following rules for computing m.

m(K,A,K').                         for every cert (K,A,K')

m(K,A,K') :- m(K_1,A_1,K').        for every cert (K,A,(K_1,A_1))

m(K,A,K') :- m(K_1,A_1,K_2),
             m(K_2,A_2,K_3),
             ...,
             m(K_n,A_n,K').        for every cert (K,A,(K_1,A_1,...,A_n)), n>1

Given a set of name certs C, let LP1(C) and LP2(C) denote their program and our program, respectively, for computing members of names. The semantics of a logic program is defined by its least Herbrand model; whenever an atom A is in the least Herbrand model of a program L we write L |= A. Our semantics and theirs are equivalent in the following sense.

Proposition 1. Given a set of name certs C, principals K and K′, and a name identifier A, LP1(C) |= m(K,A,K′) iff there exists a Pf such that LP2(C) |= name_closure([K,A,K′], Pf).

The above proposition is an easy consequence of the following observations: the first rule of our program and theirs compute direct associations; the second rule of our program replaces a name by its definition; and the third rule then takes care of chaining the intermediate results together to find the final result. In fact, the first part of our third rule generates the body of their second rule, and the second part generates the body of their third rule.
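The least-Herbrand-model reading of m can be sketched as a fixpoint computation. In this Python sketch the cert encoding and the names are ours, for illustration, and extended names with more than one identifier are omitted:

```python
def members(name_certs):
    """Least fixpoint of m: all facts (K, A, K') derivable from the certs.
    A cert is ((K, A), K') for a direct member, or ((K, A), (K1, A1)) to
    include every member of the name (K1, A1)."""
    m = set()
    changed = True
    while changed:                       # iterate until no new fact is derived
        changed = False
        for (k, a), subject in name_certs:
            if isinstance(subject, str):
                new = {(k, a, subject)}  # direct association
            else:                        # name definition: inherit its members
                new = {(k, a, kp) for (k1, a1, kp) in m if (k1, a1) == subject}
            if not new <= m:
                m |= new
                changed = True
    return m

# Illustrative certs: kB is kA's friend, and kB's friends are kA's friends.
certs = [(("kA", "friends"), "kB"),
         (("kA", "friends"), ("kB", "friends")),
         (("kB", "friends"), "kC")]
facts = members(certs)
print(("kA", "friends", "kC") in facts)  # True
```

The derived fact (kA, friends, kC) corresponds to m(kA, friends, kC); by Proposition 1 it is mirrored by a name_closure derivation with a proof Pf.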


van der Mayden [16]. They have also extended their FOL(d) program is defined by its least Herbrand model. Whenever semantics to include SPKI/SDSI auth certs. In the following, an atom A is in the least Herbrand model of program L we Fig. 2 : Figure illustrating (a) name closure, (b) tag intersection, (c) permissions of a user and Fig.we 2. provide Figure illustrating a detailed (a)name comparison closure, (b)tag of our intersection, approach (c)permissions to the write of a userL = andA (d)decision. Our semantics of a access and request theirs are equivalent in the approach of Li et al. (d) decision of a accessfollowing request| sense In the approach of Li et al., predicate m(K, A, K ) is In the approach of Li et al., predicate m(K, A, K′) is used′ whereProposition the predicate 1. Given satisfies acaptures set of constraint name certs satisfaction., princi- vanused der to Mayden denote that [16].K They′ is a have member also extendedof KA. They their use FOL the program is defined by its least Herbrand model. WheneverC to denote that K′ is a member of KA. They use the following In thepals aboveK program,and K ′the, and following name points identifier are to Abe, notedLP ( ) = semanticsfollowing to rules include for SPKI/SDSIcomputing m auth. certs. In the following, an atom A is in the least Herbrand model of program1 LC we| rules for computing m. m(K, A, K′) iff there exists a Pf such that LP2( ) = wem(K,A,K’). provide a detailed comparison of our approach to the writeƒƒ predicateL = A. Ourg of Li, semantics talks about and individual theirs are equivalentauthorizations, inC the | name| closure([K,A,K′], Pf ). approach of Li et al.for every cert (K,A,K’) followingwhere senseas our grants predicate can handle sets of m(K,A,K’)In the approach :- m(K_1,A_1,K’). of Li et al., predicate m(K, A, K′) is authorizations. So we get a succinct representation of the for every cert (K,A,(K_1,A_1)) PropositionThe above 1. 
propositionGiven a set is anof easyname consequence certs , princi- of the used to denote that K′ is a member of KA. They use the followingauthorizations observations: that a principal the first has rule of our programC and followingm(K,A,K’) rules :- for m(K_1,A_1,K_2), computing m. pals K and K′, and name identifier A, LP1( ) = m(K_2,A_2,K_3), ƒƒ the predicate grants explicitly captures the delegationC | theirsm(K, compute A, K′) iff direct there associations. exists a Pf Secondsuch that ruleLP of2 our( ) pro-= m(K,A,K’). ..., grambit, replaceswhich allows a name us to by easily its definition express the and permissions the thirdC rule| m(K_n,A_n,K’). name closure([K,A,K′], Pf ). for every cert (K,A,K’) thena takesprincipal care can of chainingand cannot the further intermediate delegate. results Whereas together m(K,A,K’) :- m(K_1,A_1,K’).for every cert (K,A,(K_1,A_1,...,A_n)), using the predicate g it becomes a complicated query for every cert (K,A,(K_1,A_1))n>1 toThe find above the final proposition result. In is fact, an firsteasy part consequence of our third of the rule Another advantage of our model is that we handle m(K,A,K’)Given :-a set m(K_1,A_1,K_2), of name certs C, let LP (C) and LP (C) followinggenerates observations: the body of the their first second rule of rule our and program second and part Given a set of name certs , let LP1 ( ) and 2LP ( ) both auth tags and validity purely syntactically (except denote their programm(K_2,A_2,K_3), and our program respectively1 for2 theirsgenerates compute the body direct of associations. their third rule. Second rule of our pro- ..., C C C for evaluating the range expressions at which point using computingdenote their members program of andnames. our The program semantics respectively of a logic for gramLi replaces et al. have a name extended by its their definition model to and include the third SPKI rule auth m(K_n,A_n,K’). 
semantics is unavoidable), thus maintaining the free programcomputing is defined members by ofits names. least The Herbrand semantics model. of aWhenever logic thencerts, takes wherein care of they chaining treat auth the intermediatetags and validity results conditions together as for every cert (K,A,(K_1,A_1,...,A_n)), interpretation. We have also described algorithms for an atom A is in the least Herbrandn>1 model of program L we to find the final result. In fact, first part of our third rule computing the intersection of auth tags based on their write L |= A. Our semantics and theirs are equivalent in the generates the body of their second rule and second part structure. followingGiven sense a set of name certs , let LP1( ) and LP2( ) generates the body of their third rule. denote their program and ourC program respectivelyC forC Proposition 1. Given a set of name certs C, principals 3. Li Capturing et al. have extendedother security their model models to include in SACL SPKI auth computing members of names. The semantics of a logic certs, wherein they treat auth tags and validity conditions as K and K′, and name identifierA , LP1(C) |= m(K, A,K′) iff there In this section, we show how the naming scheme of

exists a Pf such that LP2(C) |= name_closure([K, A, K′]; Pf ). SACL induces a hierarchy which can be used to model The above proposition is an easy consequence of the information flow policies. In this section, we limit our

following observations: the first rule of our program and attention to name certs of the form (u, a, u1) or (u, a, u1

theirs compute direct associations. Second rule of our program a1) only i.e., we do not consider extended names. The replaces a name by its definition and the third rule then takes applications we consider below justify this limitation. care of chaining the intermediate results together to find the Given a set of name certs, consider the induced directed final result. In fact, first part of our third rule generates the graph whose vertices are either local names or principals

body of their second rule and second part generates the body appearing in certs; and there is an edge from (u, a) to u1 iff we

of their third rule. have the cert (u, a, u1); and an edge from (u, a) to (u1 , a1) iff

Li et al. have extended their model to include SPKI auth we have the cert (u, a, (u1, a1)). This graph can be interpreted 7 certs, wherein they treat auth tags and validity conditions as as a dominance relation on names: name n1 dominates name n , denoted n n , iff there is a path from n to n . From the constraint domains. They They have have a a predicate predicate g(gk,(k,t,k t, k′) ′ which) which 3.12 Lattice1 ≻ model2 of information flow 1 2 represents that k gives authority t to k′ (either directly or semantics of SPKI we can conclude the following whenever represents that k gives authority t to k′ (either directly or In [12], Denning introduced the concept of lattice based transitively). They They define define aa macro macro containscontains [n[][nk][],k ] in, in terms terms n ≻ n : information1 2 flow. He defined a flow model as the 5-tuple of predicate predicate mm, ,which which is is used used to to capture capture the the fact fact that that k k ƒƒ all the members of n are members of n is a member of of nn. . Their Their program program (constraint (constraint Datalog) Datalog) for for (S,O,L, , ), where 2 1 computing g is as follows: → ⊕ computing g is as follows: ƒƒ all theS denotespermissions the authorized set of subjects for n which are authorized are active for n enti- • 1 2 contains[KA][K’] :- contains[N][K’]. ties in the system that perform actions resulting in for every cert (K,A,N) In particular, when relation ≻ forms a lattice it g(K,t,K’) :- contains[N][K’],satisfies(t,T). becomesinformation very useful flow. for Dependingmodelling onseveral the application, important the for every cert (K,N,0,T) applicationslevel ofarising abstraction in practice of a ranging subject could from varymilitary from a g(K,t,K’) :- contains[N][K’],satisfies(t,T). 
where the predicate satisfies captures constraint satisfaction. In the above program, the following points are to be noted:
•	the predicate g of Li et al. talks about individual authorizations, whereas our grants predicate can handle sets of authorizations; so we get a succinct representation of the authorizations that a principal has
•	the predicate grants explicitly captures the delegation bit, which allows us to easily express the permissions a principal can and cannot further delegate; whereas using the predicate g this becomes a complicated query

Another advantage of our model is that we handle both auth tags and validity purely syntactically (except for evaluating the range expressions, at which point using semantics is unavoidable), thus maintaining the free interpretation. We have also described algorithms for computing the intersection of auth tags based on their structure.

3 CAPTURING OTHER SECURITY MODELS IN SACL

In this section, we show how the naming scheme of SACL induces a hierarchy which can be used to model information flow policies. We limit our attention to name certs of the form (u, a, u1) or (u, a, (u1, a1)) only, i.e., we do not consider extended names; the applications we consider below justify this limitation.

Given a set of name certs, consider the induced directed graph whose vertices are the local names and principals appearing in the certs: there is an edge from (u, a) to u1 iff we have the cert (u, a, u1), and an edge from (u, a) to (u1, a1) iff we have the cert (u, a, (u1, a1)).
This graph can be interpreted as a dominance relation on names: name n1 dominates name n2, denoted n1 ≻ n2, iff there is a path from n1 to n2. From the semantics of SPKI we can conclude the following whenever n1 ≻ n2:
•	all the members of n2 are members of n1
•	all the permissions authorized for n1 are authorized for n2

In particular, when the relation ≻ forms a lattice it becomes very useful for modelling several important applications arising in practice, ranging from military organizations to corporate organizations and protection in operating systems. In the following, we show how SACL can be used to capture the lattice model of information flow [12].
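The dominance check described above is plain graph reachability over the name-cert graph. A minimal sketch, with names encoded as (principal, identifier) tuples and principals as strings (an illustrative encoding, not the paper's syntax):

```python
def dominates(name_certs, n1, n2):
    """n1 ≻ n2 iff there is a path from n1 to n2 in the name-cert graph."""
    edges = {}
    for (u, a, target) in name_certs:  # cert (u, a, u1) or (u, a, (u1, a1))
        edges.setdefault((u, a), set()).add(target)
    stack, seen = [n1], set()
    while stack:
        n = stack.pop()
        if n == n2:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(edges.get(n, ()))   # principals have no outgoing edges
    return False
```

For example, with certs ("ua", "friends", ("ub", "team")) and ("ub", "team", "kc"), the name ("ua", "friends") dominates both ("ub", "team") and the principal "kc".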
3.1 Lattice model of information flow

In [12], Denning introduced the concept of lattice-based information flow. She defined a flow model as the 5-tuple (S, O, L, →, ⊕), where
•	S denotes the set of subjects, the active entities in the system that perform actions resulting in information flow. Depending on the application, the level of abstraction of a subject could vary from a process/thread to a user name in a system
•	O denotes the set of objects, the passive entities in the system used for storing information. The level of abstraction of an object can vary from a variable in a program to an entire database
•	L is a finite set of labels, also known as security classes or security levels. Every subject and object in the system is allotted a label; this can be done either statically or dynamically depending on the application
•	→ is a binary relation on L, which defines the permissible information flows
•	⊕ is a function from L × L to L; ⊕(l1, l2) = l means that when we combine information labelled l1 and l2, the label of the resulting information must be at least l
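As a concrete (hypothetical) instance of the last two components of the 5-tuple: take labels to be sets of categories, → the subset order, and ⊕ set union, which is exactly the least upper bound in that lattice. The category names below are invented for illustration.

```python
# Labels as frozensets of categories: -> is the subset order, (+) is union.
def flows(l1, l2):
    """l1 -> l2: information labelled l1 may flow into l2."""
    return l1 <= l2

def join(l1, l2):
    """(+)(l1, l2): least label that both l1 and l2 flow into."""
    return l1 | l2

low     = frozenset()
secret  = frozenset({"secret"})
nuclear = frozenset({"nuclear"})
```

Here low flows to every label, secret and nuclear are incomparable, and combining them yields the label {"secret", "nuclear"}, into which both flow.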
Fundamental requirements on information flow like transitivity and reflexivity naturally force (L, →, ⊕) to be a lattice structure, where ⊕ denotes the join (least upper bound) operator. Restricting ourselves to read and write operations (represented by r and w respectively), we now show how a lattice model of information flow can be simulated in SACL. Given a flow model FM = (S, O, L, →, ⊕), we simulate it using the SACL specification C_FM, where
•	for every subject s ∈ S we have a principal u ∈ U
•	for every object o ∈ O we have auth tags (o, r) ∈ T and (o, w) ∈ T. Note that we abuse notation and use short forms for tags
•	for every label l ∈ L we have two name identifiers l^r ∈ A and l^w ∈ A
•	for every flow (l1, l2) ∈ → we have the name certs (u?, l1^r, (u?, l2^r)) and (u?, l2^w, (u?, l1^w)). Intuitively, these name certs say that whoever can read l2 can also read l1, and whoever can write l1 can also write l2; these are precisely the implications of a flow from l1 to l2
•	for assigning label l to subject s we issue the name certs (u?, l^r, u) and (u?, l^w, u)
•	for assigning label l to object o we issue the auth certs (u?, (u?, l^r), 0, (o, r)) and (u?, (u?, l^w), 0, (o, w))
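The cert-issuing steps of this construction can be written down directly. The sketch below compiles a flow model into cert tuples, with "adm" standing in for the issuing principal u? and l + "_r" / l + "_w" for the name identifiers l^r and l^w; the tuple encodings are invented for illustration.

```python
def compile_flow_model(flow_pairs, subject_label, object_label):
    """Emit SACL-style name/auth certs simulating a flow model."""
    name_certs, auth_certs = [], []
    for (l1, l2) in flow_pairs:  # flow l1 -> l2
        # whoever can read l2 can also read l1 ...
        name_certs.append(("adm", l1 + "_r", ("adm", l2 + "_r")))
        # ... and whoever can write l1 can also write l2
        name_certs.append(("adm", l2 + "_w", ("adm", l1 + "_w")))
    for s, l in subject_label.items():  # subject s carries label l
        name_certs.append(("adm", l + "_r", s))
        name_certs.append(("adm", l + "_w", s))
    for o, l in object_label.items():   # object o carries label l
        # auth cert: (issuer, name, delegation bit, tag)
        auth_certs.append(("adm", ("adm", l + "_r"), 0, (o, "r")))
        auth_certs.append(("adm", ("adm", l + "_w"), 0, (o, "w")))
    return name_certs, auth_certs
```

For a two-label model with flow L → H, two subjects and two objects, this produces six name certs and four auth certs.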
Note that u? is the issuing principal and all names are defined in his name space. This corresponds to an administrator in a mandatory policy: for example, in an organization all the data is owned by the organization and its employees just access data according to the policy. Once we have the SACL specification, we can use the standard certificate composition or simple graph reachability algorithms to validate access requests and verify properties of such systems. Another important aspect is that a user might not always want to operate at his highest level (the level assigned to a user is typically representative of the highest level at which he can operate). In fact the principle of least privilege, an important security property, stipulates that a user should not operate at a level higher than needed for achieving the current task. Such principles can also be enforced by keeping track of the current label and changing the label dynamically in such a way that it satisfies the policy.

A flow model and its equivalent SACL specification are illustrated in Fig. 3. From Fig. 3, we can conclude that principal u1 can read and write o1, and read o2; while u2 can read and write o2, and write o1. This matches the flow model semantics.
[Fig. 3 : Figure illustrating a flow model and an equivalent SACL specification — subjects s1 (label H) and s2 (label L), objects o1 (label H) and o2 (label L), with the flow L → H.]

Note that the above simulation immediately gives us the capability to enforce Role-Based Access Control [31], the Bell-LaPadula security model [5], Biba's integrity model [20] and the Chinese wall security model [9], as these are special cases of the lattice model of information flow.
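The access decisions read off Fig. 3 can be reproduced mechanically: encode the certs of the simulation (two labels L and H with flow L → H; u1 and o1 at H, u2 and o2 at L; "adm" standing in for u?), compute name membership by reachability over the name-cert graph, and test each (subject, object, operation) triple. This is an illustrative sketch, not the paper's tool.

```python
def members(name_certs, name):
    """Principals in a name: reachability over the name-cert graph."""
    edges = {}
    for (u, a, target) in name_certs:
        edges.setdefault((u, a), set()).add(target)
    out, stack, seen = set(), [name], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if isinstance(n, str):
            out.add(n)                  # a bare principal
        else:
            stack.extend(edges.get(n, ()))
    return out

name_certs = [
    ("adm", "L_r", ("adm", "H_r")),     # flow L -> H: readers of H read L
    ("adm", "H_w", ("adm", "L_w")),     #              writers of L write H
    ("adm", "H_r", "u1"), ("adm", "H_w", "u1"),   # u1 labelled H
    ("adm", "L_r", "u2"), ("adm", "L_w", "u2"),   # u2 labelled L
]
auth_certs = [  # (issuer, name, delegation bit, tag)
    ("adm", ("adm", "H_r"), 0, ("o1", "r")),
    ("adm", ("adm", "H_w"), 0, ("o1", "w")),
    ("adm", ("adm", "L_r"), 0, ("o2", "r")),
    ("adm", ("adm", "L_w"), 0, ("o2", "w")),
]

def allowed(u, o, op):
    """u may do op on o iff some auth cert grants (o, op) to a name containing u."""
    return any(tag == (o, op) and u in members(name_certs, name)
               for (_, name, _, tag) in auth_certs)
```

Under these certs, u1 can read and write o1 and read o2 but not write o2, while u2 can read and write o2 and write o1 but not read o1, matching the discussion of Fig. 3.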
3.2 Related work

In [26], Myers et al. introduced the decentralized information flow control (called DIFC henceforth) model, based on a lattice of labels, for controlling information flow in systems with decentralized authority. The DIFC model allows users to share information with untrusted code while still being able to control how this information is disseminated to others. Their model allows decentralized information declassification and supports finer data sharing.
Indrajit et al. [30] presented Laminar, a system that implements DIFC using an abstraction suitable for both operating system resources and program data structures. Users label data with secrecy and integrity labels, defining the desired security policy to be enforced at run time. Processes must access labelled data within special security regions.

Li et al. [23] identify several design principles for usable access control mechanisms and introduce UMIP, a practical mandatory access control model. The objective of UMIP is preserving system integrity in the face of network-based attacks. They have also demonstrated experimentally that UMIP is easy to configure, has low overhead and defends successfully against a number of known network attacks.

Using the techniques presented in Section 3 for
4 CONCLUSIONS

In this paper, we have provided an LP model for SACL covering all its features, including the authorization tags and delegation. The significance of our approach over existing work is that we have described a detailed algorithm for tag intersection computation, which plays an important role in making access decisions. To our knowledge this is the first such algorithmic description. We have also described a prototype implementation of a tool based on the model, which enables us to verify properties of SACL specifications and authorizations that can be derived by composing several tags. By exploiting the hierarchy induced by the naming scheme of SACL, we have shown how lattice-based information flow policies can be modelled in SACL. This gives us the power to capture several well studied policy concepts and models (RBAC and the Bell-LaPadula model, for example) in SACL, which form the basis for protecting confidentiality and integrity of data from untrusted applications. This also leads to a uniform framework in which heterogeneous policies can be specified and their properties analyzed.
REFERENCES

[1] M. Abadi. On SDSI's linked local name spaces. In CSFW '97: Proceedings of the 10th IEEE Workshop on Computer Security Foundations, page 98, Washington, DC, USA, 1997. IEEE Computer Society.
[2] M. Abadi. Logic in access control. In LICS '03: Proceedings of the 18th Annual IEEE Symposium on Logic in Computer Science, page 228, Washington, DC, USA, 2003. IEEE Computer Society.
[3] M. Abadi, M. Burrows, B. Lampson, and G. Plotkin. A calculus for access control in distributed systems. ACM Trans. Program. Lang. Syst., 15(4):706–734, 1993.
[4] M. Y. Becker, C. Fournet, and A. D. Gordon. SecPAL: Design and semantics of a decentralized authorization language. Technical report, in Proceedings of the 20th IEEE Computer Security Foundations Symposium (CSF), 2006.
[5] D. Bell and L. La Padula. Secure computer systems: Unified exposition and Multics interpretation. Technical Report ESD-TR-75-306, MTR-2997, MITRE, Bedford, Mass., 1975.
[6] M. Blaze, J. Feigenbaum, J. Ioannidis, and A. Keromytis. The KeyNote trust-management system version 2, 1999.
[7] M. Blaze, J. Feigenbaum, and J. Lacy. Decentralized trust management. In SP '96: Proceedings of the 1996 IEEE Symposium on Security and Privacy, page 164, Washington, DC, USA, 1996. IEEE Computer Society.
[8] M. Blaze, J. Feigenbaum, and M. Strauss. Compliance checking in the PolicyMaker trust management system. In FC '98: Proceedings of the Second International Conference on Financial Cryptography, pages 254–274, London, UK, 1998. Springer-Verlag.
[9] D. F. Brewer and M. J. Nash. The Chinese wall security policy. IEEE Symposium on Security and Privacy, 0:206, 1989.
[10] D. Clarke, J.-E. Elien, C. Ellison, M. Fredette, A. Morcos, and R. L. Rivest. Certificate chain discovery in SPKI/SDSI. J. Comput. Secur., 9(4):285–322, 2001.
[11] W. F. Clocksin and C. Mellish. Programming in Prolog, 3rd Edition. Springer, 1987.
[12] D. E. Denning.
A lattice model of secure information flow. corporate organizations and protectiontems.tems. in In In operating the the following, following, sys- wenot we show want show how to how operate SACL SACL can at can his be be highest used used assigned levelassigned always to to a (theuser a user level is istypically typically representative representative of of the the highest highest preserving system integrity in the face of network-based Commun. ACM, 19(5):236–243, 1976. tems. In the following, we show howtoto capture SACL captureto capture the can the lattice be thelattice used lattice model modelassigned model of of information information of to information a user flow flowis typically [12]. flow [12]. [12]. representativelevellevel atlevel at which of which at the which he highest he can can he operate). can operate). operate). In In fact fact In the fact the principle the principle principle of of least least of least attacks. They have also demonstrated experimentally that [13] J. DeTreville. Binder, a logic-based security language. In SP ’02: CSI Journal of Computing | Vol. 2 • No. 4, 2015 to capture the lattice model of information flow [12]. level at which he can operate). In fact the principle of least UMIP is easy to configure, has low overhead and defends Proceedings of the 2002 IEEE Symposium on Security and Privacy, page 105, Washington, DC, USA, 2002. IEEE Computer Society. successfully against a number of known network-attacks. [14] C. M. Ellison, B. Frantz, B. Lampson, R. Rivest, B. M. Thomas, and Using the techniques presented in Section 3 for mod- T. Ylonen. Simple public key certificate. July 1999. elling lattices in SACL, we can arrive at fine grained infor- [15] C. M. Ellison, B. Frantz, B. Lampson, R. Rivest, B. M. Thomas, and T. Ylonen. SPKI certificate theory, 1999. mation flow policies in operating systems. We can then use [16] J. Y. Halpern and R. van der Meyden. 
A logic for SDSI’s linked the LP model of SACL presented in Section 2 to analyze local name spaces. J. Comput. Secur., 9(1-2):105–142, 2001. and enforce these policies, and also verify their properties. [17] J. Y. Halpern and R. van der Meyden. A logical reconstruction of In particular this will help us handle dynamic label changes SPKI. J. Comput. Secur., 11(4):581–613, 2003. [18] J. Howell and D. Kotz. A formal semantics for SPKI. In ESORICS that are essential for enforcing policies like the Chinese wall ’00: Proceedings of the 6th European Symposium on Research in Com- security policy. puter Security, pages 140–158, London, UK, 2000. Springer-Verlag. N. V. Narendra Kumar, et. al. R4 : 43 modelling lattices in SACL, we can arrive at fine grained Secur., 9(4):285-322, 2001. information flow policies in operating systems. We can [11] W. F. Clocksin and C. Mellish. Programming in Prolog, 3rd then use the LP model of SACL presented in Section Edition. Springer, 1987. [12] D. E. Denning. A lattice model of secure information flow. 2 to analyze and enforce these policies, and also verify Commun. ACM, 19(5):236-243, 1976. their properties. In particular this will help us handle [13] J. DeTreville. Binder, a logic-based security language. In SP dynamic label changes that are essential for enforcing ’02: Proceedings of the 2002 IEEE Symposium on Security policies like the Chinese wall security policy. and Privacy, page 105, Washington, DC, USA, 2002. IEEE Computer Society. [14] C. M. Ellison, B. Frantz, B. Lampson, R. Rivest, B. M. Thomas, 4. Conclusions and T. Ylonen. Simple public key certificate. July 1999. In this paper, we have provided a LP model for SACL [15] C. M. Ellison, B. Frantz, B. Lampson, R. Rivest, B. M. Thomas, covering all its features including the authorization tags and and T. Ylonen. SPKI certificate theory, 1999. delegation. The significance of our approach over existing [16] J. Y. Halpern and R. van der Meyden. 
A logic for SDSI’s linked local name spaces. J. Comput. Secur., 9(1-2):105-142, 2001. work is that we have described a detailed algorithm for tag [17] J. Y. Halpern and R. van der Meyden. A logical reconstruction of intersection computation which plays an important role in SPKI. J. Comput. Secur., 11(4):581-613, 2003. making access decisions. To our knowledge this is the first [18] J. Howell and D. Kotz. A formal semantics for SPKI. In such algorithmic description. We have also described a ESORICS ’00: Proceedings of the 6th European Symposium on prototype implementation of a tool based on the model, which Research in Computer Security, pages 140-158, London, UK, 2000. Springer-Verlag. enables us to verify properties of SACL specifications and [19] T. Jim. SD3: A trust management system with certified authorizations that can be derived by composing several tags. evaluation. In SP ’01: Proceedings of the 2001 IEEE Symposium By exploiting the hierarchy induced by the naming scheme on Security and Privacy, page 106, Washington, DC, USA, 2001. of SACL, we have shown how lattice-based information flow IEEE Computer Society. policies can be modelled in SACL. This gives us the power [20] K.Biba. Integrity considerations for secure computer systems. In Technical Report ESD-TR-76-372, MITRE, Bedford, Mass, to capture several well studied policy concepts and models 1976. (RBAC, Bell-LaPadula model for example) in SACL which [21] B. Lampson, M. Abadi, M. Burrows, and E. Wobber. form the basis for protecting confidentiality and integrity of Authentication in distributed systems: Theory and practice. data from untrusted applications. This also leads to a uniform ACM Transactions on Computer Systems, 10:265-310, 1992. framework in which heterogenous policies can be specified [22] N. Li, B. N. Grosof, and J. Feigenbaum. Delegation Logic: A logicbased approach to distributed authorization. ACM and their properties analyzed. 
Transactions on Information and System Security (TISSEC), 6, 2003. References [23] N. Li, Z. Mao, and H. Chen. Usable mandatory integrity [1] M. Abadi. On SDSI’s linked local name spaces. In CSFW ’97: protection for operating systems. In SP ’07: Proceedings of the Proceedings of the 10th IEEE workshop on Computer Security 2007 IEEE Symposium on Security and Privacy, pages 164-178, Foundations, page 98, Washington, DC, USA, 1997. IEEE Washington, DC, USA, 2007. IEEE Computer Society. Computer Society. [24] N. Li and J. C. Mitchell. Understanding SPKI/SDSI using [2] M. Abadi. Logic in access control. In LICS ’03: Proceedings of the firstorder logic. Int. J. Inf. Secur., 5(1):48-64, 2006. 18th Annual IEEE Symposium on Logic in Computer Science, [25] N. Li, J. C. Mitchell, and W. H. Winsborough. Design of a role- page 228, Washington, DC, USA, 2003. IEEE Computer Society. based trust-management framework. In SP ’02: Proceedings of [3] M. Abadi, M. Burrows, B. Lampson, and G. Plotkin. A calculus the 2002 IEEE Symposium on Security and Privacy, page 114, for access control in distributed systems. ACM Trans. Program. Washington, DC, USA, 2002. IEEE Computer Society. Lang. Syst., 15(4):706-734, 1993. [26] A. C. Myers and B. Liskov. A decentralized model for information [4] M. Y. Becker, C. Fournet, and A. D. Gordon. SecPAL: Design flow control. In SOSP ’97: Proceedings of the sixteenth ACM and semantics of a decentralized authorization language. symposium on Operating systems principles, pages 129-142, Technical report, In Proceedings of the 20th IEEE Computer New York, NY, USA, 1997. ACM. Security Foundations Symposium (CSF), 2006. [27] N. V. Narendra Kumar and R. K. Shyamasundar. Specification [5] D. Bell and L. La Padula. Secure computer systems: Unified and realization of access control in SPKI/SDSI. In A. Bagchi exposition and multics interpretation. In Technical Report and V. 
Atluri, editors, ICISS, volume 4332 of Lecture Notes in ESDTR-75-306, MTR-2997, MITRE, Bedford, Mass, 1975. Computer Science, pages 177-193. Springer, 2006. [6] M. Blaze, J. Feigenbaum, J. Ioannidis, and A. Keromytis. The [28] V. Patil and R. K. Shyamasundar. Towards a flexible access KeyNote trust-management system version 2, 1999. control mechanism for e-transactions. In M. S. Obaidat and N. [7] M. Blaze, J. Feigenbaum, and J. Lacy. Decentralized trust Boudriga, editors, EGCDMAS, pages 66-81. INSTICC Press, management. In SP ’96: Proceedings of the 1996 IEEE 2004. Symposium on Security and Privacy, page 164, Washington, [29] R. L. Rivest and B. Lampson. SDSI - a simple distributed DC, USA, 1996. IEEE Computer Society. security infrastructure. Technical report, 1996. [8] M. Blaze, J. Feigenbaum, and M. Strauss. Compliance checking [30] I. Roy, D. E. Porter, M. D. Bond, K. S. McKinley, and E. Witchel. in the PolicyMaker trust management system. In FC ’98: Laminar: practical fine-grained decentralized information flow Proceedings of the Second International Conference on Financial control. SIGPLAN Not., 44(6):63-74, 2009. Cryptography, pages 254-274, London, UK, 1998. Springer- [31] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman. Verlag. Rolebased access control models. Computer, 29:38-47, 1996. [9] D. F. Brewer and M. J. Nash. The chinese wall security policy. [32] F. B. Schneider. Enforceable security policies. ACM Trans. Inf. IEEE Symposium on Security and Privacy, 0:206, 1989. Syst. Secur., 3(1):30-50, 2000. [10] D. Clarke, J.-E. Elien, C. Ellison, M. Fredette, A. Morcos, and R. n L. Rivest. Certificate chain discovery in SPKI/SDSI. J. Comput.
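As an aside on the validation step described in the article: checking whether one name dominates another (n1 ≻ n2 iff n2 is reachable from n1 in the graph induced by the name certs) is plain graph reachability. A minimal sketch with a hypothetical three-level label lattice follows (illustrative only; this is not the prototype tool described in the paper):

```python
# Illustrative dominance check: each name cert contributes an edge
# (n1, n2) meaning n1's definition includes n2, so n1 dominates every
# name reachable from it. The labels below are hypothetical examples.
from collections import defaultdict, deque

def dominates(certs, n1, n2):
    """Return True iff there is a path from n1 to n2 in the name graph."""
    graph = defaultdict(list)
    for src, dst in certs:
        graph[src].append(dst)
    seen, queue = {n1}, deque([n1])
    while queue:
        node = queue.popleft()
        if node == n2:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical lattice: top-secret ≻ secret ≻ public.
certs = [("top-secret", "secret"), ("secret", "public")]
assert dominates(certs, "top-secret", "public")   # transitive path exists
assert not dominates(certs, "public", "secret")   # no upward path
```

Full certificate composition additionally intersects authorization tags and checks delegation bits; the reachability core stays the same.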

Towards An Executable Declarative Specification of Access Control

About the Authors

N. V. Narendra Kumar obtained M.S. and Ph.D. degrees in Computer Science from the Tata Institute of Fundamental Research (TIFR), and received the Microsoft Research India Rising Star Award in 2011 for his Ph.D. thesis work on "Malware Detection by Behavioural Approach and Protection by Access Control". His research interests span several topics in the field of Cyber Security, such as Information Flow Control, Access Control and Trust Management, Cloud and Distributed Systems Security, and Malware Detection and Protection. He is currently a Research Scientist at TIFR working on a DRDO-sponsored project for the development of an information-flow-secure operating system.

R. K. Shyamasundar is a Fellow of the ACM, a Fellow of the IEEE, and currently a JC Bose National Fellow and Distinguished Professor in the Department of Computer Science and Engineering, IIT Bombay. He is a Fellow of all the National Academies of Science and Engineering in India and a Fellow of TWAS.


An Empirical Study of Popular Matrix Factorization based Collaborative Filtering Algorithms

Shamsuddin N. Ladha*, Hari Manassery Koduvely** and Lokendra Shastri***

* Tata Research Development and Design Centre, Pune, India. E-mail: See http://in.linkedin.com/in/shamsuddin
** Samsung R&D Institute, Bangalore, India. Email: [email protected]
*** Samsung R&D Institute, Bangalore, India. Email: [email protected]

Abstract—In this paper we present a systematic study to understand how the performance of different sparse matrix factorization techniques is related to the characteristics of the underlying data sets and model parameters. We compare the performance of two popular matrix factorization methods, namely Variational Bayesian and Alternating Least Squares with Weighted-λ-Regularization (ALS-WR), using synthetic and real data sets with different sparsity, noise levels and latent features. In our studies we have found that the Variational Bayesian method outperforms ALS-WR when the data is more sparse or contains more noise, or when the number of latent features in the model is less than the underlying features of the data. This indicates that Bayesian methods (for example VBMF) are better at avoiding model over-fitting compared to explicit regularization methods (for example ALS-WR). In fact, we have proved that ALS-WR is a maximum-a-posteriori (MAP) estimator. Our experiments have also revealed instances when VBMF underfits the given data. We have also compared our results with some theoretical results available for factorization of complete matrices and found that our results demonstrate consistent behavior.

Index Terms—Collaborative Filtering, Matrix Factorization Algorithms, Comparative Analysis, Variational Bayesian Matrix Factorization, Alternate Least Squares, Singular Value Decomposition, Recommendation Systems

1. Introduction

Matrix factorization is a very popular technique used for Collaborative Filtering to generate personalized recommendations in E-commerce. Its basic idea is to approximate the user-item transaction matrix X as a product of two low-rank matrices U and V:

X ≈ UV^T    (1)

If the X matrix has dimensions M × N, then U and V will have dimensions M × K and N × K respectively, where K is much smaller than M and N. In practice the transaction matrix is highly sparse, with less than 10% of its entries filled by known values, and some of these values, in general, are corrupted by noise. This makes the above matrix factorization task highly non-trivial, from the point of view of both computation time (due to scale) and accuracy (due to over-fitting). In the last decade, several machine learning methods have been developed to improve the accuracy of sparse matrix factorization, such as Alternating Least Squares with Weighted-λ-Regularization (ALS-WR) [2] and Variational Bayesian Matrix Factorization (VBMF) [3]. The performance of these different matrix factorization techniques varies depending upon the characteristics of the data sets. For example, Hernández-Lobato and Ghahramani have shown that VBMF, compared to other matrix factorization techniques, generates more accurate predictions at the long-tail part of E-Commerce data sets [4]. This was attributed to the robustness of VBMF against over-fitting due to
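Eq. (1) can be illustrated with a small numeric sketch (sizes and data below are assumed for illustration). For a complete matrix, the truncated SVD is the classical way to obtain the best rank-K approximation of the form UV^T:

```python
# Illustrative instance of Eq. (1): approximate an M x N matrix X by
# U V^T with U (M x K) and V (N x K), K << M, N. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 20, 15, 3
# Build a rank-K "ground truth" matrix so an exact factorization exists.
X = rng.normal(size=(M, K)) @ rng.normal(size=(K, N))

# Truncated SVD yields the best rank-K approximation in the least-squares sense.
W, s, Zt = np.linalg.svd(X, full_matrices=False)
U = W[:, :K] * s[:K]          # M x K (singular values folded into U)
V = Zt[:K, :].T               # N x K

print(np.allclose(X, U @ V.T))  # prints True: rank-K matrix recovered exactly
```

The collaborative-filtering setting differs in that most entries of X are missing, which is exactly why the specialized methods compared in this paper are needed.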

the underlying Bayesian averaging. There was no systematic study done to understand the relative performance of different matrix factorization methods in the context of different data sets, and the aim of this paper is to address some aspects of this gap in the understanding. We studied two types of matrix factorization methods, one of which uses a Bayesian inference scheme and the other uses regularization. Our goal is to understand how these two methods perform when the underlying parameters of the datasets (sparsity and noise levels) or of the model (latent features) are varied. For comparison purposes we used the results of the SVD method as a baseline, which we know is unstable and known to over-fit when data is sparse and noisy. Hence, in this paper the following matrix factorization methods for collaborative filtering are compared empirically:

• Variational Bayesian based Matrix Factorization (VBMF) [3]
• Alternating Least Squares with Weighted-λ-Regularization (ALS-WR) [2]
• Singular Value Decomposition (SVD) [5]

[6] provides a comprehensive literature survey of different collaborative filtering techniques; however, no empirical evidence is reported. In contrast, our work studies the three methods in detail with empirical analysis. There were several empirical studies done in the past, by [7], [8], [9], but none of them compare the three methods that we have compared in this paper. Our work augments the comparison work earlier done by these authors.

There are some theoretical studies aimed at understanding how different techniques regularize the matrix factorization. Nakajima et al. [10] have done a theoretical analysis of VBMF for the case of a complete matrix (no missing values) and shown that two types of shrinkage factors are responsible for regularization. Though their results do not apply to sparse matrices, where there are a large number of missing entries, we have related our results to their predictions for noisy but complete datasets. Another set of work looks at the necessary and sufficient criteria for recovering a low-rank matrix completely from a partial observation of its entries [11], [12]. We will interpret the results of our experiments in the light of these theoretical results.

The rest of the paper is organized as follows. In Sec. 2 we describe briefly the two methods of matrix factorization, namely Variational Bayesian and Alternating Least Squares with Weighted-λ-Regularization. In Sec. 3 we present our experimental design and methods of generating synthetic data for comparing these two methods. Sec. 4 contains the results of our experimental studies, a proof of ALS-WR as a maximum-a-posteriori (MAP) estimator, and an explanation of the results using some of the theoretical results available in the literature. In Sec. 5 we summarize our findings and present some future directions for research.

2. Variational Bayesian and Alternate Least Squares Methods of Matrix Factorization

In this section we briefly describe the two matrix factorization techniques used in our comparison study. For details readers may refer to the papers of [3] and [2].

2.1 Variational Bayesian Matrix Factorization (VBMF)

Variational approximation is a relatively new approach to estimate the posterior distribution in a Bayesian inference problem [13], [14]. Here the idea is to propose an approximate posterior distribution which has a factorized form and is parameterized by a set of variational parameters. One then minimizes the Kullback-Leibler divergence between the variational distribution and the target posterior with respect to the variational parameters, to arrive at a good approximate solution for the posterior distribution. Raiko et al. [3] were the first to apply the variational method to matrix factorization based collaborative filtering problems. Here we describe their method briefly; for more details readers may read the paper of Raiko et al. [3].

We assume the following likelihood function for X, conditional on (U, V), given by the product of Normal distributions for each of its elements:

P(X | U, V) = ∏_{i=1..M} ∏_{j=1..N} N(x_ij | u_i · v_j, σ_x²)    (2)

where u_i and v_j denote the i-th and j-th rows of matrices U and V respectively, and σ_x² is the noise parameter in the observations in X, treated as a hyper-parameter. Also we assume the following factorized prior distributions for U and V respectively:

P(U) = ∏_{i=1..M} ∏_{k=1..K} N(u_ik | ū_ik, ũ_ik)    (3)

P(V) = ∏_{j=1..N} ∏_{k=1..K} N(v_jk | v̄_jk, ṽ_jk)    (4)

Here ũ_ik and ṽ_jk are the prior variances for the elements of U and V respectively; the prior mean values ū_ik and v̄_jk are usually taken to be zero. Using Bayes rule, the posterior distribution

2.2 Alternate Least Squares with Weighted-λ-Regularization (ALS-WR)

In the case of ALS-WR, the elements of the matrices U and V are found by minimizing the following error function:

f(U, V) = Σ_{(i,j)∈O} (x_ij − u_i · v_j)² + λ ( Σ_i n_{u_i} ||u_i||² + Σ_j n_{v_j} ||v_j||² )    (8)

Here n_{u_i} and n_{v_j} represent the number of observed entries in the i-th row and j-th column of the X matrix respectively, and O represents the set of ordered (row, column) index pairs for which data is available in X. The second term acts as the regularization term, which depends on the sparsity of the data set through n_{u_i} and n_{v_j}.
[10] have done a theoretical analysisP (ofV the)= elementsp p Pp Here(V of)=p pUu˜vp jku˜andandv¯jkVv˜, v˜respectively.jkvarejk v¯ the, v˜ prior The variances(4) mean(4) forordered the +of pairλ the ofXin (row,uXimatrix,k=1 u i column)+ andjnvO indexj O vrepresentsj for which(8) the set of  i j U and V respectively prior valuesu¯ u¯ priorandHerev¯NN valuesv¯ pKikKPareik(andVu¯ usuallyik)=pjkandjk arev¯ takenjkjk thearejk to usuallyprior be variancesvdata takenv¯ is, toavailablev˜ for be thedata inXof isX the. available The||(4) || secondmatrix, in X term and. The|| acts|| secondrepresents as  term actsthe set as of prior valuesHere u˜ikikandikandNjkv˜jkarejkare usually| theN priorp taken| variancesp to be zero. fordata thejk is availablejkof thejknX inmatrix,.i(4) Theni and second|| O|| represents termj j acts|| as the||the set regularization of term which depends on the VBMF for the case of complete matrix (no missing values) jprior=1P k=1 values u¯ikj=1andk=1vUv¯jkvarep, v usuallyVp taken to behereheredatanu isuii 2 availableandandvnj vrepresents inj representsX. The the second number the term number of acts ob- as of ob-  i j zero. P ( Vzero.)=)=elementselements of ofvjkjkUand¯v¯jkandj,˜=1vjk˜ Vrespectively.respectively.N (4)(4)the| The regularization The mean meantheorderedordered regularization term which pair pair ofhere ofdepends (row, term (row, column)n whichu column) onand the depends index indexnv forrepresents onfor which the which served the number entries of in ob- the -th row and -th column zero. elements of U andNVf respectively.(pU,|p jk V )=jkp kp=1 Thethe meanUsing(x regularizationij Bayesordered u i. vrule,j pair) term the of posterior which(row, column) depends distribution index oni the forsparsity whichj of the data set through nui and nvj . zero. 
j=1 k=1 N |  here servedntheui pand regularization entriesnvj representsp in the termi-th the which row number and dependsj of-th ob- column on the and shown that two type of shrinkage factors are responsible priorpriorjp=1 valuesk=1 valuesp u¯ u¯ andand v¯ v¯ areare usually usually takenhere taken ton tou bei beanddatadatanvj isrepresents is available availableini the inX numberX. The. The second second of ob- termj term acts acts as as p Using p Bayes ik rule,ik thejkjk posteriorsparsity distributionserved− of the entriessparsity data set of in through the the datan-th setand row throughn and. nui and-thn columnvj . M K UsingUsingprior Bayes Bayes values rule, rule,u¯ the ik theand posterior posteriorv¯jk are distribution usually distribution taken(i,j)forsparsityO toU beHereanddata ofVu˜ theisik is givenavailable dataand by setv˜jk through inareX. The thenu secondi priorandui n termvj variances.vjFor acts details as for of minimizationi the of the stepsXj readersmatrix, may and O represents the set of p pHereUsingu˜ik and Bayesv˜jk rule,are the the prior posterior variances distribution for theservedofsparsity entries the X inofmatrix, the thei data-th and row setOserved andrepresents throughj-th entries columnnu thei and setn inv ofj the. -th row and -th column for regularization. Though their results do not apply to psparsep p forzero.zero.pU and V is given by ∈ served entriesForthethe details in regularization regularization the i of-th minimization row term and termj which-th which steps column depends readersdepends on may on the the Here u˜ikforandforU Uandvzero.˜andjkVpareVis givenis the givenp by prior byp variancesp for theForFor details detailsofthe the of regularization ofminimizationX minimizationmatrix, term steps steps and which readers readersO dependsrepresents may mayrefer on the to the the paper set of of R. Pan et. al. 
[2] P (U)= (uik u¯ , u˜Here)elementsforu˜ik Uandandv˜ ofjk(3)VUareu˜isand the given priorV byrespectively.v˜ variances for The the meanelementsof theorderedForX detailsmatrix, of pairU of andofand minimization (row,O representsV column)respectively. steps index thereaders set for of which Themay mean ordered pair of (row, column) index for which matrices, where there are large number of missing entries,ik Hereik u˜ik andHereHerev˜jkUsing Usingarep ik the Bayesandand Bayes priorp rule,jk rule, variances areare the the posterior the posterior for prior prior the distribution variances distributionof variances the X refermatrix,forsparsity sparsitythe for to the and the of paperof theO therepresentsof data of data the R. set Pan setX through the throughet.matrix, al. set[2]n ofuni uandi and nvnj vO.j . represents the set of referrefer to tothesparsity the paper paperP of( ofX the ofR.Up, R. dataPanV) PanPet. set(Uet. al.) throughpP al.[2](V[2]) nu and nv . i=1 k=1elementsN | ofelementsUpriorUsingand values of BayesVU andrespectively.u¯ik rule,andV respectively. thev¯jk posteriorare usually The The distribution takenmean mean toordered beordereddatarefer2 pair is to available of pair the (row, paper of column)in (row,X of.2 R. The Pan index column) secondet. for al.i term which[2] index actsj as for which we have related our results to their predictions, for noisy butelements elements of forUpPforand U(ofXU UandU andVandp, VVrespectively.) VVPis (respectively.is givenUP given)(XP (UV by,) byV The) P The mean(U P)meanP(priorU(V,orderedV )priorX values)= values pairFor Forof|u¯ detailsik (row, details and column) ofv¯ of minimizationjk minimizationare index usually(5) for steps which steps taken readers readers to may may be data is available in X. 
The second term acts as forp U andelementsPVp(pXisU given, V)pP of by(UU) P (andV) + Vλ respectively.nui For u i details+ The ofn minimization meanvj v j ordered steps(8) readers pair3E may ofXPERIMENTAL (row, column)DESIGN index for which N K P (PpriorU(,UVzero., X valuesV)=X)=u¯ikPand(U|, Vv¯jkPXare()=X usuallyU, V)|P taken(U) P(5) to(V be) data|the is(5) available regularization in XP.( TheX term) second which term depends acts as on the prior valuesprioru¯ valuesandand v¯u¯ik areand|are usuallyv¯ usuallyjk are taken usuallyp taken to takenbe(5) zero.p to to be be data|| || is is available availablereferreferXPERIMENTAL toin to|| theX the. in The|| paper paperX second. of The of R. R.ESIGN termPan second Panet. actset. al. al. as[2] term[2] acts as complete datasets. Another set of work looks at the necessaryp p ik | |P (U,jkV X)=P |(PX()X|u)¯ Pv¯(X)3Ei 3Ezero.(5)XPERIMENTALreferXPERIMENTAL to the3Ej paperDESIGND ofESIGN R. Pan et. al.D [2] X the regularization term which depends on the zero. Usingprior Bayes| values rule, the posteriorikPand(X) distributionjk are usuallythe regularizationsparsity3E takenXPERIMENTAL of the term to data be which setdataD through dependsESIGN is availablen onui andThe the performancenvj . in . The comparison second of term the candidate acts as and sufficient criteriaP ( Vfor)= recovering a v jklow-rankv¯ , v˜zero. matrix Using(4) Bayes rule, theP P(X posterior(XUU, V, V) P) Pdistribution(U(U) P) P(theV(V) regularization) for U and V term which depends on the zero. jk jk P P(U(PU, V(,XVXUX)=,)=V) P (U| )|P (VTo)The computeThe performancesparsitythe performanceFor the regularization(5) ofdetails(5) posterior theThe comparison comparisondata of performance minimization setdistribution through of term ofthe comparison the candidate stepsn on whichucandidateand the readersn ofv depends the. 
may candidate on the N | UsingforP U(U Bayesand,zero.V XV rule,)=is given the| posterior by distribution (5) sparsityUsingThe performance of Bayes3E the dataXPERIMENTAL rule,XPERIMENTAL set comparison throughthethe posterior regularizationi ofnDESIGN theandESIGNalgorithms candidatej n distribution. term was done whichsparsity using depends the popular of the on error data the set through nui and nvj . completely from a partial observationj=1 ofk =1its entries [11],ToTo compute[12]. computeUsing is the Bayesgiven theTo posterior posterior computeby rule, distributionherethe| distribution the| posteriorn posteriorui and on distribution on the distributionPn theP(vL.H.SXj(X)represents) of equation on the the (5),algorithms3E Raiko numberet. was al. used of done ob- the using fol-uDi the popularvj error Using BayesforUTo rule,and computeV the|is given posteriorthe by posteriorP ( distributionX distribution) algorithms onalgorithmsFor thesparsity3E detailsrefer was wasXPERIMENTAL to ofdone the of doneminimization paper using the using ofdata the R.D the popularstepsESIGN Pan set popularet. readers through al. error[2] error maymetric,nu RMSE.and RMSEnv . is defined as, L.H.SL.H.Sfor of ofequationU equationand VL.H.S (5),is (5), givenRaiko of Raiko equation byet.et. al. al.used (5),used Raiko the the fol-et. fol- al. usedforFor theUalgorithms details fol-and TheV ofThe minimizationis was performance performance given done usingsparsity by steps comparison comparison the readers popular of of the may of the error the datai candidate candidate setj throughFor detailsnu and of minimizationnv . steps readers may We will interpret the results of our experiments in the light of L.H.S ofUsing equationPserved Bayes (5),(X U Raiko, V entries) rule,P (U) P theused( inlowingVmetric,) posteriorthe the metric,refer fol- approximatei RMSE.-thTheRMSE. 
to performancerow the distribution RMSE papermetric, RMSE and variational is(5) of definedis RMSE.j R.comparison-th defined Pan column posterior RMSE as,et. as, al. of[2] isdefinedthe dis- candidate as, i 1 j ToTo compute compute the the posterior posterioret. al. distribution distributionreferFor onmetric, on to details the the RMSE. paper of of RMSE minimization R. Pan is definedet. al. [2] as, steps readers may 2 p p for U andlowinglowingV approximateis approximate givenP (Ulowing, V byX variational)= variational approximate| posterior posterior variational dis- dis- posterior(5) Q (U dis-, V) algorithmsalgorithms was was done done1 using1 using the the popular popular1 error error refer to the paper of R. Pan [2] these theoretical results. To compute theP ( posteriorX U, V) P distribution(U) P (V) tribution on the For details of minimization2 1 steps readers2 may et. al. Here u˜ and v˜ are the prior varianceslowing forforL.H.S approximate theL.H.S|U ofand of equationof equationV the variationalisP (5),X given( (5),X Raiko)matrix, Raiko posterior byet. al. andused dis-usedalgorithms theO3E the fol-represents fol-XPERIMENTAL was done the usingD setESIGN the of2 popular2 error1 = (x u . v ) ik jk tributionPQ((UQU,(,VUV,tribution)XV)=) P (QX|(UU,,VV))P (U) P (V) et.(5) al. 1 metric,metric, RMSE. RMSE.1 RMSE RMSE is is defined defined2rmse as, as,2 ij i j (9) tributionL.H.SP tribution(U, ofV| equationX)=Q (U, V (5),)| RaikoP (X)et. al. used the(5) fol-3Erefermetric,XPERIMENTAL to1 RMSE. the paperRMSE=PDESIGN is(X of definedU R.2, V Pan2 as,) (Px et.(U u al.).P v )[2](V)M N  −  Rest of the paper is organized as follows. In Sec. 2 we lowing approximate variational posteriorrmsermse=The= dis- performancermse1(x comparisonij(xij u i. v u ji). v j) of the(9)ij (9) candidate2 i j 1 (9) (i,j) O elements of U and V respectively. 
The| meanTo lowingcomputeordered approximate theP posterior(X) pair variational distribution of (row, posterior3E column)on theXPERIMENTAL dis- L.H.SM K index of forDESIGN whichrefer  to the paper of1 × R. Pan∈ et. al. [2] lowing approximate variational posterior dis- P (MUMrmse, VN NX= )= M− −N| ((xi,jij) O u i. v j−)1 (9) 2 2 (5) describe briefly the two methodsp ofp matrix factorization To computeP (X MU the,KV posterior) P (UM) distributionPK (V) on theThe performance(i,j) comparison(i,jO) O  of the candidate2   equationMtributiontributionK (5), RaikoQQ(U( U, Vet., V) al.) used the followingQ (U, Valgorithms )=approximate× × wasM∈( u∈ doneikNu¯ik(×,i,j usingu˜1)ik1O)  the∈− popularThe error performance2 2 of the3E algorithmsXPERIMENTAL was mea- DESIGN u¯ v¯ P (U, V XTo)=tribution computeQ the(U, posteriorV) M K distributionP (X onU the, VX) PThe(U performance) P (V| )1 × comparison  ofP the(X candidate2)  prior values ik and jk are usuallyQ ( takenL.H.SU, V)= of to equation| beQ (U,dataV (5),(u)= Raikou¯ is, u˜ availableet.) al.(uusedik u¯ik the(5), inu˜ik fol-)algorithms. The secondi=1 wask=1 donermseNrmse term using=| = acts the∈ popular as (x( errorijxij u i u. v ij. v) j) (9)(9) namely Variational Bayesian and Alternating Least SquaresQTo(U compute, V)=variational theP posterior( posteriorUP(u,ik(VXu¯ikik)X, distributionuik˜)=ik) ik Q| on(U,V)The theThe performance3E performancemetric,rmseXPERIMENTAL =The RMSE. of of the performance the RMSE algorithms(5) algorithmsMM is( definedxN ofijN wasD the was uESIGNi mea-.algorithmsas, v j mea-) sured−− (9) bywas  varying mea- key properties of the dataset, | L.H.S of equationQ (U, V)=N (5),N Raiko| | et.i=1M( al.kuM=1ikKusedNu¯Kik, u˜ theik| ) fol-  algorithmsThe performanceN wasKM doneN  of using the algorithmsthe(i,j( popular)i,jO) O  was error mea- zero. 
lowingi approximate=1ik=1=1k=1 the variational| regularization   posteriorP term dis-metric,(X) which RMSE.sured RMSE depends by varying is(i,j defined) O× on3E× key as,the− properties∈XPERIMENTAL∈ 1 of the dataset, DESIGNThe performance comparison of the candidate with Weighted-l-Regularization. In Sec. 3 we present ourL.H.S of equation (5),Mi Raiko=1K k=1 Net. al. |used thesured fol-sured by by varying varying key key properties× properties of ofthe the dataset, dataset,like2 noise in the data, sparsity of the data, lowing approximateN K variationalN K posterior dis- metric, RMSE. RMSE is∈ defined as, 1  tributionN QK(UQ, QV(U)(U, V, V)= )= (u(ikuiku¯iku¯,iku˜To,iku˜)ik) computesured byThe varyingThe the performance1( performancevjk posteriorv key¯jk, v˜ propertiesjk) of of the distribution the(6) ofalgorithms algorithms the2 dataset, was was on mea- mea- the experimental design and methods of generating synthetic Q (U, V)= N K (uik u¯ik, u˜ik) likelike noiseThe noise in performance in thelike the data, data, noise sparsity sparsity in comparison the of data, of the the data, sparsity2 data, of of the the candidate data, Using Bayes rule, the posteriorlowingtribution distribution approximateQ (U, V) sparsity variational(v v¯ ,i=1v˜i=1k) of=1 posteriork=1N( theNvjk v¯|jk data dis-|, v˜jk) set The through(6)× performancermsej=1 =k1=1 Nnu ofi | and theThe algorithmsn(xvijj performance. u i. v wasj) and1 mea- variance(9) comparison of the data.algorithms ofWhile the many candidate was exper- done using the popular error (vjk v¯jkjkN, jkv˜jk)jk| (6)(6)  like noise in the data, sparsity2 of2 the data, To compute the posterior× i=1N distributionk=1×| ( vjk N v¯jk, onv˜jk| ) theand(6) variancermse = ofandsured thesuredM variance data. byN by varying Whilevarying( ofx the key u many keydata.. v − properties) properties exper- While(9) many of of the the exper-dataset, dataset, data for comparing these two methods. Sec. 
4 contains thetribution ×QToj(=1Ujk=1,=1 computeVkN)=1 M | K j the=1NkN=1K posterior K andL.H.S distribution variance algorithmssured of of by equation the varying data.1 on was× key the While (5), done( propertiesi,j)ij RaikoO many usingi j exper- ofet. the2 theal. dataset,imentsused popular were the done fol- error on a synthetic dataset, with for U and V is given by ×NForj=1K k=1 details N | of minimization and varianceM stepslikelikeN noise readersof noise the in in data.∈ thealgorithms− the may data, While data, sparsity many sparsity was exper- of ofdone the the data, data, usingmetric, the popular RMSE. error RMSE is defined as, L.H.S of equation (5), RaikoM K used the(v(v fol-v¯imentsv¯,imentsv˜, v˜) were) werermse done(6) done=iments on× on a synthetic were a(i,j synthetic) O done( dataset,xij on dataset, a u i synthetic. v withj) withknown dataset,(9) number with of underlying features, some results of our experimental studies, proof of ALS - WR as a Q (U, V)= et. al. (uik u¯ik, u˜ikjk) Herejkjk jkulowing¯jkikjk,metric,v¯likejkThe,imentsu˜ik noise performanceapproximateand(6) RMSE. wereMv˜ injk (6)are theN done variational data,RMSE∈ of on the variational a sparsity syntheticalgorithms is parame-− defined of dataset, the was posterior data, as, mea- with dis- 1 L.H.SM ofK equation×N(×vjk v¯|jk (5),,Nv˜jkN) Raiko| | (6)et. al. usedandand the variance variance fol-(i,j) O of of the the data. data. While While many many exper- exper- HereHereu¯ u¯, vik¯ Q, v¯,(jku˜U, ,u˜andVikHere)=andv˜ u¯×v˜ikarejk, v¯areijk variationalrefer=1, k variationalu˜=1ik(Nuandik toju¯=1ikjv˜|=1, the parame-kjku˜=1ikkare parame-=1) paper variational knownknown ofThe parame- R. numberand performance numberPan varianceet. ofknown of underlying al.× ofunderlying of[2] number the the data. algorithms∈ features,metric, of features, While underlying wassome many RMSE. some mea-experiments exper- features, RMSE some were is definedperformed as, on real datasets. 
2 maximum-a-posteriori (MAP) estimator, andlowing explanation approximate ofik jk ik variationaljk j=1 kN=1 posterior|  ters dis- whosesured valuesknown by are number varying found by key of minimizing underlying properties the of features, the dataset, some 1 QHere(U,u¯Vik)=, v¯jk,i=1u˜ik k=1andN Kv˜(jku areu¯ variational, u˜ ) parame-tribution Qexperimentsiments(imentsU, V were) were were done done performed on on a asynthetic synthetic on real dataset, dataset, datasets. with with 1 P (X U, V)tersPters( whoseU whose) P values(V values)lowingters are are whose found found approximate values by by minimizing minimizingik areik foundik thevariationalby theKullback-Leibler minimizingexperimentsexperimentssuredTheiments by performancethe posterior were varying weredivergence performed doneperformed key ofdis- properties on between theon a synthetic algorithmson real realQ of datasets.(U the datasets., dataset,V dataset,) wasFollowing mea- with implementation2 of the algorithms1 2 results using some of the theoretical results available in the ters whose valuesiN=1 K areN found| by minimizing the likeexperiments noise in were the data, performed sparsity on of real the datasets. data, rmse = 2 (xij u i. 
v j) (9) P (U, V X)= tribution| Q (U, V) Kullback-LeiblerHereHereHere(5)  u¯iku¯k,ik=1v¯,jkv¯,jku˜,iku˜ divergenceikandand(vandjkv˜ v¯jkvjk˜jkare, v˜ arejkare variationalbetween) variational variationalFollowing(6)Q parame-(U parame-, Vparameters implementation) Followingknownknown 1 number numberimplementation of the of of underlying algorithms underlying of the features, features, algorithms2 some some Kullback-LeiblerKullback-LeiblerHere u¯ik, divergencev¯jk divergence, u˜ik and v˜ betweenQjk betweenare(U variational, VQ)(QU(,UV,)V parame-and)FollowingP like(Usured,knownV noise implementationX by) givennumberin varying the by data, keyof underlyingof properties sparsity the algorithms of offeatures, the the data, dataset,was some deployed for experimental purposes,   literature. In Sec. 5 we summarize our findings and present Kullback-LeiblertributionN×jK=1 divergencek=1(vNjk vXPERIMENTAL¯jk, |v˜ betweenjk)  Q(6)(U, V) andFollowingrmseESIGN variance= implementation of the data. While of( thex many algorithms u exper-.1 v ) (9) 2 M N − | P (Xand) P (U,whoseV Xand)ters givenvalues×tersP whose(U whose by , areV3EN valuesX found values) given| areby are by foundminimizing found by bywas minimizingwas minimizingthe deployedandlike deployedKullback-LeiblerD variance| noise thefor thewas for experimentalinexperiments ofexperiments experimental deployedthe data. data,M wereWhile forK purposes, sparsity were purposes, experimental performed many performed ofijexper- the on purposes, data, oni real realj datasets. datasets. (i,j) O and P (Uters, V whoseX) given values byj=1 k are=1 found(v v¯ by, minimizingv˜ ) the experimentsimentswas deployed were wereM done for performed onN experimental a syntheticrmse on real purposes, dataset,= datasets.− Our with Java based( implementationxij u i. v j) of VBMF× (9) ∈ some future directions for research. 
and| |P (UKullback-Leibler,Kullback-LeiblerV X) given| byjk divergencejk divergencejk between between(6) QQ(U(U, V, V) ) FollowingFollowingOur Java implementation based implementation(i,j) implementationO of of• the the algorithmsof algorithms VBMF    divergenceM K × | betweenN Q (|U,V) and PQ ((UU,V, V|OurX)iments)andOur givenFollowing Java varianceJava were by based based done• implementation implementation of implementationon the× aQ data. synthetic(U, V While) of of dataset, the of VBMF many VBMF algorithms with exper-M withN standard Gaussian− priors. Kullback-LeiblerHere u¯ik, v¯jk,u˜j=1ik divergenceandk=1Thev˜jk performanceare variationalbetweenM K parame-• comparison• knownQ (UOur, numberV of Java)= the based of candidate underlying implementation∈(uik features,u¯ik, ofu˜ik someVBMF) (i,j) O The performance of the algorithms was mea- andandP P(U (U, V, VXQX) given) ,given by by Q (UKL, V[Q) P ]=• Q (Uwas,wasVwith) deployedln deployed standard for for Gaussian experimentaldUdV experimental priors. purposes, purposes,×  ∈ To compute the posteriorQ distribution(U, VHere)=andu¯ikP, (v¯ onUjk, Vu˜ik theXand) given(uv˜jk Qareu¯ by(U variational(,,UVu˜)V)) parame- withknown||imentswithwas standard standard deployednumber were Gaussian done ofGaussian for underlying onP experimentali=1( priors.U a, syntheticpriors.V XN features,) purposes, dataset,| some withMahout based implementation of ALS - 2. 
Variational Bayesian and Alternate leastKL KL[Q[QPters]=P ]= whoseQ (KLQU values(,UV[Q,)VlnP)algorithmsik]= areln | foundik| Q (Uik by,dUdVVQ minimizing)dUdV(lnU was, V) done theThedUdV usingexperiments  performancewith the standard(7) wereOurpopularOur Java performed Java Gaussiank of=1 based basederror the implementation priors.on algorithms implementation real datasets.•  of was of VBMF VBMF mea-  ters whoseKL[ valuesQ P| ]=Q are||(U foundQ,(UV,)=V by) ln minimizingP (U the,dUdVV(uXMahoutexperimentsik)Mahoutu¯ik, u˜ basedik based) were implementation•Mahout• implementation performed based | on ofimplementation real of ALS datasets. ALS - - of ALS - sured by varying key properties of the dataset, Here|| ||u¯iki=1, v¯jk, u˜ikNand v˜jkP |(arePU(,U variationalV, XV)X) parame-• • knownFollowingOur number Java• implementation based of underlyingN implementationKThe of features,performance the(7) ofalgorithms VBMF someWR [15]. of the algorithms was mea- L.H.S of equation (5), Raiko et. al. usedKullback-Leibler  thek=1|| fol- divergence| betweenP(U, V XQQN)Q((UU(U,|,VV, V))|) • Mahoutp basedp implementation of ALS - squares methods of matrix factorization Kullback-Leibler divergence metric, between| Q RMSE.(iU=1, VQk(7)=1)(U(7)The RMSE, V) priorWRFollowingexperimentsWRsured is [15].variances defined[15].(7)• by implementationu˜ varying wereWR as,andwithwith [15].performedv˜ standard standard key, and of the properties variance Gaussianon Gaussian algorithms real datasets. priors. priors. ofLanczos the dataset, implementation of SVD, which is ters whoseNP (U valuesK, VKLKLX[ are)Q[QP found]=P ]= byQQ( minimizingUp( U, V, V) ln) ln|p the dUdVwasdUdVwith deployed standardik for Gaussian experimentaljk sured priors. 
bypurposes, varying• key propertieslike noise of the in dataset, the data, sparsity of the data, andKL[Q P ]=p pQgiven(U,pV byp) ln dUdV (7) 2 WR [15].MahoutMahout based based( implementationv implementationv¯ , v˜ ) of of ALS ALS(6) - - lowing approximate variationalTheThe priorand posterior prior variancesP ( variancesU, VTheTheX dis-u)˜ prior|givenu˜and|| andvariances || byv˜ v˜,p and, and u˜ varianceikN p varianceandK v˜PjkP(, U,and( andU, V,Lanczos varianceVwasX varianceLanczosX) ) deployed implementation parameter implementation for•Lanczos• experimental of implementation1 ofSVD, SVD, purposes, which whichjk isjk of is SVD,jk which is In this section we briefly describe the two Kullback-Leibler matrix || ik divergenceik jk jk  betweenP (U, V XQ)(parameterU, V• ) • Followinglikeσx ofMahout noiseOur the observationsimplementation Java• based in based the implementation implementation are data, also of thefound sparsity algorithms of of ALS VBMFavailable of - the in data, R [16]. The2 2 prior| variances 2 u˜ik and v˜jk , and variance| | • • (7)LanczosWR× implementation [15]. 2 N of SVD,| which is and variance of the data. While many exper- parameterσ σ ofofparameter thethe observationsobservations(vjkσx ofv¯jk the, arev˜ are observationsjk also) also found| found(6) are by also minimizingavailable foundOur• Java(7) in the basedR available [16].KLWR  implementationj=1 [15]. ink=1like R [16]. noise of VBMF in the data, sparsity of the data, factorization techniquetribution usedQ ( inU our, V ) comparison study.parameter Forand Px(Uofx, theV X observations) given2 by are alsoQp found(Up , V1)byp p minimizing((7)vavailablewas•v¯ deployed,WRv˜ inthewith R) [15]. KL [16]. standard divergence. for experimental(6) Gaussian priors. 
purposes,In the sequel, different experiments performed ×parameterTheTheNσx priorof prior the variances|p observationsvariancespu˜ u˜ and areand alsov˜ v˜ , found and,jk andand variancejk variance variancejkavailableLanczosLanczos in of R2 [16].the implementation implementation data. While of of SVD, manySVD, which which exper- is is byby minimizing minimizingKLdivergence.j=1[ theQbyk the=1|P KL minimizing]= KL divergence. divergence.Q (U, V the)Qln KL(U, divergence.ikVik) jkdUdVjkIn the sequel,with standard differentIn the• • sequel, Gaussian experiments different priors. performed experiments performed iments were done on a synthetic dataset, with details readers may refer to the papers of [3] and [2]. The prior variances u˜ik2 rmse2and×v˜jk=j=1, andk=1 varianceNInIn the Raiko sequel,| et.OurLanczos(Mahout differentx al.ij Java[3], u implementationbased thisibased experiments. v j) minimization implementation implementationand performed(9) of variance SVD, for of which ofon VBMF ALS synthetic of is - the and data. real While datasets many are described. exper- KLby[Q minimizingP ]=||parameterQ (U the, Vσ) KLσlnof divergence. theP observations(U, VdUdVX) are also• found•In• the sequel, available different in experiments R [16].p performed InIn Raiko Raikoet.||et. al. al.σIn[3],2parameter[3], Raiko this this minimizationet.x minimizationx al.Pof(U the[3],, V observations thisXM) for | minimization foronNon synthetic are syntheticiments alsoMahout forfound and and wereon based− real synthetic real datasets doneavailable implementation datasets and onare in are real described.a R described.[16].synthetic datasets of ALS are - dataset, described. with M K parameterIn xRaikoof the et. observations al. 
2.1 Variational Bayesian Matrix Factorization (VBMF)

Variational approximation is a relatively new approach to estimate the posterior distribution in a Bayesian inference problem [13], [14]. Here the idea is to propose an approximate posterior distribution which has a factorized form and is parameterized by a set of variational parameters. One then minimizes the Kullback-Leibler divergence between the variational distribution and the target posterior, using the variational parameters to arrive at a good approximate solution for the posterior distribution. Raiko et al. [3] were the first to apply the variational method to matrix factorization based collaborative filtering problems. Here we describe their method briefly; for more details, readers may refer to their paper.

The likelihood of the observations X is modeled as a product of Normal distributions, one for each of its elements:

P(X | U, V) = \prod_{i=1}^{M} \prod_{j=1}^{N} N(x_{ij} | u_i^T v_j, \sigma_x^2)    (2)

where u_i and v_j denote the i-th and j-th rows of the matrices U and V respectively, and \sigma_x^2 is the noise parameter of the observations in X, treated as a hyper-parameter. We also assume the following factorized prior distributions for U and V respectively:

P(U) = \prod_{i=1}^{M} \prod_{k=1}^{K} N(u_{ik} | \bar{u}_{ik}^p, \tilde{u}_{ik}^p)    (3)

P(V) = \prod_{j=1}^{N} \prod_{k=1}^{K} N(v_{jk} | \bar{v}_{jk}^p, \tilde{v}_{jk}^p)    (4)

Here \tilde{u}_{ik}^p and \tilde{v}_{jk}^p are the prior variances for the elements of U and V respectively. The prior mean values \bar{u}_{ik}^p and \bar{v}_{jk}^p are usually taken to be zero.
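Numerically, the likelihood in Eq. (2) and the factorized priors in Eqs. (3)-(4) are simple products of Gaussians. The sketch below evaluates their logarithms for a toy problem; the matrix sizes, noise level, and shared zero-mean unit prior variances are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_likelihood(X, U, V, sigma_x):
    """Log of Eq. (2): each x_ij ~ N(u_i . v_j, sigma_x^2), independently."""
    resid = X - U @ V.T
    M, N = X.shape
    return (-0.5 * np.sum(resid ** 2) / sigma_x ** 2
            - 0.5 * M * N * np.log(2 * np.pi * sigma_x ** 2))

def log_prior(W, var_p):
    """Log of Eq. (3)/(4) with zero prior means and one shared prior variance."""
    return (-0.5 * np.sum(W ** 2) / var_p
            - 0.5 * W.size * np.log(2 * np.pi * var_p))

rng = np.random.default_rng(0)
M, N, K = 6, 5, 3                     # toy sizes, assumed for illustration
U = rng.normal(size=(M, K))
V = rng.normal(size=(N, K))
X = U @ V.T + 0.1 * rng.normal(size=(M, N))

print(log_likelihood(X, U, V, sigma_x=0.1))
print(log_prior(U, var_p=1.0) + log_prior(V, var_p=1.0))
```

Factors that reconstruct X closely score a higher log-likelihood than perturbed ones, which is the quantity the variational machinery below trades off against the priors.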
Using Bayes' rule, the posterior distribution of U and V is given by

P(U, V | X) = P(X | U, V) P(U) P(V) / P(X)    (5)

To compute the posterior distribution on the L.H.S. of equation (5), Raiko et al. used the following approximate variational posterior distribution Q(U, V):

Q(U, V) = \prod_{i=1}^{M} \prod_{k=1}^{K} N(u_{ik} | \bar{u}_{ik}, \tilde{u}_{ik}) \times \prod_{j=1}^{N} \prod_{k=1}^{K} N(v_{jk} | \bar{v}_{jk}, \tilde{v}_{jk})    (6)

Here \bar{u}_{ik}, \bar{v}_{jk}, \tilde{u}_{ik} and \tilde{v}_{jk} are variational parameters whose values are found by minimizing the Kullback-Leibler divergence between Q(U, V) and P(U, V | X), given by

KL[Q || P] = \int Q(U, V) \ln ( Q(U, V) / P(U, V | X) ) dU dV    (7)

The prior variances \tilde{u}_{ik}^p and \tilde{v}_{jk}^p, and the variance parameter \sigma_x^2 of the observations, are also found by minimizing the KL divergence. In Raiko et al. [3], this minimization for finding the values of \bar{u}_{ik}, \bar{v}_{jk}, \tilde{u}_{ik}, \tilde{v}_{jk}, \tilde{u}_{ik}^p and \tilde{v}_{jk}^p was cast in the form of a set of coupled equations which need to be solved iteratively. These equations are available in Raiko et al. [3].

2.2 Alternate Least Squares with Weighted-λ-Regularization (ALS - WR)

In the case of ALS - WR, the elements of the matrices U and V are found by minimizing the following error function:

f(U, V) = \sum_{(i,j) \in O} (x_{ij} - u_i \cdot v_j)^2 + \lambda ( \sum_i n_{u_i} ||u_i||^2 + \sum_j n_{v_j} ||v_j||^2 )    (8)

Here n_{u_i} and n_{v_j} represent the number of observed entries in the i-th row and the j-th column of the X matrix, and O represents the set of ordered (row, column) index pairs for which data is available in X. The second term acts as a regularization term which depends on the sparsity of the data set through n_{u_i} and n_{v_j}. For details of the minimization steps, readers may refer to the paper of R. Pan et al. [2].

3. Experimental Design

The performance comparison of the candidate algorithms was done using the popular error metric RMSE, defined as

rmse = ( (1 / (M \times N)) \sum_{(i,j) \in O} (x_{ij} - u_i \cdot v_j)^2 )^{1/2}    (9)

The performance of the algorithms was measured by varying key properties of the dataset, such as noise in the data, sparsity of the data, and variance of the data. While many experiments were done on a synthetic dataset with a known number of underlying features, some experiments were performed on real datasets. The following implementations of the algorithms were deployed for experimental purposes:

• Our Java based implementation of VBMF with standard Gaussian priors.
• Mahout based implementation of ALS - WR [15].
• Lanczos implementation of SVD, which is available in R [16].

In the sequel, the different experiments performed on synthetic and real datasets are described. The performance comparison, as mentioned earlier, is based solely on RMSE; thus hardware specifications are not provided and the time to run these algorithms is not reported.
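For concreteness, the minimization of Eq. (8) by alternating closed-form ridge solves can be sketched in a few lines of NumPy. This is an illustrative re-implementation under assumed hyper-parameters (λ, iteration count, initialization scale), not the Mahout ALS - WR implementation used in the experiments.

```python
import numpy as np

def als_wr(X, mask, r, lam=0.05, iters=50, seed=0):
    """Alternating minimization of Eq. (8). For fixed V, each row u_i has the
    closed-form ridge solution (Vo^T Vo + lam * n_ui * I)^{-1} Vo^T x_i, where
    Vo stacks the rows of V for the products user i rated; symmetrically for V."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.normal(scale=0.1, size=(M, r))
    V = rng.normal(scale=0.1, size=(N, r))
    for _ in range(iters):
        for i in range(M):
            obs = mask[i]                  # observed columns in row i
            n_ui = int(obs.sum())
            if n_ui == 0:
                continue
            Vo = V[obs]
            U[i] = np.linalg.solve(Vo.T @ Vo + lam * n_ui * np.eye(r),
                                   Vo.T @ X[i, obs])
        for j in range(N):
            obs = mask[:, j]               # observed rows in column j
            n_vj = int(obs.sum())
            if n_vj == 0:
                continue
            Uo = U[obs]
            V[j] = np.linalg.solve(Uo.T @ Uo + lam * n_vj * np.eye(r),
                                   Uo.T @ X[obs, j])
    return U, V

def rmse(X, mask, U, V):
    """Eq. (9)-style error, averaged here over the entries selected by mask."""
    resid = (X - U @ V.T)[mask]
    return float(np.sqrt(np.mean(resid ** 2)))

# Toy usage: recover a noise-free rank-2 matrix from 70% of its entries.
rng = np.random.default_rng(1)
X_true = rng.normal(size=(30, 2)) @ rng.normal(size=(20, 2)).T
mask = rng.random(X_true.shape) < 0.7
U, V = als_wr(X_true, mask, r=2, lam=0.01)
print(rmse(X_true, mask, U, V))
```

Note how the per-row penalty lam * n_ui in the solve is exactly the weighted-λ term of Eq. (8): rows and columns with more observations are regularized more strongly.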

3.1 Synthetic Dataset

The primary advantage of using a synthetic dataset is the availability of ground truth against which the performance of the algorithms can be analyzed. A synthetic dataset, X, was designed with 3 underlying user features (Uf_i, i = 1, 2, 3) and 3 underlying product features (Vf_j, j = 1, 2, 3). Three user types (Ut_i, i = 1, 2, 3) were defined, such that each user type, Ut_i, had only one corresponding underlying user feature, Uf_i. Similarly, three product types (Vt_j, j = 1, 2, 3) were defined, each with one corresponding product feature, Vt_j. Ut_i only liked Vt_j, where i = j. The dataset had 1000 users, U_m, and 100 products, V_n. Each U_m was randomly assigned to a Ut_i. Likewise, each V_n was randomly assigned to a Vt_j. The products were rated from 1 to 5, with a rating of 4 or 5 defined as a high rating and the rest as low ratings. X_mn, i.e., the rating of V_n by U_m, was high only if the underlying user type of U_m liked the underlying product type of V_n. Note that X is a dense matrix with pure ratings, i.e., without any noise. Thus X is the ground truth against which all the results will be compared. In our experiments, 80% of the dataset was used as the training set and the rest was used as the probe set.

4. Results

In the following paragraphs, experimental results obtained on synthetic and real datasets with respect to the model property (rank, or latent features) and the dataset properties (noise and sparsity) are described.

4.1 Rank of matrix factors

As per the ground truth, the rank, r, of the matrix factors of X is 3. In this set of experiments, r was varied and the corresponding RMSE was computed for the algorithms under consideration. Fig. 1 shows the resulting plot.

Fig. 1 : Plot of the RMSE against varying values of r.

From the plot it is clear that VBMF gives the lowest RMSE, i.e., the best performance, for the different values of r. The difference in RMSE between VBMF and the next best algorithm, i.e., ALS - WR, is significant when r is less than the actual number of underlying features. As expected, RMSE is lowest for all the algorithms when r = 3. With r > 3, RMSE is expected to increase. However, the rate of increase of RMSE is highest for SVD, followed by ALS - WR. For VBMF, the RMSE remains almost constant with increasing r. This behavior of VBMF is attributed to its discriminatory nature, whereby unwanted latent features (columns of the matrix factors) are suppressed. Hence, the regularization in VBMF is called model-induced regularization [10]. Thus VBMF clearly outperforms the other two algorithms in this case.

For r = 1, Fig. 2 shows the scatter plot of the underlying matrix factors, U and V, for all the algorithms. Note that X ≈ UV^T. The number of user and product features, in this case 3, is correctly identified by the VBMF estimator.

4.2 Sparsity of dataset

Sparsity of X was gradually increased from 10% to 95%, and the RMSE was computed for r = 3. As the sparsity increases, fewer data points are available for training (and testing). This generally results in a gradual increase in RMSE with increasing sparsity. Fig. 3 shows the performance plot of the 3 algorithms with varying sparsity.

RMSE using SVD is always higher than that of the other two algorithms, and it gradually increases with increasing sparsity, as expected. However, for VBMF and ALS - WR, the RMSE initially increases gradually, but after a tipping point, in this case around 80% sparsity, both algorithms show ad hoc behavior.
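The synthetic dataset of Sec. 3.1 and the sparsity masking of Sec. 4.2 are easy to regenerate. In the sketch below, the concrete rating values drawn inside the high (4-5) and low (1-3) bands, and the uniform type assignment, are our assumptions; the paper specifies only the 1-5 scale, the 4-or-5 cutoff, and the type-matching rule.

```python
import numpy as np

def make_synthetic(num_users=1000, num_products=100, num_types=3, seed=0):
    """Synthetic rating matrix of Sec. 3.1: a rating is high (4 or 5) exactly
    when the user's type matches the product's type, and low (1-3) otherwise."""
    rng = np.random.default_rng(seed)
    user_type = rng.integers(num_types, size=num_users)       # Ut assignment
    prod_type = rng.integers(num_types, size=num_products)    # Vt assignment
    match = user_type[:, None] == prod_type[None, :]
    high = rng.integers(4, 6, size=match.shape)               # ratings 4-5
    low = rng.integers(1, 4, size=match.shape)                # ratings 1-3
    return np.where(match, high, low), user_type, prod_type

def observed_mask(shape, sparsity, seed=0):
    """Keep a (1 - sparsity) fraction of entries, as in the Sec. 4.2 sweep."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) >= sparsity

X, user_type, prod_type = make_synthetic()
train = observed_mask(X.shape, sparsity=0.2)   # 80/20 train/probe split
probe = ~train
print(X.shape, train.mean())
```

Because X is dense and noise-free by construction, it serves directly as the ground truth for computing probe-set RMSE at any sparsity level.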

Fig. 2 : Visualization of the matrix factors, U and V, for r = 1. (a), (b), (c): Computed U using VBMF, ALS - WR, and SVD respectively. (d), (e), (f): Computed V using VBMF, ALS - WR, and SVD respectively. Underlying user and product features, based on which X was generated, are crisply visible as distinct clusters in the visualization of the matrix factors obtained using VBMF (a & d), although r = 1 is less than the number of actual features (3 in this case).

While the RMSE for ALS - WR shoots up significantly, indicating possible overfitting, the RMSE for VBMF shoots up and then remains constant.

Fig. 3 : Plot of the RMSE against increasing sparsity for r = 3.

In order to understand the behavior of VBMF and ALS - WR, SVD was done on the product of the computed factor matrices U and V^T. By definition, X ≈ UV^T. Let U^v and V^v be the factors obtained using VBMF, and U^a and V^a be the factors obtained using ALS - WR. Hence, the predictive output obtained using VBMF, X^v, is U^v (V^v)^T. Similarly, the predictive output of ALS - WR, X^a, is U^a (V^a)^T. Fig. 4 shows a plot of the singular values of the predictive outputs obtained using VBMF and ALS - WR against increasing sparsity.

As observed in Fig. 4 (left), the bottom two singular values of VBMF's predictive output are truncated to zero beyond 80% sparsity. This results in severe underfitting. In fact, the predicted output is approximately the mean of the input dataset, implying that practically no learning occurred. To our knowledge, underfitting in VBMF has been contemplated in the literature but has not been empirically demonstrated earlier.

For ALS - WR, in contrast, Fig. 4 (right) shows that all three singular values increase drastically beyond 80% sparsity, implying severe overfitting of the model to the input data. This behavior of ALS - WR can be inferred from the weighted-λ-regularization proposed in [2]. Here the weights are equal to the number of entries in each row and column of X. As the sparsity increases, the number of entries in the rows and columns decreases, thereby reducing the effect of the regularization and leading to overfitting. This observation triggered a question: can ALS - WR be represented using a probabilistic framework? In Sec. 4.3 we show that ALS - WR is indeed a MAP estimator.

The two plots in Fig. 5 show the effect of increasing r on the RMSE for different sparsity levels of the input dataset. The RMSE curve for VBMF is identical for r = 3 and r = 5, due to VBMF's ability to suppress unwanted latent features (for r = 5). However, the RMSE for ALS - WR is marginally higher for r = 5 than for r = 3 for sparsity less than 80%. Beyond that, the RMSE curve shoots up significantly due to overfitting, but the RMSE is still less than that of r = 3.

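The singular-value diagnostic used in this analysis is straightforward to reproduce for any computed factors: form the predictive output X̂ = U Vᵀ and inspect its spectrum. Collapsing trailing values signal underfitting; rapidly growing values signal overfitting. The factors in this sketch are random stand-ins, not the fitted models from the experiments.

```python
import numpy as np

def predictive_singular_values(U, V, r=3):
    """Top-r singular values of the predictive output X_hat = U V^T. Trailing
    values collapsing to ~0 indicate underfitting (as seen for VBMF beyond
    80% sparsity); drastically growing values indicate overfitting (ALS - WR)."""
    return np.linalg.svd(U @ V.T, compute_uv=False)[:r]

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 3))      # stand-in factors, not fitted models
V = rng.normal(size=(50, 3))
print(predictive_singular_values(U, V))
```

Running this across the sparsity sweep for each algorithm's factors yields the curves plotted in Fig. 4.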


Fig. 4. Plot of singular values of the predictive output obtained using VBMF (left) and ALS - WR (right) against varying sparsity of the input dataset. Fig. 4. Plot of singular values of the predictive output obtained using VBMF (left) and ALS - WR (right) against varying sparsity of the input dataset.


Fig. 5 : Plot of the RMSE for different values of r for VBMF (left) and ALS - WR (right) against varying sparsity of the input dataset.

4.3 ALS - WR as MAP Estimator

To begin, assume the likelihood function for X as defined in Eq. 2, i.e., similar to that of VBMF. Also, assume the following factorized prior distributions for U and V respectively:

P(U) = \prod_{i=1}^{M} \mathcal{N}(u_i \mid 0, \tilde{u}_i^p I)    (10)

P(V) = \prod_{j=1}^{N} \mathcal{N}(v_j \mid 0, \tilde{v}_j^p I)    (11)

Here ũ_i^p and ṽ_j^p are the prior variances of the random variables u_i and v_j of U and V respectively. Using the Bayes theorem, the posterior distribution, P(U, V | X), is proportional to the product of the prior distributions and the likelihood function:

P(U, V \mid X) \propto P(X \mid U, V) \, P(U) \, P(V)    (13)

Rewriting P(U) and taking its negative log likelihood:

P(U) = \prod_{i=1}^{M} \left( \frac{1}{\sqrt{2\pi \tilde{u}_i^p}} \right)^{N+1} \exp\left\{ -\frac{u_i^T u_i}{2 \tilde{u}_i^p} \right\}    (14)

-\ln P(U) = \frac{1}{2} \sum_{i=1}^{M} \frac{u_i^T u_i}{\tilde{u}_i^p} + \frac{N+1}{2} \sum_{i=1}^{M} \left\{ \ln(2\pi) + \ln(\tilde{u}_i^p) \right\}    (15)

In the above equation, the last term can be ignored, as it does not depend on U. Similarly, the negative log likelihood of P(V) is

-\ln P(V) = \frac{1}{2} \sum_{j=1}^{N} \frac{v_j^T v_j}{\tilde{v}_j^p} + \frac{M+1}{2} \sum_{j=1}^{N} \left\{ \ln(2\pi) + \ln(\tilde{v}_j^p) \right\}    (16)

From [5], it is known that

-\ln P(X \mid U, V) = \frac{1}{2} \sum_{i} \sum_{j} \left( x_{i,j} - u_i \cdot v_j \right)^2    (17)

Hence, the negative log likelihood of Eq. 13 can now be written as

\frac{1}{2} \sum_{i} \sum_{j} \left( x_{i,j} - u_i \cdot v_j \right)^2 + \frac{1}{2} \sum_{i=1}^{M} \frac{u_i^T u_i}{\tilde{u}_i^p} + \frac{1}{2} \sum_{j=1}^{N} \frac{v_j^T v_j}{\tilde{v}_j^p}    (18)

Define ũ_i^p = 1/n_{u_i} and ṽ_j^p = 1/n_{v_j}, which implies that the greater the number of samples, the higher the precision. Now, Eq. 18, which is a MAP estimate, is similar to the ALS - WR objective function (Eq. 8).

Fig. 6 : Plot of the RMSE against Gaussian noise with zero mean and varying σ for r = 3.

Fig. 7 : VBMF: Plot of the RMSE against Gaussian noise with zero mean and varying σ for different values of r.

4.4 Noise in dataset

Next, the performance of the algorithms was assessed when entries of the input dataset are corrupted by noise. D was corrupted by adding a zero mean Gaussian noise with standard deviation, σ. σ was varied from 0.5 to 5 with an increment of 0.5.
The plot in Fig. 6 shows the effect of noise on RMSE when r = 3. As seen in Fig. 6, as the noise increases, the RMSE steadily increases. Again, RMSE is highest for SVD. VBMF and ALS - WR have almost similar RMSE up to σ = 2; thereafter the RMSE for ALS - WR is higher than that for VBMF. In addition, both r and σ were varied simultaneously, and the corresponding changes to RMSE are plotted in Fig. 7, Fig. 8, and Fig. 9 for VBMF, ALS - WR and SVD respectively.

4.5 Verification against theoretical results

In [10], the authors have established theoretical non-asymptotic bounds for the singular values of the predictive output of VBMF and MAP estimators for dense matrices in the presence of noise. The synthetic dataset, X, is dense. Further, it has been established earlier, in Sec. 4.3, that ALS - WR is a MAP estimator. Hence, these bounds can be verified for X corrupted by noise. For ALS - WR, as per [10],

\hat{\gamma}_h = \max\left\{ 0, \; \gamma_h - \frac{\sigma^2}{c_{a_h} c_{b_h}} \right\}    (19)

where γ̂_h is the h-th largest singular value obtained using the MAP estimator, γ_h is the h-th largest singular value of the noisy dataset, and σ is the standard deviation of the Gaussian noise.
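Eq. 19 is simple enough to check numerically. The sketch below applies the bound to illustrative singular values (hypothetical numbers, not the paper's measurements); with standard Gaussian priors the product c_{a_h} c_{b_h} is 1:

```python
import numpy as np

def map_singular_values(gamma, sigma, c_ab=1.0):
    """Eq. 19: each singular value gamma_h of the noisy dataset is
    shrunk by sigma^2 / (c_ah * c_bh) and clipped at zero."""
    gamma = np.asarray(gamma, dtype=float)
    return np.maximum(0.0, gamma - sigma ** 2 / c_ab)

# Hypothetical singular values of a noisy dataset.
gamma = [950.0, 12.0, 3.0]
sigma = 4.0

# sigma^2 = 16 is subtracted; values that fall below zero are clipped,
# which is how weak latent features get suppressed.
print(map_singular_values(gamma, sigma))
```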
theγh singular XX, isvalueh XXXXXXXXX the of hthe value noisy 8 + N + ln(2πN)++ ln(˜lnv (2) Nπ)+(16)lnln(˜v(2)π)+(16)ln(˜v ) (16) th to4.4 RMSE4.4Next,to Noise Noise RMSE arethesimultaneously in plotted inperformance dataset datasetare plotted in Fig. andof in2 7,algorithms Fig. the Fig. corresponding 7, 8,{ Fig. and 2was 8, Fig. andassessed 9{ Fig.2 changeslargestj } 9 dataset,largest singular{ obtainedj } σsingular is value the using standardj value of} MAPthe of deviation noisy the estimator, noisy dataset, of the dataset,γ Gaussianhσis theσ noiseh Similarly, for VBMF, as per [10], when entries of the input dataset are corruptedj =1 by noise.j =1 D j =1 forNext, VBMF,for the VBMF,to performanceALS RMSE - ALS WR are and-of plotted WR algorithms SVD and in respectively SVD Fig. was 7, respectively as- Fig.4.54.5 8, and Verification Verificationis Fig. the standardis 9 the againstlargest against standard deviation theoretical theoretical singular deviation of results value results the of Gaussian of the the Gaussian noisy noise dataset, noise σ 2 M wasNext, Next,corrupted the the performance performanceby adding a of ofzero algorithms algorithms mean Gaussian was was as- as- noise with Similarly, forMσ VBMF,2 σ as per [10], Mσ2 sessedsessed when whenfor VBMF, entries entries of ALS of the the - input WRinput dataset and dataset SVD are are respectivelyInIn [10], [10], the the authors authorsis have have the established established standard theoretical deviationtheoretical of the Gaussian noise L standard deviation,From σ . [5], σ was itFrom is varied known [5], from itFrom that is0.5 known to [5], 5 with it isthat anknown that max 0,γh γˆh γh corruptedcorrupted by by noise. noise.DDwaswas corrupted corrupted by by adding adding non-asymptoticnon-asymptotic bounds bounds for for singular singular values values of of 2 M { − γh 2 − σcah cbh }≤ ≤ − γh 2 incrementa zero of mean 0.5. GaussianThe plot in noise Fig. 
with6 shows standardM effectN ofthe noisetheM predictive predictive Non outputM outputN of of VBMF VBMF and and MAP MAP esti- esti- Mσ L Mσ aa zero zero mean mean Gaussian Gaussian noise noise with with standard standard 2 2 2 (20) RMSE when r = 3. 1 mators1 forT dense1 matricesFig.T in 6. presence PlotFig.T of of 6. noise. the PlotFig. RMSE of 6. the against Plot RMSE of Gaussian the against RMSE Gaussian againstmax 0,γ Gaussianh γˆh γh deviation,deviation,σ.σ. .σσwaslnwasP varied( variedX U, fromV fromln)=P0.(.05X.5totoU5,5lnwithVwithP)=(xXi,jmatorsmatorsU,uVi for)= forv densex densei,j matricesu matricesi vxi,j in in presenceu presencei v of of noise. noise. γ c c γ j j j σ r =3σ r =3Forσ{ VBMF,r =3− standardh − ah bh Gaussian}≤ ≤ priors− areh ananAs increment incrementseen in ofFig.− of0 ..056,...5 . The Theas The |the plot plot plot− noise in in in Fig. Fig. Fig.2increases,| 6 6i 6− shows shows showsj theThe2|The −RMSEi synthetic syntheticj· 2 dataset,−i dataset,j noise· XX,,, is is− iswith dense. dense. dense.· zeronoise Further, Further, Further, mean with it itnoise itzero and varying mean with zero andfor mean varying and. for varying . for .  has been  established  earlier in Sec. 4.3, that (20) steadilyeffecteffect increases. of of noise noise on onAgain, RMSE RMSE RMSE when when risr=3 highest=3.. . for SVD.hashas VBMF been been established established(17) earlier earlier(17) in in Sec. Sec. 4.3,(17) 4.3, that that used, hence cah = cbh =1, rest of the values ALS - WR is a MAP estimator. Hence, these For VBMF, standard Gaussian priors are and ALSAsAs seen seen - WR in inHence, haveFig. Fig. 6, almost6, theas as theHence, negativethe similar noise noise the increases, RMSE increases,logHence, negative likelihood upto theALSALS σlog negative -= - WRof likelihood 2, WR Eq. is is a log 13 a MAP MAP likelihood of estimator. Eq. estimator. 13 of Hence, Hence, Eq. 13 these these pertaining to corrupted X remains the same. 
boundsbounds can can be be verified verified for forXXcorruptedcorrupted by by thereafterthethe RMSE RMSE RMSE steadily steadily for ALS increases. increases. - WR is Again,higher Again, RMSEthan RMSE that is is for VBMF. used, hence cah = cbh =1, rest of the values CSI JOURNAL OF COMPUTING, VOL. XX, NO. XX, XXXXXXXXX highest forcan SVD. now VBMF becan and written now ALS -beascan WR written now have be asnoise.noise. written7 For For ALS asALS - -WR, WR, as as per per [10], [10], The estimated values using Eq. 20 are 841 highesthighestIn addition, for for SVD. SVD.both VBMFr VBMF and σ and were and ALS ALSvaried - - WR WRsimultaneously have have and pertaining to corrupted X remains the same.≤ almostalmost similar similar RMSE RMSE upto uptoσσ=2=2,,, thereafter thereafter thereafter 2 2 γˆ1 < 920.07, γˆ2 =0, and γˆ3 =0. The actual CSI JOURNAL OF COMPUTING, VOL. XX, NO. XX, XXXXXXXXX the corresponding changesM N to RMSEM N are plottedM NinM Fig. 7, Fig.7 M M σσ RMSE for ALS1 - WR is higher1 than1 that2 for1 2 1 γˆhγˆ==max2max10,γ,γ0,γh (19)(19) The estimated values using Eq. 20 are 841 RMSERMSE for for ALS ALS - - WR WR is is higher higher than thanT that that for for Tp T h hT p T { h h−pc Tc } values, obtained by performing SVD on the In the above equation, the last term can be8, and Fig. 9 for VBMF, ALSxi,j - WRu andi v SVDxi,j +respectivelyui vxu˜i,ju +uuii v u˜ u+ u{i{ u−˜−cuachachubchbi}h } ≤ VBMF.VBMF. 2 2− · j 2 − 2 · j i −i 2· j i i 2 i i γˆ1 < 920.07, γˆ2 =0, and γˆ3 =0. The actual ignored, as it does not depend on U. 
Similarly, In addition, bothi jr and sigmai jwere  variedi ji=1 where, γˆi=1is the hthththi=1largest singular value predictive output obtained using VBMF is γˆ1 = In the above equation, the last term can be InIn addition, addition, both both r randandsigma sigma werewere varied varied where,where,γˆ hγˆhisis the thehh largestlargest singular singular value value values, obtained by performing SVD on the simultaneously and the corresponding changes obtainedN usingN MAP estimator,N γ is the hththth 5 5 the negative likelihood of P (V) is simultaneouslysimultaneously and and the the corresponding corresponding changes changes1 obtainedobtainedp T using1 using MAP MAPp T estimator,1 estimator,p Tγhγhisis the thehh 946.57, γˆ2 =1.76 10− , and γˆ3 =1.44 10− . ignored, as it does not depend on U. Similarly, + v˜ v +v (18)v˜ v+v (18)v˜ v v (18) predictive output× obtained using VBMF× is γˆ1 = toto RMSE RMSE are are plotted plotted in in Fig. Fig. 7, 7, Fig. Fig. 8, 8, and and Fig. Fig. 9 9 largestlargestj j singular singularj Fig. valuej value 8.j ofj of ALS the thej noisy noisy -j WR:j dataset, dataset, Plotσσ of the RMSE against These values are close to the estimated values, the negative likelihood of P (V) is 2 j=1 2 j=1 2 j=1 5 5 N T forfor VBMF, VBMF, ALS ALS - -WR WR and and SVD SVD respectively respectively isis the the standard standard deviation deviationFig. of 8 of the: ALS the Gaussian Gaussian - WR: Plot noise noise of the RMSE against σ 946.57, γˆ2 =1.76 10− , and γˆ3 =1.44 10− . 1 vj vj Gaussian noise with zero mean and varying predicted by [10]. Fig.Gaussian 8. ALS noise - WR: with Plot zero ofmean the and RMSE varying against σ × × lnP (V)= p p 1 p p 1 p1 p 1 1 p 1 r These values are close to the estimated values, − 2 N vT˜ Define, u˜ Define,= andu˜ Define,=v˜ = u˜and,= whichv˜ =and im-for,v˜ different which= im-, values whichfor different im- of . values of r. 
An important point to note is that, although 1 j=1 v jvj i nu i j nu niv jnu nGaussianv j nv noise with zero mean and varying σ P (V)= j i i j i j j predicted by [10]. ln Np plies that greaterplies that the greaternumberplies that the of greater samplesnumber the ofhigher number samples of higher samples higher γˆ1 has high value for both ALS - WR and − M2+1j=1 v˜j p for different values of r. An important point to note is that, although + N ln(2π)+ln(˜vj ) (16) the precision.the Now, precision.the Eq. precision. 18, Now, which Eq. Now, 18,is a which MAP Eq. 18, is awhich MAP is a MAP γˆ γˆ N VBMF, 2 and 3 have been suppressed by M 2+1 j=1{ } γˆ1 has high value for both ALS - WR and + N ln(2π)+ln(˜vp) (16) estimate, isestimate, similar to isestimate, the similar ALS isto - similarWR the ALS objective to - the WR ALSFig. objective - 7. WR VBMF: objectiveFig. Plot 7. VBMF: ofFig. the 7. RMSE Plot VBMF: of againstthe Plot RMSE ofGaus- the against RMSEVBMF Gaus- against estimator. Gaus- 2 { j } function (Eq.function 8). (Eq.function 8). (Eq. 8). VBMF, γˆ2 and γˆ3 have been suppressed by From [5], it is knownj =1 that sian noisesian with noise zerosian mean with noise zero and with mean varying zero andσ meanfor varying andσ varyingfor σ for different valuesdifferent of r. valuesdifferent of r values. of r. 4.6VBMF Real estimator. Dataset From [5], it is knownM thatN 1 T 2 Fig. 6. Plot of the RMSE against Gaussian Some of the experiments pertaining to vary- lnP (X U, V)= xi,j ui vj Fig. 6 : Plot 4.4of the NoiseRMSE4.4 inagainst dataset Noise Gaussian4.4 in dataset Noise noise in with dataset 4.6 Real Dataset M N − | 2 − · 2 noise with zero mean and varying σ for r =3. 1 i j  T Fig. 6.zero Plot mean of theand RMSEvaryingagainst σ for r = Gaussian3. 
Hence, the negative log likelihood of Eq. 13 can now be written as

(1/2) Σ_{i=1}^{M} Σ_{j=1}^{N} (x_{i,j} − u_i · v_j)² + (1/2) Σ_{i=1}^{M} (u_iᵀ u_i)/ũ_i^p + (1/2) Σ_{j=1}^{N} (v_jᵀ v_j)/ṽ_j^p    (18)

Define ũ_i^p = 1/n_{u_i} and ṽ_j^p = 1/n_{v_j}, which implies that the greater the number of samples, the higher the precision. Now Eq. 18, which is a MAP estimate, is similar to the ALS - WR objective function (Eq. 8).
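The correspondence between the MAP estimate of Eq. 18 and the ALS - WR objective can be made concrete in code. The sketch below is illustrative only: the function name, the observation mask, and the matrix shapes are our assumptions, not the paper's implementation.

```python
import numpy as np

def map_objective(X, U, V, mask):
    """Eq. 18: squared reconstruction error plus sample-weighted ridge penalties.

    With precisions 1/u~_i^p = n_{u_i} (ratings given by user i) and
    1/v~_j^p = n_{v_j} (ratings received by item j), each row of U and V
    is penalized in proportion to its number of observed samples.
    """
    n_u = mask.sum(axis=1)                    # n_{u_i}, one per user row
    n_v = mask.sum(axis=0)                    # n_{v_j}, one per item column
    resid = mask * (X - U @ V.T)              # residual on observed entries only
    fit = 0.5 * np.sum(resid ** 2)
    reg_u = 0.5 * np.sum(n_u * np.sum(U ** 2, axis=1))
    reg_v = 0.5 * np.sum(n_v * np.sum(V ** 2, axis=1))
    return fit + reg_u + reg_v
```

Minimizing this alternately in U and V with the sample-count weights is exactly the weighted-λ regularization scheme of ALS - WR, which is why Eq. 18 being a MAP estimate supports the ALS - WR-as-MAP claim.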
4.4 Noise in dataset

Next, the performance of the algorithms was assessed when entries of the input dataset are corrupted by noise. D was corrupted by adding zero-mean Gaussian noise with standard deviation σ, which was varied from 0.5 to 5 in increments of 0.5. The plot in Fig. 6 shows the effect of noise on RMSE when r = 3.

Fig. 6. Plot of the RMSE against Gaussian noise with zero mean and varying σ, for r = 3.

As seen in Fig. 6, as the noise increases, the RMSE steadily increases. Again, RMSE is highest for SVD. VBMF and ALS - WR have almost similar RMSE up to σ = 2; thereafter, RMSE for ALS - WR is higher than that for VBMF.

In addition, both r and σ were varied simultaneously, and the corresponding changes to RMSE are plotted in Fig. 7, Fig. 8, and Fig. 9 for VBMF, ALS - WR and SVD respectively.

Fig. 7. VBMF: Plot of the RMSE against Gaussian noise with zero mean and varying σ, for different values of r.

Fig. 8. ALS - WR: Plot of the RMSE against Gaussian noise with zero mean and varying σ, for different values of r.

Fig. 9. SVD: Plot of the RMSE against Gaussian noise with zero mean and varying σ, for different values of r.
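The noise sweep of Sec. 4.4 can be outlined as follows. This is a hedged sketch: a random matrix stands in for the synthetic dataset D, and truncated SVD stands in for the factorization step (VBMF or ALS - WR would be substituted there in the actual experiments).

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(A, B):
    return np.sqrt(np.mean((A - B) ** 2))

def svd_rank_r(X, r):
    # Rank-r truncated SVD reconstruction (the SVD baseline).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

D = rng.random((200, 100))                # stand-in for the synthetic dataset
errors = []
for sigma in np.arange(0.5, 5.01, 0.5):   # sigma = 0.5, 1.0, ..., 5.0
    D_noisy = D + rng.normal(0.0, sigma, size=D.shape)
    errors.append(rmse(D, svd_rank_r(D_noisy, 3)))   # r = 3, as in Fig. 6
```

Plotting `errors` against σ gives the qualitative shape of Fig. 6 for the SVD baseline: RMSE rises steadily with the noise level.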
4.5 Verification against theoretical results

In [10], the authors have established theoretical non-asymptotic bounds for the singular values of the predictive output of VBMF and MAP estimators for dense matrices in the presence of noise. The synthetic dataset, X, is dense. Further, it has been established earlier, in Sec. 4.3, that ALS - WR is a MAP estimator. Hence, these bounds can be verified for X corrupted by noise. For ALS - WR, as per [10],

γˆ_h = max{0, γ_h − σ²/(c_{a_h} c_{b_h})}    (19)

where γˆ_h is the h-th largest singular value of the predictive output obtained using the MAP estimator, γ_h is the h-th largest singular value of the noisy dataset, σ is the standard deviation of the Gaussian noise with which the dataset is corrupted, and c_{a_h} and c_{b_h} are the prior variances of the h-th singular vectors of the U and V matrices respectively. For σ = 5 and r = 3, γ_1 = 946.48, γ_2 = 46.88 and γ_3 = 46.53 for the corrupted X dataset. For ALS - WR, c_{a_h} = 100, i.e., the number of products, and c_{b_h} = 1000, i.e., the number of users. The estimated values using Eq. 19 are γˆ_1 ≈ γ_1, γˆ_2 ≈ γ_2, γˆ_3 ≈ γ_3. The actual values, obtained by performing SVD on the predictive output obtained using ALS - WR, are γˆ_1 = 945.03, γˆ_2 = 50.37 and γˆ_3 = 49.38. These values are significantly close to the predicted theoretical values, further strengthening our claim of ALS - WR as a MAP estimator.

CSI Journal of Computing | Vol. 2 • No. 4, 2015
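Eq. 19 together with the reported constants can be checked numerically. A small sketch (the function name is ours; the γ values, σ, c_{a_h} and c_{b_h} are the paper's):

```python
import numpy as np

def map_predicted_sv(gamma, sigma, c_a, c_b):
    # Eq. 19: predicted singular values of the MAP (ALS - WR) output.
    return np.maximum(0.0, gamma - sigma ** 2 / (c_a * c_b))

gamma = np.array([946.48, 46.88, 46.53])   # noisy-X singular values at sigma = 5
pred = map_predicted_sv(gamma, sigma=5.0, c_a=100.0, c_b=1000.0)
# sigma**2 / (c_a * c_b) = 25 / 100000 is negligible, so pred ~ gamma,
# matching gamma_hat_h ~ gamma_h in the text.
```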
Similarly, for VBMF, as per [10],

max{0, γ_h − Mσ²/γ_h − √(M/L) σ²/(c_{a_h} c_{b_h})} ≤ γˆ_h ≤ max{0, γ_h − Mσ²/γ_h}    (20)

where L and M (L ≤ M) are the dimensions of X. For VBMF, standard Gaussian priors are used, hence c_{a_h} = c_{b_h} = 1; the rest of the values pertaining to the corrupted X remain the same. The estimated values using Eq. 20 are 841 ≤ γˆ_1 ≤ 920.07, γˆ_2 = 0 and γˆ_3 = 0. The actual values, obtained by performing SVD on the predictive output obtained using VBMF, are γˆ_1 = 946.57, γˆ_2 = 1.76 × 10⁻⁵ and γˆ_3 = 1.44 × 10⁻⁵. These values are close to the estimated values predicted by [10]. An important point to note is that, although γˆ_1 has a high value for both ALS - WR and VBMF, γˆ_2 and γˆ_3 have been suppressed by the VBMF estimator.
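The VBMF bound can be checked the same way. This sketch assumes the bound in the form max{0, γ_h − Mσ²/γ_h − √(M/L) σ²/(c_{a_h} c_{b_h})} ≤ γˆ_h ≤ max{0, γ_h − Mσ²/γ_h}, with L = 100 and M = 1000 taken as the dimensions of the synthetic dataset; under these assumptions it reproduces the reported interval for γˆ_1 and the zeros for γˆ_2 and γˆ_3.

```python
import numpy as np

def vbmf_sv_bounds(gamma, sigma, M, L, c_a=1.0, c_b=1.0):
    # Eq. 20: lower/upper bounds on the VBMF predictive singular values.
    # Standard Gaussian priors give c_a = c_b = 1.
    shrunk = gamma - M * sigma ** 2 / gamma
    upper = np.maximum(0.0, shrunk)
    lower = np.maximum(0.0, shrunk - np.sqrt(M / L) * sigma ** 2 / (c_a * c_b))
    return lower, upper

gamma = np.array([946.48, 46.88, 46.53])
lower, upper = vbmf_sv_bounds(gamma, sigma=5.0, M=1000, L=100)
# gives 841.01 <= gamma_hat_1 <= 920.07, and gamma_hat_2 = gamma_hat_3 = 0
```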
4.6 Real Dataset

Some of the experiments pertaining to varying noise and sparsity levels were conducted on two real-world datasets, viz., the MovieLens (ML) ratings 1M dataset and an E-commerce (Ecomm) client dataset. The ML dataset has 6040 users and 3706 items, with 1000209 ratings between 1–5. The Ecomm dataset has 472419 users and 239732 items, with 2720947 ratings between 1–3.

Fig. 10 and Fig. 11 show plots of RMSE against varying Gaussian noise and increasing sparsity respectively for the ML ratings dataset. The variance of the derived datasets, with the specific noise and sparsity, is plotted alongside. VBMF in general outperforms the other two methods, in particular at higher noise and sparsity levels. It is observed that the shapes of the VBMF and ALS - WR RMSE curves are similar to that of the variance curve. This effect is also seen in the Ecomm dataset, as shown in Fig. 12 for VBMF. This suggests a strong correlation between the performance of these algorithms (VBMF and ALS - WR) and the variance of the datasets. However, no such relationship was observed for SVD.
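For context on the sparsity regimes involved, the fill ratios implied by the counts above can be computed directly (a back-of-the-envelope calculation, not a figure from the paper):

```python
# density = number of ratings / (users * items)
ml_density = 1000209 / (6040 * 3706)          # MovieLens 1M: ~4.5% of entries observed
ecomm_density = 2720947 / (472419 * 239732)   # Ecomm: far sparser, ~0.002% observed
```

The Ecomm matrix is roughly three orders of magnitude sparser than ML, which is the regime where the differences at high sparsity discussed above matter most.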
Shamsuddin N. Ladha, et. al.    R5 : 51

5 CONCLUSIONS

In this paper, the performance of two popular sparse matrix factorization algorithms VBMF
10{ 10.: ML− MLratingscah ratingscbh }dataset: dataset: Plot Plotof the of RMSE the RMSE against RMSE for ALSRMSE - WR for ALS is higher - WR than is higher thatVBMF. for thanused, that forhenceγˆhca=h max= cbh =10γˆ,γh =h, restmax of0 the,γh values(19) (19) Fig. 12. VBMF on Ecomm dataset: Plot of the effect ofeffect noise of on noise RMSE on RMSE when= cb = 1,r when =3rest. ofr the=3 values. { pertaininghas− c beenhasc{ ALSto} establishedcorrupted been− - WRc establishedc X is remains} earliera MAP earlier in estimator. Sec. in 4.3, Sec. Hence, that 4.3, that these almost similaralmost RMSE similar upto RMSEσalmost=2 uptoAs, similarthereafter seenσ =2 inRMSE, Fig.thereafterpertaining upto 6,h asσ the to=2 corrupted noise, thereafter increases,X remainsah2 bh the same.ah2 bh varianceagainstFig.2 th 10. variance and ML Gaussian ratings and dataset: Gaussian noise with Plot noise zero of the mean with RMSE zeroand Fig. 12. VBMF on Ecomm dataset: Plot of the VBMF. VBMF. In addition, both r and sigma wereσ varied σwhere, γˆh is theσh largest singular value RMSE against Gaussian noise with zero mean the same. The estimated valuesALS using -ALS WRbounds Eq. is- WR 20 a MAParecan is 841 a be estimator. MAP< verified1 estimator. Hence, for X Hence,corrupted these thesevarying by σ for r = 10. RMSE for ALSRMSE - WRfor ALS is higherAs -RMSE WR seenthe thanAs is for in higher RMSE seen that Fig. ALS in for steadilythan 6, - Fig. WR asThe that the 6, is increases. estimated as forhigher noise theγˆh increases,noise= than Again,valuesmax that increases, usingth0γˆ RMSE,γh for=h max Eq. is 20th0 are,γh 841γˆh =(19)max 0,γ(19)h meanagainst and variance varying(19) andσ for Gaussianr =th 10. noise with zero RMSE against Gaussian noise with zero mean In addition,In both addition,r and bothsigmar andweresigma variedsimultaneouslywere
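As an illustration of the singular-value shrinkage in Eq. (19), the following sketch (mine, not the authors' code) soft-thresholds the singular values of a noisy low-rank matrix; the matrix sizes, rank and σ are arbitrary assumptions, and standard Gaussian priors give c_ah = c_bh = 1:

```python
import numpy as np

# Sketch of the MAP shrinkage of Eq. (19): each singular value gamma_h
# of the noisy matrix X is soft-thresholded by sigma^2 / (c_ah * c_bh).
rng = np.random.default_rng(0)
r, m, n, sigma = 3, 50, 40, 0.5                     # assumed sizes
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X = X_true + sigma * rng.standard_normal((m, n))    # noisy observation

gamma = np.linalg.svd(X, compute_uv=False)          # singular values of X
c_ah = c_bh = 1.0                                   # standard Gaussian priors
gamma_hat = np.maximum(0.0, gamma - sigma**2 / (c_ah * c_bh))

# The shrunken values never exceed the observed ones and are non-negative.
assert np.all(gamma_hat <= gamma) and np.all(gamma_hat >= 0.0)
```

The small singular values, which are dominated by noise, are thresholded to zero, while the large ones are only mildly reduced, which is why the γ̂h reported above stay close to the leading singular values of X.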

VBMFCSI Journal underfits of Computing the data. Theoretical | Vol. 2 • No. bounds 4, 2015 on singular values of dense but noisy data were verified on our dataset. In future, we intend to establish theoretical bounds on singular values for matrix factorization of sparse and noisy data.

REFERENCES [1] S. Zhang, W. Wang, J. Ford, and F. Makedon, “Learning from incomplete ratings using non-negative matrix fac- torization,” in SIAM Conference on Data Mining, 2006, pp. 549–563. Fig. 11. ML ratings dataset: Plot of the RMSE [2] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large- scale parallel collaborative filtering for the netflix prize,” against variance and sparsity for r = 10. in Algorithmic Aspects in Information and Management, ser. Lecture Notes in Computer Science, 2008, vol. 5034, pp. 337–348. and ALS - WR based on the characteristics of [3] T. Raiko, A. Ilin, and J. Karhunen, “Principal component analysis for large scale problems with lots of missing the datasets and the model parameters have values,” in European Conference on Machine Learning, ser. been compared. Various experiments were con- Lecture Notes in Computer Science, 2007, vol. 4701, pp. ducted by varying noise and sparsity levels 691–698. [4] J. M. Hernndez-Lobato and Z. Ghahramani, “Netbox: a of the synthetic and the real datasets and probabilistic method for modeling and mining market model parameters (hidden features) to mea- basket data,” 2013. sure the performance of both the algorithms, [5] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. Springer, 2007. using RMSE as the benchmark performance [6] X. Su and T. M. Khoshgoftaar, “A survey of collaborative metric. As per the results obtained, Variational filtering techniques,” Advances in Artificial Intelligence, vol. Bayesian based matrix factorization relatively 2009, pp. 4:1–4:19, January 2009. [7] J. Lee, M. Sun, and G. Lebanon, “A comparative study outperforms the ALS - WR based method espe- of collaborative filtering algorithms,” Computing Research cially with increasing sparsity. We have demon- Repository, vol. abs/1205.3193, 2012. strated that ALS - WR is a MAP estimator. [8] Z. Huang, D. Zeng, and H. 
Chen, “A comparison of collaborative-filtering recommendation algorithms for e- VBMF is more robust to over-fitting than MAP commerce,” Intelligent Systems, vol. 22, no. 5, pp. 68–78, (ALS - WR) due to its underlying Bayesian September 2007. averaging. Thus, our experimental results are [9] J. S. Breese, D. Heckerman, and C. Kadie, “Empirical anal- ysis of predictive algorithms for collaborative filtering,” in in conformance with the theoretical behav- Conference on Uncertainty in Artificial Intelligence, 1998, pp. ior. We have also showcased instances, when 43–52. R5 : 52 An Empirical Study of Popular Matrix Factorization based Collaborative Filtering Algorithms

REFERENCES

[1] S. Zhang, W. Wang, J. Ford, and F. Makedon, "Learning from incomplete ratings using non-negative matrix factorization," in SIAM Conference on Data Mining, 2006, pp. 549–563.
[2] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, "Large-scale parallel collaborative filtering for the Netflix prize," in Algorithmic Aspects in Information and Management, ser. Lecture Notes in Computer Science, 2008, vol. 5034, pp. 337–348.
[3] T. Raiko, A. Ilin, and J. Karhunen, "Principal component analysis for large scale problems with lots of missing values," in European Conference on Machine Learning, ser. Lecture Notes in Computer Science, 2007, vol. 4701, pp. 691–698.
[4] J. M. Hernández-Lobato and Z. Ghahramani, "Netbox: a probabilistic method for modeling and mining market basket data," 2013.
[5] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. Springer, 2007.
[6] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, pp. 4:1–4:19, January 2009.
[7] J. Lee, M. Sun, and G. Lebanon, "A comparative study of collaborative filtering algorithms," Computing Research Repository, vol. abs/1205.3193, 2012.
[8] Z. Huang, D. Zeng, and H. Chen, "A comparison of collaborative-filtering recommendation algorithms for e-commerce," Intelligent Systems, vol. 22, no. 5, pp. 68–78, September 2007.
[9] J. S. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Conference on Uncertainty in Artificial Intelligence, 1998, pp. 43–52.
[10] S. Nakajima and M. Sugiyama, "Theoretical analysis of Bayesian matrix factorization," Journal of Machine Learning Research, vol. 12, pp. 2583–2648, September 2011.
[11] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from noisy entries," Journal of Machine Learning Research, vol. 11, p. 2057, 2010.
[12] E. Candes and Y. Plan, "Matrix completion with noise," in Proceedings of the IEEE, vol. 98, 2010, p. 925.
[13] T. S. Jaakkola, "Variational methods for inference and estimation in graphical models," Ph.D. dissertation, 1997, AAI0598367.
[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[15] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, 1st ed. Manning Publications Co., 2011.
[16] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008, ISBN 3-900051-07-0.

About the Authors

Shamsuddin N. Ladha is presently working at TATA Research Development and Design Center as a Research Associate in Remote Sensing. Earlier, he was at Infosys Labs. His research interests are in the areas of Image Processing, Computer Vision, and Machine Learning. He has a PhD in Computer Science and Engineering from IIT Bombay and Monash University. Prior to joining the PhD program, he worked extensively in Software Research and Development in the USA and India.

Hari Manassery Koduvely is a Principal Engineer with Samsung R&D Institute, Bangalore, India. His research interests are in the area of Machine Learning and Statistical Models for Natural Language Understanding. Prior to Samsung, he worked at Infosys and Amazon. Dr. Hari has a PhD in Statistical Physics from the Tata Institute of Fundamental Research, Mumbai, and did postdoctoral work at the Weizmann Institute of Science, Israel, and Georgia Tech, Atlanta.

Lokendra Shastri is a Consultant (VP level), Advanced Technology Labs at Samsung Research Institute, Bangalore. Earlier, he was Assoc. Vice-President and GM, Research at Infosys Limited. Before entering industry, Dr. Shastri was a Senior Research Scientist at the International Computer Science Institute, a research organization affiliated with the CS Division of UC Berkeley, and on the faculty of the CIS Department at the Univ. of Pennsylvania. Dr. Shastri has made significant contributions to massively parallel models of inference, knowledge representation, semantic networks, episodic memory, and computational neuroscience. He has authored a book and over 80 invited chapters and peer-reviewed articles, with over 3100 citations, and he also holds multiple patents. Dr. Shastri received his PhD in computer science from the University of Rochester, an MS in computer science from the Indian Institute of Technology, Madras, and a B.E. (honors) in Electronics (with distinction) from the Birla Institute of Technology and Science, Pilani.

Mitigation of Jamming Attack in Wireless Sensor Networks

Manju. V.C.*, Dr. Sasi Kumar**
* Research Student, Kerala University, Email: [email protected]
** Professor, Marian Engineering College, Email: [email protected]

The categorization of jamming attacks is based on different network performance degradation phenomena like network congestion or channel fading. The development of a multimodel detection technique can assist in detecting jamming attacks with a lower false alarm rate and high precision. This paper proposes a fuzzy model based on RSSI, PDR and PSR for the precise detection and classification of jamming attacks, and examines the combination of different jamming methods and detection of service in Wireless Sensor Networks (WSN).

Keywords: Jamming Attacks, Wireless Sensor Networks, Multi-model detection technique

1. Introduction

Building a safe and sound sensor network is a difficult task due to the limited components connected with low-cost sensor hardware. An intruder intercepts a transmission and jams it. An extensive set of wireless sensor applications can be built for monitoring hostile environments like wild forests, for medical healthcare to promote health services, and for industries to attain high quality management. In sensor networks, we may come across a type of node considered as faulty; a node is measured to be faulty if it provides results which significantly deviate from the results of its neighbor nodes. Academics seem to believe that the jammer, who may be an intruder or one of the network's own compromised nodes, has limited resources, fully or partially knows the protocols, and attacks the network from a location well within the geographical extent of the WSN. The military believes that it is neither worth the effort to learn the WSN protocols nor essential to move into the WSN geographical area for jamming, since it is so easy to jam the nodes with brute RF power from a far-off safe distance, especially when the RF frequency is known. Therefore, there is a need to balance the two approaches in modelling the jamming attack to make our counter-jamming efforts, like jamming detection and jammer localization, suitable for information warfare.

II. Literature Review

Accurate detection of radio jamming attacks is a challenge in mission critical scenarios. Many detection techniques have been proposed in the literature, but the precision component is always an issue, and numerous scholars have suggested different mechanisms to detect jamming attacks. Each of these recommended methods is to be applied at the individual node level to crisply conclude whether the node is jammed or not. Their techniques are either based on threshold values or use digital signal processing techniques to differentiate between a legitimate and an illegitimate (jamming) signal and thus conclude about the presence or absence of the jammer. The existing models and concepts are given in the tables below (Table 1 and Table 2).

Table 1 : Existing suggested models
Xu et al. [12] : Constant Jammer, Deceptive Jammer, Random Jammer, Reactive Jammer
Law et al. [11] : Periodic Data Packet Jammer, Periodic Control Interval Jammer, Periodic Listening Interval Jammer, Periodic Cluster Jammer
Wood et al. [1] : Scan Jammer, Interrupt Jammer, Pulse Jammer, Activity Jammer
Muralidharan et al. [10] : Pulsed-Noise Jammer, Single-Tone Jammer, Electronic Intelligence (ELINT) and Multi-Tone Jammer

Table 2 : Existing suggested concepts
Xu et al. [12] : MICA2 Mote platform
Muralidharan et al. [10] : The swarm intelligence and ant system
Cakiroglu et al. [8] : Threshold values of three detection parameters: Bad Packet Ratio (BPR), Packet Delivery Ratio (PDR), and Energy Consumption Amount (ECA)
Reese et al. [6] : A method to differentiate a legitimate received signal, called clean signal(t), from a signal received from a jammer, called jammed signal
Strasser et al. [9] : Detecting a reactive jammer (which otherwise is so difficult to be detected) through Received Signal Strength (RSS) and Bit Error Rate (BER) samplings


Various techniques are used for assessment of a node's results against those of its neighbours to fine-tune them, and we examine the existing results proposed by various scholars in the following sub-sections. The review of the mentioned prototypes reveals that the military prototypes target the network at the physical layer, with RF power being their main weapon. The educational prototypes target the data-link layer, with RF power levels at par with the existing average transmitted power of a WSN node.

III. Proposed Methodology

The proposed model detects the occurrence of RF jammers and classifies them; the model is based on a multimodal approach. It integrates PDR and signal strength as the detection constraints, with signal strength variation (ΔS) and pulse width (PW) as the model-specific parameters.

The jamming pulse acts as a high-power Gaussian noise which can appear quite a few times over the channel. To detect it, N samples of the channel energy s(t) are collected, forming a larger window of samples s(k), s(k - 1), ..., s(k - N + 1), taken at consecutive smaller sampling time windows. The detection is done using Equation 1:

T(k) = (1/N) Σ_{n=0}^{N-1} s(k - n)²    (1)

where T(k) is the average jamming pulse energy observed for a window of N samples. To decide on the occurrence of a jammer, T(k) is evaluated against a threshold, and the threshold is cautiously calculated to avoid false detection. The following are the relevant parameters collected by the detector in a given sample window of time to detect the jamming attack and its type: (1) PDR, (2) ΔS, and (3) pulse width, subject to ΔS > 0.

PDR is the ratio of the total number of packets correctly received to the total number of packets received. For an environment with noise and interference, the PDR is measured at the receiver side as the ratio of the number of received packets that pass the cyclic redundancy check (CRC) to the total number of packets received. The PDR of the known sampling window can be calculated as follows:

PDR = (1 - Pj)(1 - Pc),    (2)

where Pj is the jamming probability computed for the different jammers in the subsequent sections and Pc is the collision probability of the packets when there are many transmitting nodes at the same time. In our experimentation with a single transmitter and receiver, Pc is always zero. However, it comes into consideration when the number of stations contending for the channel is more than one.

Signal strength variation (ΔS) is the change in signal strength taken in dB, i.e., ΔS = SSobserved - SSnetwork, where SSnetwork is the signal strength achieved during the training session without jamming and SSobserved is the signal strength observed when the network is suspected to be under attack. PW is the measure of time for which ΔS is greater than the threshold value; it is measured in microseconds.

The jamming rate is the rate at which the jammer squashes the channel. If x is the time for which ΔS > 0 and y is the whole sampling window time, it is written as follows:

Rj = x/y,    (3)

where Rj is the jamming velocity. For example, for a jamming pulse lasting 1 μs in a total window of 1,000 μs, Rj is said to be 1/1,000, and for a constant jammer with continuous transmission of the jamming pulse, the rate is 1. The jamming rate Rj for the time T can be derived through the following equation:

Rj = Σ_i (PW_{i+1} - PW_i) / T    (4)

where PWi is the jammer pulse time, (PW_{i+1} - PW_i) is the sub-window time during which ΔS > 0, and the total sample window time is T.

III.a. Fuzzy Logic Controllers

The pioneering effort by Zadeh [7] on fuzzy algorithms introduced the idea of formulating the control algorithm by logical rules. An FLC consists of a set of rules of the form: IF (a set of conditions is satisfied) THEN (a set of consequences can be inferred). In view of the fact that the antecedents and the consequents of these IF-THEN rules are associated with fuzzy concepts, they are frequently called fuzzy conditional statements. In FLC terminology, a fuzzy control rule is a fuzzy conditional statement in which the antecedent is a condition in its application domain and the consequent is a control action for the system under control. The inputs of fuzzy rule-based systems should be given by fuzzy sets; therefore, we have to fuzzify the crisp inputs. Furthermore, the output of a fuzzy system is always a fuzzy set; as a result, to get a crisp value we have to defuzzify it. Fuzzy logic control systems usually consist of four major parts: a fuzzification interface, a fuzzy rule base, a fuzzy inference engine and a defuzzification interface, as presented in Figure 1; the functions of the fuzzification interface and the defuzzification interface are shown in Figure 2 and Figure 3.
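The detection quantities in Equations (1)–(4) can be sketched in code. This is a minimal illustration under my own naming; the function names and list-based inputs are assumptions, not the paper's notation:

```python
import numpy as np

def avg_pulse_energy(s, N):
    """Eq. (1) sketch: average energy T(k) of the last N channel samples
    s(k), s(k-1), ..., s(k-N+1); a jammer is suspected when T(k)
    exceeds a carefully chosen threshold."""
    window = np.asarray(s[-N:], dtype=float)
    return float(np.mean(window ** 2))

def pdr(p_j, p_c):
    """Eq. (2): PDR = (1 - Pj)(1 - Pc)."""
    return (1.0 - p_j) * (1.0 - p_c)

def jamming_rate(pulse_edges, T):
    """Eq. (4) sketch: Rj = sum_i (PW_{i+1} - PW_i) / T, summing the
    sub-window durations during which delta-S > 0 over the window T."""
    edges = np.asarray(pulse_edges, dtype=float)
    return float(np.sum(np.diff(edges)) / T)

# Single transmitter and receiver, so Pc = 0 as in the paper's setup.
print(pdr(0.25, 0.0))                    # -> 0.75
# A 1 us pulse in a 1,000 us window gives Rj = 1/1,000, as in the text.
print(jamming_rate([0.0, 1.0], 1000.0))  # -> 0.001
```

For a constant jammer, the pulse edges span the entire window and the rate evaluates to 1, matching the text.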



A fuzzification operator has the III.b. Mamdani Fuzzy Logic result of transforming unique data Controller into fuzzy sets. Many of the cases fuzzy singletons are used as The most frequently used fuzzy fuzzifiers as per the figure 4. inference technique is Mamdani Manju. V.C., et. al. R6 : 55 method [5]. It was proposed, by Mamdani and Assilian, as the very are used as fuzzifiers as per the Fig. 4.  first attempt to control a steam       engine and boiler combination by   synthesizing a set of linguistic control rules obtained from

  experienced human operators.    

Fig.Figure 4 : Fuzzy 4: Fuzzy singletons singletons as fuzzifiers as Their work was inspired by an     In other words, fuzzifiers equally influential publication by  Fuzzifier (x0) = x0, (5) Zadeh [7]. Interest in fuzzy control     μx0 (x) =1 f orIn xother = x0 words,  has continued ever since, and the

0 f or x ≠ x0 literature on the subject has grown Figure 1: Fuzzy Logic Control. Fuzzifier (x0) = x0, (5) Fig.Measures 1 : Fuzzy the Logic values ofControl. inputs variables, Where, x0 is a crisp input value from a process. rapidly. A survey of the field with Cumulative the individual rules outputs in order to x0 (x) =1 f or x = x0 fairly extensive references may be Performs a scaleMeasures mappings the values that of transfers inputs variables, the range of values of inputs obtain the overall system output. variables into corresponding universes of discourse z0 = defuzzifier(C) (6) found in [6] or, more recently, in [8]. Performs a scale mappings that transfers the range of values of inputs The fuzzy control0 f or action x  x0C inferred from the fuzzy control variables into corresponding universes of discourse In Mamdani’s model the fuzzy Performs the function of fuzzyfication system is transformed into a crisp control action. Where, defuzzifierWhere, stands x0 for is a adefuzzification crisp input value operator from and the most implication is modelled by Performs the function of fuzzyfication used defuzzificationa process. operators. For a discrete fuzzy setC Mamdani’s minimum operator, the The rule base comprises knowledge of the application domain having the universe of discourse V. conjunction operator is min, the t- The rule base comprises knowledge of the application domain Their work was inspired by an equally influential publicationCumulative by Zadeh [7]. the Interest individual in fuzzy rules control has norm from compositional rule is min toto define define linguistic control control rules rules and fuzzy and fuzzydata manipulation data manipulation in a FLC in a FLCcontinuedoutputs ever since, in order and theto obtainliterature the on overall the subject has and for the aggregation of the rules grown rapidly. A survey of the field with fairly extensive system output. 
the max operator is used.In order to TheThe rule rule basebase characterizescharacterizes the controlthe control goals andgoals the andcontrol the policy control of the policy of thereferences may be found in [6] or, more recently, in [8]. In domain experts by means of a set of linguistic control rules. domain experts by means 6of a set of linguistic control rules. Mamdani’s model the fuzzy implication is modelled by explain the working with this model Mamdani’sz0 = minimum defuzzifier(C) operator, the conjunction (6) operator is of FLC will be considered the Fig. 2 :Figure The 2: functions The functions of fuzzification fuzzification interface interface min, the tnorm from compositional rule is min and for the Figure 2: The functions of fuzzification interface aggregation of the rules the max operator is used.In order to example from [7] where a simple The fuzzy deduction mechanism is concepts and of inferring fuzzy The fuzzy deduction mechanism is the kernel of a FLC explain the working with this model of FLC will be considered the kernel of a FLC and it has the control actions employing fuzzy two-input one-output problem that The fuzzy deduction mechanism is concepts and of inferringthe fuzzy example from [7] where a simple two-input one-output capabilityand ofit simulatinghas the humancapability of simulatingimplication human and the rules decision- of inference The fuzzy control action C inferred the kerneldecision-making ofmaking a FLC and based it has ofof thefuzzy fuzzy concepts and ofin fuzzyinferringcontrol logic. fuzzy actions control employing problem fuzzy that includes three rules is examined: includes three rules is examined: capability actions of simulating employing human fuzzy implication and theimplication rules of andinference the rules of inferenceRulefrom 1: IF x the is A3 fuzzy OR y is controlB1 THEN systemz is C1 is Rule1: IF x is A3 OR y is B1 THEN decision-makingin fuzzyA scale based logic. 
mapping, of which fuzzy converts the range in fuzzy logic. Ruletransformed 2: IF x is A2 AND into y is aB2 THEN crisp z controlis C2 of values of output variables into Rule 3: IF x is A1 THEN z is C3. z is C1 A scalecorresponding mapping, universes which of converts discourse the range action. Where, defuzzifier stands for Step 1:a Fuzzification defuzzification operator and the of values of output variables into Rule2: IF x is A2 AND y is B2 correspondingDefuzzification, universes of which discourse yields a non Themost first usedstep is defuzzificationto take the crisp inputs, operators. x0 and y0, and fuzzy control action from an inferred determineFor the a degreediscrete to whichfuzzy these set Cinputs having belong the to each of THEN z is C2 fuzzy control action. the appropriate fuzzy sets universe of discourse V. Defuzzification, which yields a non μA1 (x0) = 0.5, μA2 (x0) = 0.2, μB1 (y0) = 0.1, μB2 (y0) = 0.7 (7) Figurefuzzy 3:control The functions action of from defuzzification an inferred interfac e 8 fuzzy control action. Step 2: Rules evaluation

CSI Journal of Computing | Vol. 2 • No. 4, 2015    R6 : 56    Mitigation of Jamming Attack in Wireless Sensor Networks

Fig. 3 : The functions of defuzzification interface

Step 2 : Rules evaluation

A fuzzification operator has the effect of transforming crisp data into fuzzy sets; in many cases fuzzy singletons are used. The fuzzified inputs are applied to the antecedents of the fuzzy rules. If a fuzzy rule has multiple antecedents, the fuzzy operator (AND or OR) is used to obtain a single number that represents the result of the antecedent evaluation. To evaluate the disjunction of the rule antecedents, one uses the OR fuzzy operation; characteristically, the classical fuzzy union is applied:

μA∪B(x) = max {μA(x), μB(x)}.   (8)

Similarly, to evaluate the conjunction of the rule antecedents, the AND fuzzy operation, intersection, is applied:

μA∩B(x) = min {μA(x), μB(x)}.   (9)

Now the result of the antecedent evaluation can be applied to the membership function of the consequent. The most common method is called clipping: because the top of the membership function is sliced off, the clipped fuzzy set loses some information. On the other hand, clipping is often preferred because it involves less complex computation and generates an aggregated output surface that is easier to defuzzify. Another method, named scaling, offers an improved approach for preserving the original shape of the fuzzy set: the original membership function of the rule consequent is adjusted by multiplying all its membership degrees by the value of the rule antecedent. The set of fuzzy rules used in this model is given in Table 3.

Through NS2 simulations, multiple sets of three crisp inputs, PDR, PSR and RSS, are first mapped into fuzzy membership functions. We have chosen the trapezoid shape, basically because it can easily be manipulated into an unsymmetrical function and can mathematically resemble natural functions such as the Bell function and the Gaussian function. The following graphical representations present the trapezoid functions with respect to PDR, PSR and RSS (Figs. 5, 6, 7).
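The rule evaluation of equations (8)-(9) and the clipping/scaling of a consequent can be sketched as follows; μA3(x0) is not listed in (7), so 0.0 is assumed here purely for illustration, and the discretized consequent set is hypothetical:

```python
# Fuzzified inputs from equation (7); mu_A3(x0) assumed 0.0 (not given).
mu = {"A1": 0.5, "A2": 0.2, "A3": 0.0, "B1": 0.1, "B2": 0.7}

alpha1 = max(mu["A3"], mu["B1"])   # Rule 1: OR antecedent, eq. (8)
alpha2 = min(mu["A2"], mu["B2"])   # Rule 2: AND antecedent, eq. (9)
alpha3 = mu["A1"]                  # Rule 3: single antecedent
print(alpha1, alpha2, alpha3)      # firing strengths: 0.1 0.2 0.5

# Applying Rule 2's firing strength to a discretized consequent mu_C:
mu_c = [0.0, 0.5, 1.0, 0.5, 0.0]           # hypothetical samples of mu_C
clipped = [min(alpha2, m) for m in mu_c]   # clipping slices the top off
scaled = [alpha2 * m for m in mu_c]        # scaling preserves the shape
print(clipped)   # [0.0, 0.2, 0.2, 0.2, 0.0]
print(scaled)    # [0.0, 0.1, 0.2, 0.1, 0.0]
```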

Table 3 : Fuzzy rules

Rule    PDR     PSR     RSSI    Jamming
1       Low     Low     Low     Deceptive
2       Low     Low     High    No
3       Low     Medium  Low     Constant
4       Low     Medium  High    No
5       Low     High    Low     Reactive
6       Low     High    High    No
7       Medium  Low     Low     Constant
8       Medium  Low     High    No
9       Medium  Medium  Low     Random
10      Medium  Medium  High    No
11      Medium  High    Low     Reactive
12      Medium  High    High    No
13      High    Low     Low     No
14      High    Low     High    No
15      High    Medium  Low     No
16      High    Medium  High    No
17      High    High    Low     No
18      High    High    High    No

Fig. 5 : The trapezoid function for the input PDR
Fig. 6 : The trapezoid function for the input PSR
Fig. 7 : The trapezoid function for the input RSS
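The rule base of Table 3 can be encoded directly as a lookup from the three input levels to the inferred jamming type. This is a crisp sketch of the table only; in the full FLC each crisp input belongs to several levels with different degrees, and the rules fire with the min/max operators described above:

```python
# Table 3 encoded as (PDR, PSR, RSSI) level -> jamming type (rules 1-18).
RULES = {
    ("Low", "Low", "Low"): "Deceptive",
    ("Low", "Low", "High"): "No",
    ("Low", "Medium", "Low"): "Constant",
    ("Low", "Medium", "High"): "No",
    ("Low", "High", "Low"): "Reactive",
    ("Low", "High", "High"): "No",
    ("Medium", "Low", "Low"): "Constant",
    ("Medium", "Low", "High"): "No",
    ("Medium", "Medium", "Low"): "Random",
    ("Medium", "Medium", "High"): "No",
    ("Medium", "High", "Low"): "Reactive",
    ("Medium", "High", "High"): "No",
    ("High", "Low", "Low"): "No",
    ("High", "Low", "High"): "No",
    ("High", "Medium", "Low"): "No",
    ("High", "Medium", "High"): "No",
    ("High", "High", "Low"): "No",
    ("High", "High", "High"): "No",
}

print(RULES[("Low", "High", "Low")])   # rule 5 -> Reactive
```

Note the pattern visible in the encoding: a high RSSI with low PDR never maps to "No" unless PDR is high, which is exactly the consistency check the detection scheme relies on.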

Step 3 : Aggregation of the rule outputs

The membership functions of all rule consequents, previously clipped or scaled, are combined into a single fuzzy set.

Step 4 : Defuzzification

The most popular defuzzification method is the centroid technique. It finds the point representing the centre of gravity (COG) of the aggregated fuzzy set A on the interval [a, b]. A reasonable estimate can be found by calculating it over a sample of points; for the example above this yields COG = 0.64.

IV. Performance Evaluation

The study also evaluated the performance of a reactive jammer that jams all packets. We further studied the robustness of our approach under the condition that the jammer does not succeed in jamming all packets, but is still able to destroy 90% of them. Figure 8 shows the impact of these forms of reactive jamming on the correlation between the PDR and RSS. Green colour indicates no jamming, blue is reactive, red represents random, black indicates deceptive and magenta a constant jammer.
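The centroid technique of Step 4 reduces, over a sample of points, to a discrete centre-of-gravity computation. A minimal sketch with hypothetical sample values (the paper's aggregated set yields COG = 0.64; the numbers below are illustrative only):

```python
def centroid(z_points, memberships):
    """Discrete centre-of-gravity defuzzifier:
    COG = sum(z * mu(z)) / sum(mu(z)) over sampled points z."""
    num = sum(z * m for z, m in zip(z_points, memberships))
    den = sum(memberships)
    return num / den

# Hypothetical aggregated output set sampled over [0, 1].
z = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
mu = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]
print(round(centroid(z, mu), 2))   # crisp output z0
```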

CSI Journal of Computing | Vol. 2 • No. 4, 2015    Manju V. C., et al.    R6 : 57

The RSS distributions are further sorted based on their data-rates; for purposes of clarity we only show data-rates of -30 and -60 dBm. For example, in this experiment, 0% of packets in error due to signal have an RSS of about -60 dBm or less, while only 40% of packets lost with no jamming have an RSS of -83 dBm or less. Thus, by using a 'cut-off' value of -60 dBm, it would be possible to achieve 0% collision loss, and to capture about 90% of collision cases while incurring a false-positive rate of 2%. Thus, RSS can act as a good metric for inferring the cause of packet loss.
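The cut-off idea above amounts to a one-threshold classifier on the RSS of a lost packet. A sketch of that heuristic, with the -60 dBm value taken from the experiment and the labels chosen for illustration:

```python
def loss_cause(rss_dbm, cutoff_dbm=-60):
    """Classify the likely cause of a lost packet by its received
    signal strength: a strong signal that is nonetheless lost points
    to jamming/interference, a weak one to signal-related loss."""
    return "jamming/interference" if rss_dbm > cutoff_dbm else "weak signal"

print(loss_cause(-55))   # lost despite a strong signal
print(loss_cause(-75))   # plausibly lost due to the signal itself
```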

[Figure 8 comprises six panels: PDR vs RSS for PSR = 30%, 50% and 90%, and PDR vs PSR for RSS = -30 dBm, -60 dBm and -80 dBm; axes are PDR (%), RSS (dBm) and PSR (%).]

Fig. 8 : Input and output surface corresponding to the membership values of input and output




We used metrics such as PDR, PSR and RSSI to evaluate the performance of our approaches. Packet delivery ratio (PDR) is used to measure the effectiveness of jamming attacks, and consequently to describe the quality of a link. To detect jamming attacks, we choose PDR and SS as jamming attack metrics for our system. In a normal scenario with no interference, high SS corresponds to a high PDR, and if the SS is low, the PDR will also be low. However, a low PDR does not necessarily indicate low SS: it may be that all of a node's neighbours have died (from exhausting battery resources or from device faults), or the node may be jammed. The key observation here is that in the jammed case the SS has to be high while the PDR is low, and the combination of PDR, PSR and RSSI is capable of detecting any form of jamming attack, as discussed in the earlier sections.

The effectiveness of our virtual force iterative localization approach under both the region based model (RBM) and the SNR-based model (SBM) is assessed through simulation.

Parameters used in the simulations    Parameter value
Eelec                                 5 nJ/bit
Efs                                   10 pJ/bit/m^2
Emp                                   0.0013 pJ/bit/m^4
Eo                                    0.5 J
EDA                                   5 nJ/bit/message
do                                    70 m
Message size                          4000 bits
Popt                                  0.1
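The consistency check described above (low PDR together with high SS indicates jamming; low PDR with low SS points to dead neighbours or device faults) can be sketched as follows. The thresholds are illustrative, not the paper's calibrated values:

```python
def infer_state(pdr, ss, pdr_low=0.3, ss_high=-70):
    """Sketch of the PDR/SS consistency check: a healthy link has a
    high PDR; a low PDR with HIGH signal strength indicates jamming,
    while a low PDR with low SS suggests neighbour or link failure."""
    if pdr >= pdr_low:
        return "normal"
    return "jammed" if ss >= ss_high else "neighbour failure / weak link"

print(infer_state(pdr=0.1, ss=-60))   # low PDR, strong signal -> jammed
print(infer_state(pdr=0.1, ss=-90))   # low PDR, weak signal
print(infer_state(pdr=0.9, ss=-60))   # healthy link -> normal
```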

Simulation evaluation

By using MATLAB, our own simulator was implemented. A wireless network environment in a 300-by-300 feet field is simulated, within which network nodes were distributed uniformly. Each node's transmission range was set to 30 feet. The performance of three algorithms, CL, VFIL-Tr and VFIL-NoTr, was assessed in several network conditions, including variable jammer's NLB radius and network node densities. We placed the jammer at the centre of the simulation area in order to study the impact of those network parameters on the algorithms, so that the jammer was surrounded by multiple nodes.

V. Result

We first studied the effects of various network node densities on the localization performance. To adjust the network node densities, we varied the total number of nodes, N, deployed in the simulation. In particular, we chose N to be 200, 300, and 400, respectively.

Table 5 : Simulation deployment with 50 nodes

Node ID   Xpos (m)   Ypos (m)   PDR (%)   PSR (%)   RSSI (dBm)   Result
1         9.1214     72.0548    34        57        -115.1749    Reactive
2         65.4512    65.2501    70        30        -125.7777    No
3         8.5464     63.3436    26        90        -109.7472    No
4         30.0269    71.5057    33        36        -117.2348    Random
5         59.2348    90.5185    26        95        -128.5093    Reactive
6         73.8729    0.4461     85        70        -129.6269    No
7         56.2432    33.7716    33        7         -120.9359    Constant
8         45.8301    65.7190    57        40        -119.9987    Random
...       ...        ...        ...       ...       ...          ...
50        84.4873    75.3107    71        24        -131.1934    No

[Figure 9 is a chart of jamming results (in %) over the categories: Deceptive, Reactive, No Jamming, Random and Constant.]

Fig. 9 : Percentage of jamming attack present in the network model

We found that the effect of the different types of jammers on the WSN differs and conforms to the expected pattern, in order of decreasing force.

It shows the values of PDR for the 100-node configuration for different types of jammers, JIs and jnr. We mention here that 712 out of 720 simulations passed the chi-square test (238 out of 240 for the 25-node, 237 out of 240 for the 50-node and 237 out of 240 for the 100-node configurations; see Table 4).

The value of the packet send ratio observed at each node is shown in Fig. 10; the packet delivery ratio and the received signal strength of the various nodes are depicted in Figs. 11 and 12. The percentage of jamming attack present in the network model is shown in Fig. 9.

Fig. 10 : Packet send ratio of various nodes
Fig. 11 : Packet delivery ratio
Fig. 12 : Received signal strength

VI. Conclusions

The course of action for diagnosing jamming attacks has been investigated in this paper. The focus was kept specifically on localizing the jammer after a jamming attack is identified. The development of a multimodal detection technique can help in detecting jamming attacks with a lower false alarm rate and high precision. To mitigate the jamming effects and false data, we formulated two jamming models, a signal-to-noise ratio (SNR) based model and a location based model, and, to efficiently tackle jamming attacks in multiple-radio WSNs, we developed a novel jamming mitigation scheme for identifying nodes. This paper also investigates the scalability and stability of this method across various WSNs. The outstanding performance shown strengthens this capability, together with a routing protocol that controls trigger nodes so that receivers remain idle. We studied the idealized case of perfect knowledge by both the jammer and the network about each other's strategy, as well as the case where they lack this knowledge, and demonstrated the probability of jamming attacks using fuzzy logic controllers.

References

[1] A.D. Wood, J.A. Stankovic, G. Zhou, "DEEJAM: Defeating Energy-Efficient Jamming in IEEE 802.15.4-based Wireless Networks", The 4th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, San Diego, CA, 2007.




[2] A. Rakic, Fuzzy Logic, Fuzzy Inference, ETF Beograd, http://www.docstoc.com/docs/52570644/Fuzzy-logic-3, 2010.
[3] A. Sala, T.M. Guerra, R. Babuska, "Perspectives of fuzzy systems and control", Fuzzy Sets and Systems, vol. 156, pp. 432-444, 2005.
[4] C.C. Lee, "Fuzzy logic in control systems: fuzzy logic controller - parts I and II", IEEE Transactions on Systems, Man, and Cybernetics, vol. 20, pp. 404-435, 1990.
[5] E.H. Mamdani, S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller", Int. J. Man-Machine Studies, vol. 7, pp. 1-13, 1975.
[6] J.A. Reese, X. Li, M. Hauben, "Identifying drugs that cause acute thrombocytopenia: an analysis using 3 distinct methods", Blood, vol. 116, pp. 2127-2133, 2010.
[7] L.A. Zadeh, "Outline of a new approach to the analysis of complex systems and decision processes", IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-3, pp. 28-44, 1973.
[8] M. Cakiroglu, A.T. Ozcerit, "Jamming detection mechanisms for wireless sensor networks", Proceedings of the 3rd International Conference on Scalable Information Systems, 2008.
[9] M. Strasser, C. Popper, S. Capkun, M. Cagalj, "Jamming-resistant key establishment using uncoordinated frequency hopping", IEEE Symposium on Security and Privacy, pp. 64-78, 2008.
[10] V. Muralidharan-Chari, J. Clancy, C. Plou, M. Romao, P. Chavrier, G. Raposo, C.D. Souza-Schorey, "ARF6-regulated shedding of tumor cell-derived plasma membrane microvesicles", Current Biology, vol. 19, pp. 1875-1885, 2009.
[11] Y.W. Law, L. van Hoesel, J. Doumen, P. Hartel, P. Havinga, "Energy-efficient link-layer jamming attacks against wireless sensor network MAC protocols", Proceedings of the ACM Workshop on Security of Ad Hoc and Sensor Networks, 2009.
[12] W. Xu, W. Trappe, Y. Zhang, T. Wood, "The feasibility of launching and detecting jamming attacks in wireless networks", MobiHoc '05: Proceedings of the Sixth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2005.

About the Author

Manju V. C. received the BE degree from M.S. University, India, and the ME degree from M.K. University. Currently she is pursuing a PhD at the University of Kerala, India. Her research interests include wireless networks, wireless sensor networks and link layer protocols.

M. Sasikumar received his BSc (Engg) degree from College of Engineering, Trivandrum in 1968, his ME degree from IISc Bangalore in 1976, and his PhD degree in Electronic Circuits and Signal Processing from IIT Madras in 1984. He has worked in Government Colleges of Kerala and served as the Principal of College of Engineering, Trivandrum in 1998. He was the Senior Joint Director in the Directorate of Technical Education, Trivandrum during 1998-2000. He has published many articles in national and international conferences and journals, as well as a book entitled Power Supplies, and has coordinated and guided many research projects in the fields of signal processing and image processing. He is a senior member of the Computer Society of India.

An Efficient Approach for Storing and Accessing Huge Number of Small Files in HDFS

Shyamli Rao* & Amit Joshi**

* Department of Computer Engineering and Information Technology, College of Engineering Pune, Maharashtra, India. Email: [email protected]
** Department of Computer Engineering and Information Technology, College of Engineering Pune, Maharashtra, India. Email: [email protected]

Hadoop is a distributed framework which uses a simple programming model for the processing of huge datasets over a network of computers. Hadoop is used across multiple machines to store very large files, which are normally in the range of gigabytes to terabytes. High throughput access is achieved using HDFS for applications with huge datasets. In the Hadoop Distributed File System (HDFS), a small file is one which is smaller than 64 MB, the default block size of HDFS. Hadoop performs better with a small number of large files than with a huge number of small files. Many organizations, such as financial firms, need to handle a large number of small files daily. Low performance and high resource consumption are the bottlenecks of the traditional method. To reduce the processing time and memory required to handle a large set of small files, an efficient solution is needed which will make HDFS work better for large numbers of small files. This solution should combine many small files into a large file and treat each large file as an individual file. It should also be able to store these large files in HDFS and retrieve any small file when needed.

Keywords: Hadoop Distributed File System, small files

1. Introduction

Hadoop is an open source distributed framework which stores and processes large data sets, developed by the Apache Software Foundation. It is built on clusters of commodity hardware. Each single machine server stores large data and provides local computation, which can be extended to thousands of machines. It is derived from Google's file system and MapReduce. It is also able to detect and handle failures. It can be used by applications which process large amounts of data with the help of a large number of independent computers in a cluster. In the Hadoop distributed architecture, both data and processing are distributed across multiple computers. Hadoop consists of the Hadoop Common package, HDFS and the MapReduce engine. A Hadoop cluster consists of a NameNode and DataNodes. The function of the NameNode is to manage the metadata of the file system, while the actual data is stored at the DataNodes.

It is obligatory to divide the data across different DataNodes when a large amount of data cannot be stored on a single machine. Distributed file systems are responsible for the management of data storage over a network. The distributed file system used by Hadoop is called HDFS, and it has been considered a highly reliable storage system. HDFS is a scalable, distributed, high throughput and portable file system programmed in Java for the distributed framework of Hadoop. HDFS has a write-once-read-many model that allows high throughput access, simplifies data consistency and eases concurrency control requirements. HDFS helps to connect nodes, which are personal computers present in a cluster, over which data is distributed. The data files can then be accessed and stored as one seamless file system. HDFS works efficiently by distributing data and logic to nodes for parallel processing. It is well grounded, as it retains multiple copies of data and reassigns processing logic in the event of failures on its own.

Hadoop also comes with the MapReduce engine. For writing applications, Hadoop MapReduce is used to process large data in parallel on large clusters. A MapReduce job normally divides the input data into separate chunks. These chunks are then processed in parallel by the map tasks. The framework sorts the results of the maps, which are fed as input to the reduce tasks.
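The map/shuffle/reduce flow just described can be illustrated with a toy word-count job. The chunks, task functions and data below are purely illustrative (on a real cluster the map and reduce tasks run in parallel on different nodes; here they run sequentially):

```python
from collections import defaultdict

# Input split into chunks, as a MapReduce job would do.
chunks = [["a", "b", "a"], ["b", "b", "c"]]

def map_task(chunk):
    # Emit (key, value) pairs for every word in the chunk.
    return [(word, 1) for word in chunk]

# "Shuffle" phase: group the map outputs by key for the reducers.
grouped = defaultdict(list)
for chunk in chunks:
    for key, value in map_task(chunk):
        grouped[key].append(value)

def reduce_task(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

counts = dict(reduce_task(k, v) for k, v in grouped.items())
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```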
The file system stores both the input and the output of the job. The framework manages the scheduling of tasks, supervises them, and re-executes unsuccessful tasks. It includes the JobTracker and the TaskTracker. Client applications send MapReduce job requests to the JobTracker, and the JobTracker assigns these jobs to available TaskTracker nodes in the cluster. The JobTracker tries to keep the data and its processing in close proximity to

each other. MapReduce processing need not be done in Java, unlike Hadoop itself, which needs a Java base.

II. Background

A. Hadoop Distributed File System

The Hadoop Distributed File System provides several services such as the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.

The NameNode is the main part of an HDFS. It keeps the metadata information, which is the directory tree of all files present in the file system, and tracks where the file data is kept across the cluster. The data of these files is not stored, only the metadata. There is a single NameNode running in any DFS deployment; the NameNode is the master of HDFS. It manages the slave DataNode daemons to perform low-level input/output tasks. It also manages and keeps the information about the nodes on which the data blocks of a file are actually stored, the basis on which files are split into file blocks, and the functioning of the distributed file system. In the HDFS cluster, the NameNode is a single point of failure. Whenever client applications need to locate a file, or want to add, copy, move or delete a file, they always talk to the NameNode. The NameNode then responds to valid and successful requests by returning a list of DataNodes where the data actually resides.

In HDFS, the DataNode stores data. Generally there are many DataNodes, with data replicated across them to recover the data in case of failure. When the cluster is started, a DataNode connects to the NameNode and waits till the acknowledgement comes from the NameNode. For file system operations, the DataNode responds to requests from the NameNode. Once the NameNode has given the location where the data is stored, client applications can talk to a DataNode directly. Whenever MapReduce tasks are submitted to a TaskTracker which is near a DataNode, they immediately contact the DataNode for getting the files. TaskTracker operations should generally be performed on the same machine where the DataNode resides; this helps the MapReduce operations to be executed in close proximity to the data. To duplicate the data blocks, DataNodes communicate with each other, and thus redundancy is increased. The backup store of the blocks is provided by the DataNodes. To keep the metadata updated, DataNodes constantly report to the NameNode. If any one DataNode crashes or becomes unreachable over the network, the backup store of the blocks assures that file reading operations can still be performed.

In HDFS, the SecondaryNameNode acts as an assistant node to supervise the state of the cluster. In each cluster there is one SecondaryNameNode, which resides on its own machine in the cluster. The SecondaryNameNode differs from the NameNode in that any real-time changes to HDFS are not received or recorded by this process. After definite time intervals defined by the cluster configuration, the SecondaryNameNode communicates with the NameNode to get the instances of the HDFS metadata. It is a daemon that periodically wakes up, triggers a periodic checkpoint and then goes back to sleep. If the NameNode goes down, the SecondaryNameNode helps to lower the outage duration and data loss. If a NameNode fails, we need to manually reconfigure the cluster to use the SecondaryNameNode as the primary NameNode.

The function of the JobTracker is to distribute the MapReduce tasks between particular nodes containing the data in the cluster. These nodes might be present in the same rack. The JobTracker acquires jobs from the client applications. Once the code is submitted to the cluster, the JobTracker determines the execution strategy by finding out which files to process. It then allocates nodes to different tasks and superintends all running tasks. To determine the location of the data, the JobTracker consults the NameNode. The JobTracker discovers TaskTracker nodes that are available for processing the data; the work is then given to the located TaskTracker nodes. In case of task failure, that task is automatically restarted by the JobTracker on some other node, the number of retries having already been defined. The only JobTracker in a Hadoop cluster runs on the master node.

The TaskTracker in the cluster is the one that receives tasks such as Map, Reduce and Shuffle operations from a JobTracker. The number of tasks that a TaskTracker can accept is configured with a set of slots. To ensure that a process failure does not bring down the TaskTracker, it manages separate Java Virtual Machine (JVM) processes to do the work. It supervises these processes and captures the output and exit codes, notifying the JobTracker whether the process completed successfully or not. After every fixed interval of time, the TaskTrackers also send out pulse messages to the JobTracker, to confirm that the JobTracker is still alive.

B. Small File Problem

The normal approach for storing a large number of small files is to store the files directly in HDFS without any preprocessing. This has many disadvantages, among them:

1) In HDFS, if the data in files is significantly smaller than the default block size, the performance reduces drastically.

2) If small files are stored normally in HDFS, a lot of space is wasted in storing the metadata of all files. When small files are stored in HDFS, there will be lots of seeks and jumps from one DataNode to another to get a file, which is an ineffective data access method.

III. System Design

We have tried to eliminate the drawbacks of storing a large number of small files in HDFS, so that the time required to read and write the files is much less and the metadata storage decreases significantly.

While writing the files into HDFS, we first sort the files and then store them in order to get better prefetching. Our program stores similar files (similarity of files is on the basis of their extension) zipped together into HDFS. Once the sorting has been done, we zip the small files till

CSI Journal of Computing | Vol. 2 • No. 4, 2015 Shyamli Rao, et. al. R7 : 63

the size of the zipped file, i.e. the combined file, reaches 64 MB or the extension of the small files changes. This approach helps, as it exploits locality of reference: a user will refer to similar types of files for later use. Storing files into HDFS requires less time since we are zipping the files. Therefore, for example, instead of writing thousands of small files into HDFS, we write only a few zip files into HDFS, each zip containing hundreds of small files.

Fig. 1: Write Operation

For the first read, a map is created for each extension; this map is then searched for the given small file name and returns the combined file name. This combined file is copied to the local machine. Since the whole combined file is copied to the local machine, the similar files present in the zipped file are prefetched. For the next reads, the time required to get the combined file name from the small file name is less, since the map is already created and the search time in a map is O(1), i.e. constant. If a similar small file is read, the same combined file name is returned, and it takes even less time since the combined file is already present on the local machine.

Fig. 2: Read Operation

A. File Merging Algorithm
1) Start the program
2) Select the Write button to write the files into HDFS
3) Enter the file name (this file contains the pathnames of all the files which you need to store in HDFS)
4) Our program reads the file, sorts it on the basis of extension, and stores it in the output.txt file
5) Initialisation of variable: index = 100 (some arbitrary number)
6) Read each line from output.txt
7) Create a zip file archive+index.zip
8) while (line != null)
9) if (sum(size of read files) is less than 64 MB and extension doesn't change)
10) then
    a) add files into zip
11) else
12)     close zip
13)     add entry into hashtable.txt on the basis of extension
14)     copy zip file from local to HDFS
15)     index++
16)     create a new zip file archive+index.zip
17) end if-else
18) read next line of output.txt
19) end while
20) close zip file
21) copy zip file from local to HDFS

B. File Reading Algorithm
1) Start the program
2) Select the Read button to read from HDFS
3) Enter the small file name
4) For the first read, create a map by looking at the hashtable.txt created by the merger program
5) For later reads, check for the match in the map (which requires O(1) time)
6) Get the archive index number
7) if (archive+index.zip is not on the local machine)
8)     copy the archive from HDFS to the local machine
9) end if
10) read the small file from the zip present on the local machine

C. File Sorting Algorithm
1) Read the file contents from the filename (given by the user)
2) Read a line
3) while (line != null)
4)     split on "." (split on extension)
5)     interchange extension and name
6)     add this to the list
7)     read next line
8) end while
9) use Collections.sort (in-built Java function) to sort the list
10) for (each list item)
11)     split on extension
12)     interchange extension and name
13)     write into output.txt
14) end for

D. File Searching Algorithm
1) Find the smallest key which is greater than or equal to the given target (small file name)
2) if (smallfilename lies within the range {smallfilename1} to {smallfilename2})
3)     return the index
4) else
5)     print "file doesn't exist"
6)     exit program
7) end if-else
IV. Evaluation and Results

A. Experimental Environment
The multinode cluster had 4 nodes, one master and 3 slave nodes. Each of these machines had the following configuration:
1) Processor: Intel Core i7-2600 CPU @ 3.10GHz
2) RAM: 4 GB
3) HDD: 500 GB
4) Operating System: Linux
5) Version: Ubuntu 12.04 LTS
6) Java Version: 1.6.0_24
7) Hadoop: version 1.2.1
8) Eclipse IDE: Helios
9) Network Protocol: Secure Shell
All the machines were connected through Ethernet.
The time taken for read and write operations was measured for both the original HDFS and the proposed solution, for the single node as well as for the multinode cluster. Sets of 1000, 2000, 4000, 6000, 8000 and 10000 files were used, considering a mix of text and pdf files. These files were first copied into HDFS and then read back to the local machine. The time taken to complete these operations was noted down, and similar readings were obtained for the proposed solution. This was repeated three times, and the average of the results was calculated and used for analysis.
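The measurement procedure (time each operation, repeat it three times, and use the mean) amounts to the following sketch. TimingHarness and averageMillis are illustrative names of our own, and the Runnable stands in for the actual HDFS write or read of the file set:

```java
// Sketch of the measurement procedure used in the evaluation: time an
// operation, repeat it, and average the results. The Runnable is a stand-in
// for the actual HDFS copy of the file set (1000-10000 mixed .txt/.pdf files).
public class TimingHarness {
    static double averageMillis(Runnable operation, int repetitions) {
        long totalNanos = 0;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            operation.run();                     // e.g. copy the files to HDFS
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (repetitions * 1e6); // mean time in milliseconds
    }
}
```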

B. Single-Node Cluster

1) Read Operation: The results obtained for the read operation are summarized below:

Fig. 3: Time taken for read operation in single node cluster

An Efficient Approach for Storing and Accessing Huge Number of Small Files in HDFS R7 : 64

2) Write Operation: The results obtained for the write operation in the single node cluster are summarized in a similar manner, as shown below:

Fig. 4: Time taken for write operation in single node cluster

The data obtained clearly indicates that the proposed solution is faster than the default original HDFS for both read and write operations in the single node cluster setup.

C. Multi-Node Cluster

1) Read Operation: In the following graph, we have compared the read time required in HDFS and in the proposed solution for .txt files:

Fig. 5: Time taken for reading .txt files in Multi-Node Cluster

In the following graph, we have compared the read time required in HDFS and in the proposed solution for .pdf files:

Fig. 6: Time taken for reading .pdf files in Multi-Node Cluster

2) Write Operation: The results obtained for the write operation in the multinode cluster are summarized in a similar manner, as shown below:

Fig. 7: Time taken for Write operation in Multi-Node Cluster

D. READ Analysis

The first read takes more time since the map is created, but the next reads take significantly less time, as depicted in the figure below:

Fig. 8: Time taken for different reads in the proposed solution
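The read path and the searching algorithm of Section III map naturally onto a sorted map. The sketch below is our own simplification: hashtable.txt entries and the local-archive cache are modelled in memory, and we assume each key is the last small file name stored in an archive, so that the "smallest key greater than or equal to the target" from the File Searching Algorithm is a single TreeMap.ceilingEntry lookup:

```java
import java.util.*;

// Sketch of the read path: the first read loads hashtable.txt into a sorted
// map, later reads resolve a small file name to its archive index with one
// lookup, and an archive is copied from HDFS only if it is not already on
// the local machine. All names and the in-memory cache are illustrative.
public class SmallFileReader {
    private final TreeMap<String, Integer> lastNameToArchive = new TreeMap<>();
    private final Set<Integer> localArchives = new HashSet<>();
    int hdfsCopies = 0;   // counts simulated HDFS -> local archive copies

    SmallFileReader(Map<String, Integer> hashtableEntries) {
        lastNameToArchive.putAll(hashtableEntries);
    }

    // File Searching Algorithm: the smallest key >= target identifies the
    // archive whose name range would contain the target; null means the
    // file doesn't exist.
    Integer findArchive(String smallFileName) {
        Map.Entry<String, Integer> e = lastNameToArchive.ceilingEntry(smallFileName);
        return e == null ? null : e.getValue();
    }

    // File Reading Algorithm: fetch archive+index.zip only on a cache miss,
    // so a second read of any file in the same archive is served locally.
    boolean read(String smallFileName) {
        Integer idx = findArchive(smallFileName);
        if (idx == null) return false;
        if (localArchives.add(idx)) {
            hdfsCopies++;             // stand-in for copying the zip from HDFS
        }
        return true;                  // file is extracted from the local zip
    }
}
```

This makes the prefetching effect visible: reading b.pdf pulls its whole archive, so a subsequent read of any other .pdf file in the same archive causes no further HDFS transfer.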

V. Related Work

Bo Dong, Jie Qiu, Qinghua Zheng, Xiao Zhong, Jingwei Li and Ying Li in [1] designed a way of effective storage and an access pattern for large numbers of small files in HDFS. For storing and accessing small files, the correlations between the files and the locality of reference remaining among small files in the context of PPT courseware are taken into account. In the first step, they merge all the correlated small files of a PPT courseware into a single big file, which effectively reduces the metadata load on the NameNode. In the second step, they introduce a two-level prefetching mechanism; applying this mechanism, the efficiency of accessing small files is improved. Their experimental results indicate that the presented design efficiently reduces the burden on the NameNode and also improves the efficiency of storing and accessing huge numbers of small files on HDFS. However, it does not take into account other types of files such as .txt, .pdf, .png, .odt, etc.

Chandrasekar S, Dakshinamurthy R, Seshakumar P G, Prabavathy B and Chitra Babu in [2] proposed a solution based on the work of Dong et al., where a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism is built to access the individual files from the corresponding combined file. Efficient management of metadata for small files allows greater utilization of HDFS resources. However, it does not sort the files on the basis of their extension, which makes reading similar types of files difficult.

Yang Zhang and Dan Liu [3] proposed a small-files processing strategy with an efficient file merger, which builds a file index and uses a boundary file block filling mechanism. It successfully accomplishes its goal of effective file separation and file retrieval. Their experimental results clearly show that their design improves the storing and processing of huge numbers of small files in HDFS. However, it does not take into account a file correlation mechanism, which can reduce access time in HDFS; the solution can be enhanced further by effectively implementing such a mechanism.

VI. Conclusions

HDFS was originally developed for storing large files. When a large number of small files are stored in it, its efficiency is reduced. The approach designed in this solution can improve small-file storage and access efficiency significantly. Files are sorted on the basis of their extensions and then merged into zip files whose size does not exceed the HDFS block size. This helps to locate any small file easily. A file read cache has been established, so that the program can read a small file quickly. The experimental results show that the approach can effectively reduce the load on HDFS and improve the efficiency of storing and accessing small files. For 10,000 small files, the writing time is reduced by 80% and the reading time by 92% on a single-node cluster, while for the multi-node cluster the percentage decrease for writing is 77%, and for reading it is 89% for text files and 15% for .pdf files.

VII. Future Work

As future work, this solution can be enhanced further to provide a more advanced file correlation framework. This framework should provide a mechanism to combine files of a similar domain. An append operation can also be provided to add similar files into the existing combined files. The solution can also be extended to other types of file formats.
References

[1] Bo Dong, Jie Qiu, Qinghua Zheng, Xiao Zhong, Jingwei Li, Ying Li, "Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files". In proceedings of Services Computing (SCC), 2010 IEEE International Conference, Miami, FL, July 2010, pp. 65-72.
[2] Chandrasekar S, Dakshinamurthy R, Seshakumar P G, Prabavathy B, Chitra Babu, "A Novel Indexing Scheme for Efficient Handling of Small Files in Hadoop Distributed File System". In proceedings of Computer Communication and Informatics (ICCCI), 2013 International Conference, Coimbatore, Jan. 2013, pp. 1-8.
[3] Yang Zhang, Dan Liu, "Improving the Efficiency of Storing for Small Files in HDFS". In proceedings of Computer Science and Service System (CSSS), 2012 International Conference, Nanjing, Aug. 2012, pp. 2239-2242.
[4] Running Hadoop on Ubuntu Linux (Single-Node Cluster) by Michael Noll, http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[5] Running Hadoop on Ubuntu Linux (Multi-Node Cluster) by Michael Noll, http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[6] Hadoop commands and Hadoop concepts, http://hadoop.apache.org/
[7] Hadoop Cluster Architecture, http://wiki.apache.org/hadoop/, http://www.guruzon.com/6/hadoopcluster/architecture/
[8] Hadoop Shell Commands, https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
[9] Eclipse plugin for Hadoop, http://wiki.apache.org/hadoop/EclipsePlugIn
[10] Tom White, "The Small Files Problem", https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

About the Authors

Shyamli Rao received her B. Tech. in Computer Engineering from College of Engineering Pune and has been working in a private company for a year. She will pursue her M. Tech. at IIT Bombay. Her areas of interest are Cloud Computing, Parallel Computing and Data Mining.

Amit Joshi is an Assistant Professor in the Computer Engineering and Information Technology department at College of Engineering Pune. He received his M. Tech. from Dr. Babasaheb Ambedkar Technological University, Lonere, Raigad, and has been selected for the Ph.D. program at NIT, Trichy. His research interests include Parallel Computer Architectures and Parallel Computing using Multicore Architectures.

Cross-Referencing Cancer Data Using GPU for Multinomial Classification of Genetic Markers

Abinaya Mahendiran, Srivatsan Sridharan, Sushanth Bhat and Shrisha Rao

IIIT Bangalore, 26/C Electronics City, Hosur Road, Bangalore 560 100, Emails: [email protected], [email protected], [email protected], [email protected]

All the data used and generated for this work, as well as the complete source code and outputs, are available for download from https://github.com/abinayam/OSProject9C2.

We propose a classification scheme for genetic markers, to associate them with a single type or multiple types of cancer. Given multiple unknown datasets with genetic markers that are associated with particular types of cancer, we classify these genetic markers into two different classes. One class comprises single genetic markers, each associated with a particular type of cancer but with the potential to cause other cancers. The other class comprises multiple genetic markers, each associated with a particular type of cancer, which in combination also have the potential to be associated with a different type of cancer. Use of this classification scheme would be helpful in predicting the possibility of new cancers from available data about genetic markers, and would thus have diagnostic or therapeutic value in oncology. A graphics processing unit (GPU) is used to accelerate the classification, making the entire scheme faster than the same process running on a CPU. A feature subset selection algorithm removes irrelevant features by comparing all the features in the dataset with a threshold file, which contains the features relevant to the possibility of causing cancer. This work enables the classification of all given genetic markers into two categories, of which one category in particular consists of markers whose influence in causing multiple types of cancer is otherwise unknown.

Index Terms : Cancer, classification, CUDA, genetics, GPU, oncology, machine learning, prediction algorithms, supervised learning.

1. Introduction

Cancer incidence and deaths are rising worldwide as a result of the growth and aging of the human population [1]. There are certain genetic markers associated with every type of cancer. It is commonly accepted that the detailed study of these genetic markers would help unlock the mysteries of how cancers develop in human beings. With large datasets of clinical or laboratory results about genetic markers becoming available [2], they can be analyzed in order to understand the similarities and differences between different types of cancers and their correlates.

The main result of this work is to establish a method to detect otherwise unknown relationships between the occurrences of cancers and those of the genetic markers. It also aims at cross-referencing cancer data for multinomial classification of the genetic markers. This implies the classification of genetic markers causing cancer into two sets of overlapping classes: multiple genetic markers that are associated with a single type of cancer, and single genetic markers that are associated with multiple types of cancer.

One class suggests the possibility of single genetic markers associated with particular types of cancer, as having the potential to cross the threshold of causing new types of cancer in a manner that is not directly known from a bare scrutiny of the data as such. Thus, this leads to a better understanding of the association of single genetic markers with different types of cancer.

The other class indicates the possibility of multiple genetic markers that are individually associated with different types of cancer, but which in combination have the potential to cross a threshold for causing a new cancer. This class suggests that a specific set of multiple genetic markers, each causing cancer, can in combination give rise to the possibility of some other type of cancer. So another benefit of this classification scheme is to predict the possibility of a new cancer in a patient if the multiple genetic markers in combination have the potential to cross the threshold value of a new type of cancer.

The use of the parallel computing power of the graphics

processing unit (GPU) makes the entire process execute in near-linear time even for large datasets.

One of the best ways to analyze such large datasets is to use a machine learning algorithm to identify multiple genetic markers that are associated with the same type of cancer, as well as identifying single genetic markers that are associated with multiple types of cancers [3]. This is the premise underlying our work.

The main thrust of our approach is in the process of classification of the genetic markers that are associated with cancers into two overlapping classes. This comprises extracting those genetic markers that have been identified in prior literature as potentially causing certain types of cancers, followed by their classification into two types of overlapping classes: one that consists of all the genetic markers that are associated with a single type of cancer, and a second that consists of single genetic markers that are associated with multiple types of cancer. By doing so, we also try to establish the correlations between the different types of cancers and their genetic markers.

In order to accomplish this task, a machine learning algorithm based on k-nearest neighbors (kNN) is used. The available raw sets of data are subjected to preprocessing before analysis. Data cleansing is carried out in order to make sure that the dataset has relevant features, i.e., common attributes among the different types of cancer, following which the data are divided into training and testing datasets. The process of cleansing is carried out by a feature subset algorithm where attributes are analyzed for data type as well as the value each genetic marker holds. The division of the given dataset into training and testing datasets is done using a centroid algorithm where the centroid of all the given dataset points is calculated and an intercept is generated. Based on the position of each dataset point on the hyperplane, it is classified as belonging to either the training dataset or the test dataset. Approximately sixty percent of the dataset points are grouped into the training dataset and the remaining forty percent are grouped into the test dataset, as is standard [4]. The training dataset has the genetic markers along with the types of cancer they can cause. This dataset is used in the training phase. The testing dataset contains genetic markers without indication of the types of cancer they can cause. The kNN algorithm is able to predict the correlations of the types of cancers with the genetic markers given in the test dataset.

The entire work is implemented using CUDA parallel programming on a GPU system, with innovative use of supervised machine learning algorithms to classify the genetic markers. CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the GPU. The CUDA architecture consists of several components, such as parallel compute engines, Nvidia GPUs, OS kernel-level support for hardware initialization and configuration, and a user-mode driver which provides a device-level API for developers. The GPU can execute large numbers of threads in parallel and operates as a co-processor with the conventional central processing unit (CPU).

Besides its many well-known applications in computer graphics, GPU computing has been applied recently in several other fields, such as biomedical engineering [5], distributed storage systems [6], medical imaging [7], pattern matching [8], processing of streaming data [9], and others.

A GPU is hierarchically split into grids, blocks and threads, providing a high degree of parallelism [10]. Each of the grids, blocks and threads has the capacity and resources to execute different code in parallel. This parallelism increases the performance of the chosen machine learning algorithm, kNN. The algorithm generates an output of the classification scheme that classifies the genetic markers associated with cancer into two different types of classes, as noted previously. It is evident from the output that the percentage of single genetic markers causing multiple cancers is high. There are some genetic markers from the given dataset that do not fall in either of the two classes. These are regarded as outliers that do not correspond to any type of cancer. Prior work [11]–[13] has only portrayed the occurrence of cancer given a set of genetic markers, but does not mention the correlation of the genetic markers to cancers. Such works also do not harness the parallel computing capability of the GPU. Designing the architecture over a parallelized environment has increased the performance of the kNN algorithm [14] to a significant extent.

Traditional CPU implementations of kNN take exponential time to run on huge datasets, whereas the GPU implementation of kNN takes almost-linear time for execution. The accuracy of the system can further be improved by performing regression analysis, making the data work as both the training and the test dataset. Regression analysis is done in order to minimize the error in the predictor model chosen. The testing dataset is further divided into subsets, of which a few sets can be used for training purposes and the remaining can be used for testing purposes in consecutive iterations. Ultimately, all the elements of the entire dataset would have been used as both training and testing data. This would result in reduced prediction errors. The classification method uses a fast-kNN approach for classification and cross-referencing of the genetic markers into two overlapping sets of classes. The feature subset selection algorithm is used to extract only the relevant features by comparing the input genetic marker dataset with the threshold file. The threshold file contains only those parameters that are relevant for this classification process. The feature subset selection algorithm compares the input genetic marker dataset with the parameters of the threshold file, and removes all unnecessary features or irrelevant parameters from the classification process.

To summarize: this work proposes a method for classification of genetic markers into two sets of overlapping classes. The current approaches [11] [12] [13] [15] predict only the occurrence of a cancer from the given genetic marker dataset, but we propose a classification method to correlate different sets of genetic markers and associate them with either a single or multiple types of cancer. This method will benefit end users (patients and oncological diagnosticians) in

the following ways:

• By predicting the possibility of a new type of cancer from the patient's genetic marker dataset. This happens by correlating different genetic markers that are known to cause cancers in patients, to predict the possibility of the same markers in combination crossing a threshold value for causing a different type of cancer. This has obvious diagnostic implications.

• By predicting the possibility of a new cancer from a single genetic marker in a patient. This happens by exploring the possibility of the association of a single genetic marker that could cause cancer in a patient's body with a different type of cancer, so that a patient with that marker may also carry a risk for another cancer that was not previously known.

• The power of parallel computing using the GPU is harnessed to make the execution of the classification method much faster even for large datasets. The classification method uses a fast-kNN approach for rapid classification and association of the genetic markers into two overlapping sets of classes. The relevant features are extracted from the genetic marker dataset using a feature subset selection algorithm, where the relevant feature is extracted by comparison with the parameters in the threshold file.

2. Classification of Genetic Markers

2.1 Existing Systems

GenePattern [16] [17] is a powerful genomic analysis platform that provides access to more than 230 tools for gene expression analysis, proteomics, SNP analysis, flow cytometry, RNA-seq analysis, and common data processing tasks. A web-based interface provides easy access to these tools, and allows the creation of multi-step analysis pipelines that enable reproducible in silico research. The features of the GenePattern tool are, in brief, the following: genes that are distinct in different phenotypes can be chosen; supervised and unsupervised learning algorithms can be used to predict the class to which a gene sequence belongs; and there is a provision to convert data from one file format to another. The tool provides differential analysis and gene marker selection. The supervised learning algorithms used by the tool can predict the correct classes. Using unsupervised learning, a biologically relevant unknown taxonomy is identified by a gene expression signature.

Other existing systems, especially those that are used for breast cancer recognition [18]–[20], mainly deal with the question of detecting whether some given genetic markers would cause a particular type of (breast) cancer or not. There is a need to develop a system that could identify particular genetic markers as having the potential to cause multiple types of cancer rather than just one, and would also correlate different genetic markers to jointly infer a risk of cancer.

2.2 Our System

With large sets of raw data being available, it is worthwhile to come up with a machine-learning algorithm that would identify multiple genetic markers that are associated with the same type of cancer, as well as single genetic markers that are associated with multiple types of cancers. From a given set of genetic markers, features are extracted by applying a feature extraction algorithm. Of all the obtained features, only the relevant features are taken into consideration by a feature subset selection algorithm. Once this is done, the given data are partitioned into training and test data using a data partitioning algorithm. A supervised learning model using the kNN algorithm is implemented to classify the given genetic markers into two classes: a class that associates a single genetic marker with multiple cancers, and a class that associates multiple genetic markers with a single cancer. The system designed gives the correlation between the cancers and the genetic markers. There are a few prediction algorithms based on decision tree induction and Bayesian classifiers [21] that help in identifying and predicting the occurrence of cancer from the given dataset of genetic markers.

2.2.1 Parallelism Using GPU

When the dataset is huge and complex, solving the machine learning problem becomes computationally expensive. Specifically, the time to run optimization algorithms increases drastically with increasing dataset sizes. So, parallelism is a good way to speed up computations, and GPU computing is a good way to achieve it. A GPU has a large number of Arithmetic and Logic Units (ALUs) compared to a traditional CPU. As a result, it can do large amounts of complex computation at a faster rate than CPUs. GPUs are effective at performing rapid calculations due to their pipeline, which can perform multiple computations at the same time. A CPU can also do some parallel computation, but it is limited by the number of cores available. In our system, the Nvidia [10] [22] Tesla GPU has been used, in which there are 128 cores divided into 8 clusters consisting of 16 cores each. Each core is capable of handling multiple threads. A number of threads can be assigned according to the records that must be processed in parallel. Multiple threads can be executed in a single block, and each thread is identified by the block number and the thread index. The collection of blocks is represented by a single grid. This architecture gives the ability to process massive amounts of data in parallel, thus reducing the computational time to a great extent. Parallelization through the GPU is done in the following two phases: learning and prediction.

2.2.2 Learning

The training dataset is further divided into several parts based on the number of cores in the GPU. For example, if there are 4 GPUs with 512 cores, then the training dataset is divided into 512 parts, and is fed into the cores simultaneously during the training phase in order to speed up the computation. Each core acts on the dataset, processes it and produces the results. The individual results are then summed up and taken as the final output of the training


Fig. 1 : Architecture of System for Classification of Genetic Markers for Cancer

phase of the algorithm. The phase of learning takes as input the preprocessed data about the genetic markers that cause cancers, and produces two different subsets, the test and the training datasets. These training and test datasets are generated by the process of cross-validation in order to obtain a better data partition.

2.2.3 Prediction

In the prediction phase, the test dataset is also divided into subsets based on the number of cores in the GPU. Then the individual subsets are processed by the cores simultaneously and the results of the prediction algorithm are obtained. The final outcome of the prediction phase classifies the given test data into one of the two classes: either the class of multiple genetic markers that are associated with a single type of cancer, or the class of genetic markers that are associated with multiple types of cancer. The prediction phase includes weighted voting to determine the correlation between the training samples and the testing samples. Based on the results, the test samples are classified into either of the two classes. These classified samples are then used by the algorithm for learning, which in turn helps improve the accuracy of the supervised prediction model.

Supervised learning algorithms take a set of input data along with the associated class labels, and build a predictor model based on the selected features, to produce relevant responses by assigning appropriate class labels to new sets of data when input to the model. As part of the data preprocessing phase, some techniques are used to handle specifically the missing data in a way that is most appropriate for the machine learning algorithm. One such technique is the method of ignoring instances with unknown feature values: this method ignores instances which have at least one unknown feature value. The value of the feature that occurs most often is selected to be the value for all the unknown features.

The architecture of our system is as shown in Fig. 1. Present systems only predict the occurrence of specific cancers given genetic markers, but our classification method deals with the correlation of the genetic markers with single or multiple types of cancers. Table 1 depicts the differences between the existing approaches and our genetic marker classification system.

3. Data Preprocessing and Extraction

3.1 Data Sources

The datasets used as inputs for this classification process contain parameters related only to the genetic markers, and do not feature any personally identifiable or other sensitive information. Some parameters present in those input datasets may not be useful, and some parameters might be missing values. So, to remove redundant parameters and to preprocess the entire input dataset, the use of a threshold file becomes necessary.

The threshold file contains parameters that are relevant for classification. It contains the genetic markers causing cancer, the respective threshold values that, when crossed by a genetic marker, are known to cause cancer in humans, and the respective value associated with each genetic marker. For instance, the colon cancer dataset contains parameters labeled "Tumor n," where n represents any number. These labels note particular tumors that are associated with the genetic markers causing cancers, and some normal (not cancer-causing) genetic markers too. The

threshold file contains only the genetic markers (labelled "Tumor") that cause cancer.

1) Colon Cancer Dataset from The University of Edinburgh, School of Informatics [23]: This dataset contains expression levels of 2000 genetic markers taken in 62 different samples. For each sample it is indicated whether it came from a tumor biopsy or not. This is used in our work by treating the genetic markers as records with 62 attributes each. The size of the dataset holding the genetic markers is 1.9 MB of data, with 529 KB of names and 207 bytes of labels.

2) Lung Cancer Dataset - Broad Institute [24]: The lung cancer dataset contains expression levels of 7129 genetic markers taken over 72 different samples. The size of the dataset holding the genetic markers is 3.8 MB, with 72 samples.

3) Breast Cancer Dataset - Broad Institute [25]: The breast cancer dataset contains expression levels of 1212 genetic markers taken over 55 different samples. For each sample, there is an associated class label which indicates whether the sample is associated with cancer or not. The threshold file contains the class labels of those genetic markers that are associated with breast cancer. The size of the dataset holding the genetic markers is 1.5 MB, with 55 different samples. The threshold file is 0.7 MB, holding only those genetic markers that are associated with the cancer. This threshold file is used to extract the relevant parameters from 1212 genes taken over 55 samples.

4) Prostate Cancer Dataset - Broad Institute [25]: The prostate cancer dataset contains expression levels of 5896 genetic markers taken over 78 different samples. The size of the dataset holding the genetic markers is 2.68 MB, with 78 samples.

The entire collection of obtained cancer data is merged into a single compound dataset. This compound dataset is what we use for our work.

3.2 Pre-processing Module

The data pre-processing module deals with the following:

1) Data format [2] [26]: The raw data that are available are in different file formats. To implement any data mining technique, the dataset has to be in a single format. The file format chosen for the compound dataset is the Attribute-Relation File Format (ARFF) [27], which is an ASCII text file format.

2) Data cleansing [28]–[30]: If the dataset has any features that hold irrelevant values, i.e., null or non-numeric values, those features are discarded before the dataset is further processed by the feature subset selection algorithm.

3.3 Feature Subset Selection Algorithm

Feature subset selection is the process of identifying and removing irrelevant and redundant features [31], [32]. This reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. Generally, features are characterized as relevant or irrelevant. The former are features that have an influence on the output, and whose roles cannot be subsumed by the rest. Irrelevant features are those with no influence on the output, and whose values are generated at random for each example. Feature subset selection can thus also be thought of as the process of identifying and removing possibly irrelevant and redundant features [33], for the purpose of reducing the dimensionality of the data and enabling learning algorithms to operate faster and more effectively.

In our case, the compound dataset, which is in a common format, contains genetic markers for different types of cancer. Each type of cancer may have a different set of features associated with it. In order to make sure that all the classes of cancer have only common features, the relevant features are selected by the feature subset selection algorithm [34]–[36]. For each feature, the feature names and data types are compared between the data file and the threshold file; if they are the same, the feature is added to the feature set. The actual data are read, and the data types of the chosen features are checked: if there are any mismatches in the data types even when the features referred to are the same, then those features are discarded. After this filtering step, the relevant features are selected and written to a file along with the associated output labels. The main aspects of the feature subset selection algorithm are as follows.

3.4 Redundant Features

A redundancy exists whenever a feature can take the role of another. Feature selection algorithms in general have two components [37]: a selection algorithm that generates proposed subsets of features and attempts to find an optimal subset, and an evaluation algorithm that determines how good a proposed feature subset is, returning some measure of goodness to the selection algorithm. However, without a suitable stopping criterion, the feature selection process may run exhaustively or forever through the space of subsets. If

Table 1 : Comparison of the proposed classification system with the state of the art

Factor / Feature                 Existing Systems [11] [12] [13] [15]   Our Classification System
Correlation among Cancer Types   No                                     Yes
Processing                       Serial                                 Parallel
Processor                        CPU                                    GPU
Prediction Time                  Exponential                            Linear
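Since the compound dataset is stored in ARFF, a minimal illustrative file may help; note that the relation and attribute names below are invented for illustration and are not taken from the paper's actual compound dataset:

```
% Hypothetical marker-expression data in ARFF
@relation cancer_markers

@attribute Tumor1 numeric
@attribute Tumor2 numeric
@attribute class {cancer, normal}

@data
0.42, 1.87, cancer
0.05, 0.11, normal
```

The `@attribute` declarations carry exactly the name and type information that the feature subset selection algorithm compares against the threshold file.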

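The name-and-type matching between the data file and the threshold file described in Section 3.3 can be sketched in Python. This is an illustrative rendering under stated assumptions, not the authors' implementation; the feature lists and the type tables below are invented for the example:

```python
def feature_subset(data_features, data_types,
                   threshold_features, threshold_types):
    """Keep only the features that appear in both the data file
    and the threshold file with the same data type, mirroring
    the matching described in Section 3.3."""
    selected = []
    for f in data_features:
        for t in threshold_features:
            # A feature survives only if its name AND its data
            # type agree across the two files.
            if f == t and data_types[f] == threshold_types[t]:
                selected.append(f)
                break
    return selected

# Hypothetical feature names and types for illustration.
data_types = {"Tumor1": "numeric", "Tumor2": "numeric", "id": "string"}
threshold_types = {"Tumor1": "numeric", "Tumor2": "string"}
print(feature_subset(["Tumor1", "id", "Tumor2"], data_types,
                     ["Tumor1", "Tumor2"], threshold_types))
# ['Tumor1']
```

Here "Tumor2" is dropped despite matching by name, because its declared types disagree between the two files, which is the second filtering step the section describes.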

Attribute-Relation File Format (ARFF) [27], which is an two components [37], a selection algorithm that generates ASCII text file format. proposed subsets of features and attempts to find an optimal 2) Data cleansing [28]–[30]: If the dataset has any features subset, and an evaluation algorithm that determines how that holds irrelevant values, i.e., null or non-numeric good a proposed feature subset is, returning some measure values, those features are discarded before it is further of goodness to the selection algorithm. However, without a Cross-Referencing Cancer Data Using GPU for processed by the feature subset selection algorithm. R8 :suitable 72 stopping criterion, the feature selection process may Multinomial Classification of Genetic Markers run exhaustively or forever through the space of subsets. If 3.3 Feature Subset Selection Algorithm the addition or deletion of any feature does not produce the aaddition better subset,or deletion then itof cananybe feature considered does asnot a produce stopping a 3.5 Partitioning of Dataset Feature subset selection is the process of identifying and bettercriterion. subset, Ideally, then it featurecan be considered selection methods as a stopping search criterion. through removing irrelevant and redundant features [31], [32]. This The dataset in the ARFF format is then segregated into Ideally,the subsetsfeature ofselection features methods and try tosearch find thethrough best onethe amongsubsets reduces the dimensionality of the data and enables learning two subsets, the test dataset and the training dataset. The of featuresthe possible and candidatetry to find subsets the accordingbest one toamong some the evaluatio possiblen algorithms to operate faster and more effectively. Generally, two are later used for the classification of the genetic markers candidatefunction. subsets according to some evaluation function. features are characterized as relevant or irrelevant. The former into two different classes. 
are features that have an influence on the output, and whose roles cannot be subsumed by the rest. Irrelevant features are those with no influence on the output, and whose values are generated at random for each example.

Feature subset selection can also be thought of as the process of identifying and removing possibly irrelevant and redundant features [33], for the purpose of reducing the dimensionality of the data and enabling learning algorithms to operate faster and more effectively.

In our case, the compound dataset, which is in a common format, contains genetic markers for different types of cancer. Each type of cancer may have a different set of features associated with it. To ensure that all the classes of cancer share only common features, the relevant features are selected by the feature subset selection algorithm [34]–[36]. The actual data are read, and the data types of the chosen features are checked. If there are any mismatches in the data types, even when the features referred to are the same, then those features are discarded. After this second filtering step, the relevant features are selected and written to a file along with the associated output labels.

The main aspects of the feature subset selection algorithm are as follows.

Algorithm 1: FEATURE SUBSET SELECTION
  Data: dataFileFeatures[ ], thresholdFileFeatures[ ]
  Result: Common feature set features[ ]
  1  k = 0;
  2  features[ ] = null;
  3  for i from 0 to dataFileFeatures.size()-1 do
  4    for j from 0 to thresholdFileFeatures.size()-1 do
  5      if (dataFileFeatures[i] = thresholdFileFeatures[j]) and
         (getDataType(dataFileFeatures[i]) = getDataType(thresholdFileFeatures[j])) then
  6        features[k++] ← cols1[i];
  7        break;
  8      end
  9    end
  10 end

Explanation of Algorithm 1
(i) For each feature, the feature names and data types are compared between the data file and the threshold file (line 5).
(ii) If they are the same, the feature is added to the feature set (line 6).
(iii) After this filtering step, the relevant features are selected and written to a file along with the associated output labels.

3.4 Redundant Features
A redundancy exists whenever one feature can take the role of another. Feature selection algorithms in general search for the single best feature subset. However, this exhaustive procedure may be too costly and practically prohibitive, even for a medium-sized feature set. Other methods, based on heuristics or random search, attempt to reduce the computational complexity while compromising on performance. Filter methods are independent of the inductive algorithm, whereas wrapper methods use the inductive algorithm as the evaluation function. Stopping criteria [38], [39] can be used, based on the following parameters: distance, information, dependence and consistency.
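The name-and-type intersection performed by Algorithm 1 can be sketched in Python. This is an illustrative stand-in only, not the authors' implementation; the feature names and the type dictionaries are hypothetical:

```python
def feature_subset_selection(data_features, threshold_features,
                             data_types, threshold_types):
    """Keep features present in both files with matching data types,
    mirroring the nested-loop intersection of Algorithm 1."""
    features = []
    for f in data_features:
        for g in threshold_features:
            if f == g and data_types[f] == threshold_types[g]:
                features.append(f)
                break  # match found; move on to the next data-file feature
    return features

# Toy example: 'gene_b' is discarded because its declared types disagree.
data_types = {"gene_a": "numeric", "gene_b": "numeric", "gene_c": "numeric"}
threshold_types = {"gene_a": "numeric", "gene_b": "nominal", "gene_c": "numeric"}
common = feature_subset_selection(["gene_a", "gene_b", "gene_c"],
                                  ["gene_a", "gene_b", "gene_c"],
                                  data_types, threshold_types)
```

A matching name with a mismatched type is dropped, which reproduces the second filtering step described above.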

Fig. 2. Working of the feature subset selection algorithm.

Fig. 3. Percentage of feature reduction achieved.

3.5 Partitioning of Dataset
The dataset in the ARFF format is then segregated into two subsets, the test dataset and the training dataset. The two are later used for the classification of the genetic markers into two different classes.

(i) Training Dataset:
The training dataset contains the input data together with the correct/expected output. This dataset is usually prepared by collecting some data in a semi-automated way. It is important that there be expected output labels associated with the dataset, because it is used for supervised learning. Approximately 60% [4] of the original dataset is to be used for the training dataset.

(ii) Test Dataset:
The remaining data are taken for the test dataset. These are the data that validate the underlying model. For the purpose of cross-validation, the test dataset is in turn divided into further subsets, which can then be used to test the accuracy of the chosen model effectively.

3.6 Centroid Generation
Algorithm 2 calculates the centroid among all the distributed dataset points and stores the coordinates as (centroid_x, centroid_y) points in the space. Random coordinates are generated for each sample row of data, and starting values are assigned to the arrays max_x[ ], min_x[ ], max_y[ ], min_y[ ]. For each sample, the maximum values of the random coordinates are assigned to max_x and max_y, and the minimum values to min_x and min_y. The averages of (min_x, max_x) and (min_y, max_y) are then assigned to centroid_x and centroid_y respectively.

Algorithm 2: CENTROID GENERATION
  Data: x_coordinate[ ], y_coordinate[ ], max_x[ ], max_y[ ], min_x, min_y
  Result: centroid_x, centroid_y
  1  max_x[ ] ← rand_x_coordinate[0];
  2  max_y[ ] ← rand_y_coordinate[0];
  3  min_x[ ] ← rand_x_coordinate[0];
  4  min_y[ ] ← rand_y_coordinate[0];
  5  begin
  6    for i = 1 to SampleDataRowCount-1 do
  7      if rand_x_coordinate[i] ≥ max_x and rand_y_coordinate[i] ≥ max_y then
  8        max_x = rand_x_coordinate[i];
  9        max_y = rand_y_coordinate[i];
  10     end
  11     if rand_x_coordinate[i] < min_x and rand_y_coordinate[i] < min_y then
  12       min_x = rand_x_coordinate[i];
  13       min_y = rand_y_coordinate[i];
  14     end
  15   end
  16   centroid_x = (max_x + min_x)/2;
  17   centroid_y = (max_y + min_y)/2;
  18 end

Explanation of Algorithm 2
(i) Random coordinates are generated for each sample row of data, and starting values are assigned to the arrays max_x[ ], min_x[ ], max_y[ ], min_y[ ] (lines 1 to 4).
(ii) For each sample, the maximum values of the random coordinates are assigned to max_x and max_y, and the minimum values to min_x and min_y (lines 6 to 15).
(iii) The averages of (min_x, max_x) and (min_y, max_y) are assigned to centroid_x and centroid_y respectively (lines 16, 17).
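A minimal Python sketch of Algorithm 2 (illustrative only; note that, following the pseudocode, the extremes are updated only when both the x and y values qualify together):

```python
def centroid_generation(rand_x, rand_y):
    """Track extreme coordinate pairs and average them, as in Algorithm 2.
    Following the pseudocode, an update happens only when BOTH coordinates
    exceed (or fall below) the current extremes."""
    max_x, max_y = rand_x[0], rand_y[0]
    min_x, min_y = rand_x[0], rand_y[0]
    for i in range(1, len(rand_x)):
        if rand_x[i] >= max_x and rand_y[i] >= max_y:
            max_x, max_y = rand_x[i], rand_y[i]
        if rand_x[i] < min_x and rand_y[i] < min_y:
            min_x, min_y = rand_x[i], rand_y[i]
    # The centroid is the midpoint of the tracked extremes.
    return (max_x + min_x) / 2, (max_y + min_y) / 2

cx, cy = centroid_generation([2.0, 8.0, 4.0], [3.0, 9.0, 1.0])
```

With these toy coordinates the extremes are (8, 9) and (2, 3), giving a centroid of (5.0, 6.0).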

Table 2: Achieving feature reduction using the feature subset algorithm

  Type of Cancer     Initial Number of Features   Final Number of Features   % Feature Reduction Achieved
  Breast Cancer      34                           24                         29.41
  Prostate Cancer    48                           26                         45.83
  Lung Cancer        35                           21                         40.00
  Colon Cancer       33                           27                         18.18
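The reduction percentages in Table 2 follow directly from (initial − final) / initial × 100; a quick check in Python, with the table values transcribed from above:

```python
# (initial, final) feature counts per cancer type, from Table 2.
table = {
    "Breast Cancer":   (34, 24),
    "Prostate Cancer": (48, 26),
    "Lung Cancer":     (35, 21),
    "Colon Cancer":    (33, 27),
}
# Percentage of features removed, rounded to two decimals as in the table.
reduction = {name: round((initial - final) / initial * 100, 2)
             for name, (initial, final) in table.items()}
```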

3.7 Intercept Generation
Algorithm 3 generates a random point on either the x-axis or the y-axis, and correspondingly generates its intercept on the alternate axis using the principle of the parallelogram, thereby dividing the dataset into training and testing subsets. After generating the random coordinates rand_x and rand_y, the greater of the two values is picked, assigned to coordinate1, and the flag is set accordingly. Based on the property of similar triangles, the algorithm then finds coordinate2 using the centroid and the known coordinate1.

Algorithm 3: INTERCEPT GENERATION
  Data: rand_x, rand_y, centroid_x, centroid_y
  Result: coordinate1, coordinate2
  1  begin
  2  begin
  3    if rand_x > rand_y then
  4      coordinate1 = rand_x;
  5      flag0 = 1;
  6    else
  7      coordinate1 = rand_y;
  8      flag1 = 1;
  9    end
  10   if flag0 == 1 then
  11     coordinate2 = ((coordinate1 * centroid_y) / centroid_x) - centroid_y;
  12   end
  13   if flag1 == 1 then
  14     coordinate2 = ((coordinate1 * centroid_x) / centroid_y) - centroid_x;
  15   end
  16 end
  17 end

Explanation of Algorithm 3
(i) The random coordinates rand_x and rand_y are generated and compared; the greater value is assigned to coordinate1 and the corresponding flag is set (lines 3 through 9).
(ii) Based on the property of similar triangles, coordinate2 is found using the centroid and the known coordinate1 (lines 10 through 15).
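Algorithm 3 can be sketched in Python as follows (illustrative only; the flags of the pseudocode collapse into a single if/else here):

```python
def intercept_generation(rand_x, rand_y, centroid_x, centroid_y):
    """Pick the larger random value as coordinate1 and derive coordinate2
    on the other axis via the similar-triangles relation of Algorithm 3."""
    if rand_x > rand_y:
        coordinate1 = rand_x
        # Point taken on the x-axis; project through the centroid.
        coordinate2 = (coordinate1 * centroid_y) / centroid_x - centroid_y
    else:
        coordinate1 = rand_y
        # Point taken on the y-axis; symmetric formula.
        coordinate2 = (coordinate1 * centroid_x) / centroid_y - centroid_x
    return coordinate1, coordinate2

c1, c2 = intercept_generation(6.0, 2.0, 3.0, 4.0)
```

Here rand_x wins, so coordinate1 = 6.0 and coordinate2 = (6 × 4)/3 − 4 = 4.0.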


CSI Journal of Computing | Vol. 2 • No. 4, 2015

3.8 Training and testing dataset division
Algorithm 4 counts the points of the training and test datasets on either side of the intercept. The line joining the intercepts coordinate1 and coordinate2 is the boundary for dividing the dataset into training and test subsets. If the count of the coordinates generated during the centroid generation algorithm above this boundary is high, we consider the rows corresponding to these coordinates as the training dataset; the rows corresponding to the rest of the coordinates, below the line, are the test dataset, and vice versa. The next step of Algorithm 4 is writing the test and training dataset values to the corresponding files. The set with the most points is taken as the training dataset and the set with fewer points as the test dataset. The corresponding row values of the indices are stored in training_data[i] and test_data[i], and written to the training_data[i].arff and test_data[i].arff files.

Algorithm 4: TRAINING AND TESTING DATASET DIVISION
  Data: rand_x_coordinate, rand_y_coordinate, coordinate, final_coordinate, SampleDataRowCount
  Result: Sampledataset1[ ], Sampledataset2[ ]
  1  begin
  2  begin
  3    if rand_x > rand_y then
  4      for i = 0 to SampleDataRowCount-1 do
  5        if rand_x_coordinate[i] ≤ coordinate1 and rand_y_coordinate[i] ≤ coordinate2 then
  6          Sampledataset1[j] = i;
  7          count1++;
  8        else
  9          Sampledataset2[k] = i;
  10         count2++;
  11       end
  12     end
  13   else
  14     for i = 0 to SampleDataRowCount-1 do
  15       if rand_x_coordinate[i] ≤ coordinate1 and rand_y_coordinate[i] ≤ coordinate2 then
  16         Sampledataset1[j] = i;
  17         count1++;
  18       else
  19         Sampledataset2[k] = i;
  20         count2++;
  21       end
  22     end
  23   end
     /* If count1 > count2, then Sampledataset1 is the training dataset and Sampledataset2 the test dataset. */
  24 end
  25 end

Explanation of Algorithm 4
(i) The line joining the intercepts coordinate1 and coordinate2 is the boundary for dividing the dataset into training and test subsets.
(ii) If the count of the coordinates above this boundary generated during centroid generation is high, we consider the rows corresponding to these coordinates as the training subset.
(iii) The rows corresponding to the rest of the coordinates, below the line, are the test subset, and vice versa.
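The division step can be sketched as follows (a Python illustration of Algorithm 4; the ARFF file writing is omitted and the coordinate values are hypothetical):

```python
def divide_dataset(rand_x_coord, rand_y_coord, coordinate1, coordinate2):
    """Split row indices by which side of the boundary their random
    coordinates fall on, then label the larger set as training data."""
    below, above = [], []
    for i, (x, y) in enumerate(zip(rand_x_coord, rand_y_coord)):
        if x <= coordinate1 and y <= coordinate2:
            below.append(i)
        else:
            above.append(i)
    # The set with more points becomes the training dataset.
    if len(below) > len(above):
        return below, above  # (training indices, test indices)
    return above, below

train, test = divide_dataset([1.0, 5.0, 2.0, 7.0, 3.0],
                             [1.0, 6.0, 2.0, 8.0, 3.0],
                             coordinate1=4.0, coordinate2=4.0)
```

Rows 0, 2 and 4 fall on or below the boundary and form the larger set, so they become the training indices.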
the euclidean So, the distance value of betweenk has to be chosen with DIVISION the class of genetic markers that are associated with a single Data: rand x coordinate, rand y coordinatethe, test instanceutmost and care.the k-nearest training instances is calculated. type of cancer, and the other dimension represents the genetic coordinate, final coordinate, markersFor that the are problem associated of classifyingwith multiple the types genetic of cancer. markers When a testinto instance two different (which classes,is also thea vector value in of nk must be odd. SampleDataRowCount When a test instance (which is also a vector in n dimensions) is givenOnce theto thek-nearest predictor training model, instances based on are the found, their class Result: Sampledataset1[ ], Sampledataset2[ ] dimensions) is given to the predictor model, based on the value of k, the euclideanassignments distance are used between to predict the test the instance class for test instances 1 begin value of k, the euclidean distance between the test instance and the k-nearest training instances is calculated [14]. 2 begin andusing the weighted k-nearest voting.training instances is calculated [14]. 
3 if rand x>rand y then The euclidean distance (D) between any two points in n The euclidean distance (D) betweenth any two points in n 4 for i=0 toSampleDataRowCount-1 dimensions,do say XAlgorithmi and Xj where 5: MiEANand j CALCULATIONare the i and-Tth RAININGth th dimensions, say Xi and Xj where i and j are the i and the j 5 if rand x coordinate[i] ≤ coordinatethe j point1 inpoint theDATASET feature in the feature space, isspace, given is by:given by: and Input: trainingData[2 ], feature set , no of features, D = (Xi − Xj ) rand y coordinate[i] ≤ coordinate2 no of rows then The choice of theResultThe parameterchoice : actual[ of the ]k - setisparameter very of means critical k of is all asvery the it featurescritical as it 6 Sampledataset1[j]=i; determines thedetermines accuracy1 sum=0; of the the accuracy predictor of model.the predictor The larger model. The larger 7 count1++; the value of k the[14],2 actual[value the smallerof ]=NULL; k [14], is the the smaller effect of is noise the effect on of noise on 8 else classification. Here,classification.3 for “noise”k = 0 to refers noHere,of to features-1 irrelevant“noise” refersdo feature to s irrelevant in features in 8 9 Sampledataset2[k]=i; the dataset. Onthe the4 dataset. otherfor hand,i On= 0 the toif the nootherof value rows-hand, of 1kifdo isthe too value large, of k is too large, Algorithm 4: TRAINING AND TESTING DATASET the predictor model always assigns the dominant class label 10 count2++; the5 predictorsum model = sum always + assigns the dominant class label DIVISION to the test instance. So, the value of k has to be chosen with 11 end to the test instance.trainingData[ So, thei].NumericAttributes[feature value of k has to be chosen withset[ k]]; utmost care. 
12 Data: randendx coordinate, rand y coordinate , utmost6 care.end coordinate, final coordinate, For the problem of classifying the genetic markers 13 else 7 Foractual[ the problemk] = sum of / classifying no of rows; the genetic markers into two different classes, the value of k must be odd. 14 SampleDataRowCountfor i=0 toSampleDataRowCount-1 do into8 endtwo different classes, the value of k must be odd. Once Once the k-nearest training instances are found, their class 15 Result: Sampledataset1[if rand x ],coordinate Sampledataset2[[i] ≤ coordinate ] 1 the k-nearest training instances are found, their class 1 begin and assignments are are used used to to predict predict the the class class for for test test instances instances Explanation of Algorithm 5 2 begin rand y coordinate[i] ≤ coordinate2 using weighted weighted voting. voting. 3 if rand thenx>rand y then (i) This algorithm finds the mean of each feature/attribute 4 in the training dataset. 16 for i=0Sampledataset1[toSampleDataRowCount-1j]=i; do Algorithm 5: MEAN CALCULATION -TRAINING 5 (ii) The sum of all numerical values of every feature is 17 if randcount1++;x coordinate[i] ≤ coordinate1 DATASET found (line 5). 18 andelse Input: trainingData[ ], feature set , no of features, (iii) Then the mean of the feature set is found and stored 19 randSampledataset2[y coordinate[ki]]=≤ icoordinate; 2 no of rows (line 7). 
Algorithm 5: MEAN CALCULATION - TRAINING DATASET
  Input: trainingData[ ], feature_set, no_of_features, no_of_rows
  Result: actual[ ] - set of means of all the features
  1  sum = 0;
  2  actual[ ] = NULL;
  3  for k = 0 to no_of_features-1 do
  4    for i = 0 to no_of_rows-1 do
  5      sum = sum + trainingData[i].NumericAttributes[feature_set[k]];
  6    end
  7    actual[k] = sum / no_of_rows;
  8  end

Explanation of Algorithm 5
(i) This algorithm finds the mean of each feature/attribute in the training dataset.
(ii) The sum of all numerical values of every feature is found (line 5).
(iii) The mean of each feature is then found and stored (line 7).

Algorithm 6: kNN CLASSIFICATION ALGORITHM
  Data: This algorithm finds the k-nearest neighboring samples in the test dataset
  Input: TestData[samples][attributes]
  Result: Returns the mean of the k-nearest neighboring samples of the test data for a particular attribute
  1  calculateEuclideanDistance(predict_attr);
  2  findkNN();
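Algorithms 5 and 6 can be sketched together in Python. This is an illustrative sequential version, not the authors' CUDA-C implementation; it also resets the running sum per feature and excludes the predicted attribute from the distance, details on which the printed pseudocode is ambiguous:

```python
import math

def mean_per_feature(training_data):
    """Column-wise means of the training data (Algorithm 5, with the
    running sum reset for every feature)."""
    n_rows = len(training_data)
    n_cols = len(training_data[0])
    return [sum(row[k] for row in training_data) / n_rows
            for k in range(n_cols)]

def knn_mean(test_data, training_mean, k, predict_attr):
    """Sketch of Algorithm 6: rank test samples by Euclidean distance
    from the training mean, then average the predicted attribute over
    the k nearest samples."""
    def dist(row):
        return math.sqrt(sum((row[j] - training_mean[j]) ** 2
                             for j in range(len(row)) if j != predict_attr))
    nearest = sorted(test_data, key=dist)[:k]
    return sum(row[predict_attr] for row in nearest) / k

train_rows = [[1.0, 2.0], [3.0, 4.0]]
mean = mean_per_feature(train_rows)
pred = knn_mean([[2.0, 10.0], [9.0, 0.0], [2.5, 6.0]],
                mean, k=2, predict_attr=1)
```

The paper parallelizes the distance loop over GPU threads; this sketch keeps the same structure on the CPU.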

Algorithm 4: TRAINING AND TESTING DATASET the predictor model always assigns the dominant class label DIVISION to the test instance. So, the value of k has to be chosen with Data: rand x coordinate, rand y coordinate , utmost care. coordinate, final coordinate, For the problem of classifying the genetic markers SampleDataRowCount into two different classes, the value of k must be odd. Result: Sampledataset1[ ], Sampledataset2[ ] Once the k-nearest training instances are found, their class 1 begin assignments are used to predict the class for test instances 2 begin using weighted voting. 3 if rand x>rand y then 4 for i=0 toSampleDataRowCount-1 do Algorithm 5: MEAN CALCULATION -TRAINING 5 if rand x coordinate[i] ≤ coordinate1 DATASET and Input: trainingData[ ], feature set , no of features, rand y coordinate[i] ≤ coordinate2 no of rows then Result: actual[ ] - set of means of all the features 6 Sampledataset1[j]=i; 1 sum=0; 7 count1++; 2 actual[ ]=NULL; 8 else 3 for k = 0 to no of features-1 do 9 Sampledataset2[k]=i; 4 for i = 0 to no of rows- 1 do 10 count2++; 5 sum = sum + 11 end trainingData[i].NumericAttributes[feature set[k]]; 12 end 6 end 13 else 7 actual[k] = sum / no of rows; 14 for i=0 toSampleDataRowCount-1 do Mahendiran8 end et al. R8 : 75 15 if rand x coordinate[i] ≤ coordinate1 and Explanation of Algorithm 5 rand y coordinate[i] ≤ coordinate2 Explanation of Algorithm 5 algorithm. (i) This algorithm finds the mean of each feature/attribute then (i) This algorithm finds the mean of each feature/attribute (ii) The mean evaluated by kNN algorithm is compared with in the training dataset. the actual mean of the training dataset (line 2). 16 Sampledataset1[j]=i; in the training dataset. (ii) The sum of all numerical values of every feature is 17 count1++; (ii) The sum of all numerical values of every feature is found (iii) If there is less than 40% deviation from the actual mean found (line 5). 18 else (line 5). 
then both means, along with cancer type, are written to (iii) Then the mean of the feature set is found and stored a file (line 3). 19 Sampledataset2[k]=i; (iii) Then the mean of the feature set is found and stored (line (line 7). 20 count2++; 7). (iv) These means represent coordinates for the x − y plot with 21 end x-axis representing actual mean of the training data and y-axis representing the calculated mean of the kNN 22 end Algorithm 6: kNN CLASSIFICATION ALGORITHM algorithm. 23 end Data: This algorithm finds k-nearest neighboring (v) Also the cancer type is used to depict the different types /* If count1 > count2, then samples in the test dataset of cancer in the plot. Sampledataset1 is the training Input: TestData[samples][attributes] dataset and Sampledataset2 the Result: Returns the mean of the k-nearest neighboring 4. Performance and Results test dataset. */ samples of testdata for a particular attribute The model was tested with different cancer datasets in 24 end 1 calculateEuclideanDistance(predict attr); order to compare the performance while using GPU and CPU. 25 end 2 findkNN(); The programs were made to run with varying cancer data 9 files and threshold files. The results obtained are plotted to 9 Explanation of Algorithm 6 Function calculateEuclideanDistance(predict attr) ExplanationExplanation of Algorithm of Algorithm 6 6 obtainFunction the degreecalculateEuclideanDistance(predict of correlation among the cancer datasetattr) and based on the value of k, the euclidean distance between (i) The find kNN function calls the time taken for processing them using both GPU and CPU. 
Function calculateEuclideanDistance(predict_attr)

1 dist = 0;
2 for i = 0 to samples - 1 do
3   dist = 0; for j = 0 to attributes - 1 do
4     if j == predict_attr then
5       dist = dist + square(TestData[i][j] - TrainingMean[j]);
6     end
7   end
8   TestData[i].distance = sqrt(dist);
9 end

Function findkNN

 1 sum = 0;
 2 for every attribute in attributes do
 3   calculateEuclideanDistance(attribute);
 4 end
 5 sort(TestData);
 6 for i = 1 to k do
 7   sum = sum + TestData[i].NumericAttributes[predict_attr];
 8 end
 9 mean = sum/k;
10 return mean;

Explanation of Algorithm 6
(i) The findkNN function calls the calculateEuclideanDistance function for each attribute of the test data (line 3).
(ii) The calculateEuclideanDistance function computes the euclidean distance of the test data from the mean of the training data using the formula D = √( Σ (Xi − Xj)² ) (line 8).
(iii) Then the test data is sorted based on these distance values (line 5).
(iv) The mean of the k nearest neighbors, which is the predicted output value of this algorithm, is found (line 9).
(v) This algorithm has been implemented using CUDA-C, which uses parallel execution of threads.
(vi) Because the calculation of the euclidean distance is exclusive to one particular attribute, we can have a number of threads equal to the number of attributes, which run in parallel in order to find the euclidean distances.
(vii) This strategy improves the running time of the algorithm for large data.

When a test instance (which is also a vector in n dimensions) is given to the predictor model, based on the value of k, the euclidean distance between the test instance and the k-nearest training instances is calculated [14]. The euclidean distance (D) between any two points in n dimensions, say Xi and Xj, where i and j are the i-th and j-th points in the feature space, is given by:

D = √( Σ (Xi − Xj)² )

The choice of the parameter k is very critical, as it determines the accuracy of the predictor model. The larger the value of k [14], the smaller the effect of noise on classification; here, "noise" refers to irrelevant features in the dataset. On the other hand, if the value of k is too large, the predictor model is biased towards the dominant class.

Algorithm 7: CORRECTNESS OF kNN

Input: TestData, k_value, actualMean
Result: Verifies the efficiency of the prediction of the kNN algorithm and writes the data coordinates to an output file

1 kMean = kNN(TestData, k_value);
2 if (|actualMean − kMean|)/actualMean < 0.4 then
3   writeToFile(outputfile, result, actual, cancer_type);
    /* Writes kMean, actualMean, cancer_type to the output file */
4 end
Explanation of the correctness of Algorithm 7
(i) This algorithm evaluates the efficacy of the kNN algorithm.
(ii) The mean evaluated by the kNN algorithm is compared with the actual mean of the training dataset (line 2).
(iii) If there is less than 40% deviation from the actual mean, then both means, along with the cancer type, are written to a file (line 3).
(iv) These means represent coordinates for the x-y plot, with the x-axis representing the actual mean of the training data and the y-axis representing the mean calculated by the kNN algorithm.
(v) Also, the cancer type is used to depict the different types of cancer in the plot.

CSI Journal of Computing | Vol. 2 • No. 4, 2015
4. Performance and Results

The model was tested with different cancer datasets in order to compare the performance while using GPU and CPU. The programs were made to run with varying cancer data files and threshold files. The results obtained are plotted to obtain the degree of correlation among the cancer datasets and the time taken for processing them using both GPU and CPU. Figure 4 and Table 4 show the performance comparison of the developed model on both CPU and GPU processors, with a varying number of cancer data records.

Fig. 4 : Performance Comparison – GPU vs CPU

It is clear that the performance of the GPU is better than that of the CPU as the number of records increases. The result of the classification produced by the kNN algorithm is plotted as a bar diagram. The output suggests the presence of two different classes. One class represents the genetic markers that can cause multiple types of cancer, and the other is a class that represents the possibility of multiple genetic markers from different cancer datasets having the potential, in combination, to cause a single type of cancer.

Cross-Referencing Cancer Data Using GPU for Multinomial Classification of Genetic Markers (R8 : 76)

The genetic markers that do not belong to either class are outliers, i.e., they are not significant enough, given the data available, to be correlated with any type of cancer. So the benefit of this classification scheme is to predict the possibility of a new cancer in a patient if the multiple genetic markers in combination have the potential to cross the threshold value of a new type of cancer.

Table 3 represents the different classes under which the genetic markers fall. Class label 0 shows that a marker is an outlier, 1 shows that the genetic markers are associated with a single type of cancer, and 2 that they are associated with multiple types of cancer. Thus the correlations between the genetic markers and the cancers could be easily established in this process. The main point to note from Table 3 is that almost all types of cancer data tested by this system yield similar results: the percentage of genetic markers associated with multiple cancers is higher than both the percentage of multiple genetic markers causing single types of cancer and the percentage of outliers. The percentage of outliers is quite low, indicating the efficacy of the system.

Table 4 illustrates the comparison of the performance of the proposed algorithm on CPU and GPU. In a GPU, many cores work in parallel, so the power of parallel programming is exploited to a greater extent.

The main intention of this system is not only to classify the genetic markers into two classes, but also to speed up the entire process of classification. This is achieved using the GPU: in the case of a CPU, as the number of records increases, the time taken (in seconds) also increases exponentially. This is a significant finding, as the cancer dataset to be processed might easily run into the thousands. The speed could be improved by using the GPU, which takes linear time to complete the process of classification for a large number of records. This is seen in Fig. 4, which shows that the time taken by the CPU to process the records grows almost exponentially with the increase in the number of records, while the time taken by the GPU seems to remain almost constant. (It actually rises slowly, linearly, but the rise is not visible in the graph.) The main point behind the comparison of the performance of CPU and GPU is to demonstrate the power of parallel processing for the classification algorithm, i.e., that it could help in reducing the time complexity involved in solving the problem.

Fig. 5, pertaining to the entire cancer dataset classification scheme, is thus generated. The main objective is to demonstrate the percentage of genetic markers that are responsible for causing multiple cancers in the given dataset and the percentage of multiple genetic markers that, in combination, cause single types of cancer. The outliers may be a result of wrong classification, but they are significantly few, which indicates that the percentage error resulting from applying the proposed system is quite low. The significance of the plot in Fig. 5 is as follows.

Fig. 5 : Output Plot – Classification of Genetic Markers

(i) The two tall bars in the graph represent two different types of cancer, whereas the short bar represents the outlier, i.e., the wrong classification.
(ii) The two tall bars represent the genetic markers associated with each type of cancer.
(iii) It is evident from the graph that the bar indicating markers associated with multiple cancers is the highest in the graph. So, for our given dataset covering four types of cancer, there are more genetic markers with the ability to cause multiple cancers.
(iv) The genetic markers that are wrongly classified, lying outside the class boundaries, are outliers in the data, i.e., they are not significant enough to be correlated with any type of cancer.
(v) The rest are the genetic markers that cause only one type of cancer. This single type of cancer mostly comes from the type of cancer dataset that the given genetic marker belongs to.

5. Conclusion

Cross-referencing genetic markers for cancer using kNN has been developed on an Nvidia GeForce GPU. The power of parallel programming has been exploited by the use of a GPU, where the performance of the predictor model can be improved considerably by the simultaneous execution of multiple threads, as compared to the much slower CPU execution.
This current model works only on textual information, but can be further enhanced to process information from image data or normalized micro-array data.

There are possible future preventive, diagnostic, or therapeutic applications of this work, and such correlations among genetic markers can also, of course, be attempted for diseases other than cancers.

More immediately, it surely would be helpful to identify those genetic markers that could potentially have higher chances of causing multiple types of cancers if they are present in the human body, and to correlate the same with other risk factors; likewise, a person with multiple genetic markers associated with a specific cancer may well be at higher risk than someone with just one or none of those markers. Such and other uses of the analytical and computational tools we have suggested could provide a new weapon in the fight against dangerous diseases that take millions of lives yearly.

Acknowledgments

The authors would like to thank Bhavani Kakarla and Sana Javed for their help in gathering and analyzing data from various sources.
Table 3 : Association of genetic markers with single or multiple cancers

Type of Cancer   | Genetic Markers associated with Single Cancer (%) | Genetic Markers associated with Multiple Cancers (%) | % of Outliers
Breast Cancer    | 33.33 | 59.45 | 7.22
Prostate Cancer  | 45.35 | 49.23 | 5.42
Lung Cancer      | 38.48 | 53.33 | 8.19
Colon Cancer     | 43.27 | 49.68 | 7.05

Table 4 : Performance Comparison Between CPU and GPU

S.No | Number of Records | Time Taken in CPU (secs) | Time Taken in GPU (secs)
1.   | 400  | 5.852  | 1.515
2.   | 600  | 14.166 | 1.522
3.   | 800  | 26.445 | 1.54
4.   | 1000 | 38.962 | 1.562
5.   | 1200 | 52.775 | 1.568
6.   | 1400 | 78     | 1.579

References

[1] R. L. Strausberg, "Cancer genome anatomy project," eLS, 1999.
[2] T. Barrett, D. B. Troup, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman et al., "NCBI GEO: archive for functional genomics data sets – 10 years on," Nucleic Acids Research, vol. 39, no. suppl 1, pp. D1005–D1010, 2011.
[3] J. A. Cruz and D. S. Wishart, "Applications of machine learning in cancer prediction and prognosis," Cancer Informatics, vol. 2, p. 59, 2006.
[4] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: one-sided selection," in ICML '97: Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 179–186.
[5] O. M. Dorgham, S. D. Laycock, and M. H. Fisher, "GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration," IEEE Trans. Biomed. Eng., vol. 59, no. 9, pp. 2594–2603, Sep. 2012, doi:10.1109/TBME.2012.2207898.
[6] S. Al-Kiswany, A. Gharaibeh, and M. Ripeanu, "GPUs as storage system accelerators," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 8, pp. 1556–1566, Aug. 2013, doi:10.1109/TPDS.2012.239.
[7] J. W. Choe, A. Nikoozadeh, Ö. Oralkan, and B. T. Khuri-Yakub, "GPU-based real-time volumetric ultrasound image reconstruction for a ring array," IEEE Trans. Med. Imag., vol. 32, no. 7, pp. 1258–1264, Jul. 2013, doi:10.1109/TMI.2013.2253117.
[8] C.-H. Lin, C.-H. Liu, L.-S. Chien, and S.-C. Chang, "Accelerating pattern matching using a novel parallel algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906–1916, Oct. 2013, doi:10.1109/TC.2012.254.
[9] H. P. Huynh, A. Hagiescu, O. Z. Liang, W.-F. Wong, and R. S. M. Goh, "Mapping streaming applications onto GPU systems," IEEE Trans. Parallel Distrib. Syst., to appear, 2014, doi:10.1109/TPDS.2013.195.
[10] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," rn (A+B), vol. 21, pp. 1–7, 2011.
[11] H. Cheng, J. Shan, W. Ju, Y. Guo, and L. Zhang, "Automated breast cancer detection and classification using ultrasound images: A survey," Pattern Recognition, vol. 43, no. 1, pp. 299–317, 2010.
[12] L. Wang, J. Zhang, L. H. Schwartz, H. Eisenberg, N. M. Ishill, C. S. Moskowitz, P. Scardino, and H. Hricak, "Incremental value of multiplanar cross-referencing for prostate cancer staging with endorectal MRI," American Journal of Roentgenology, vol. 188, no. 1, pp. 99–104, 2007.
[13] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 24, no. 14, pp. 2195–2207, 2003.
[14] W.-J. Hwang and K.-W. Wen, "Fast kNN classification algorithm based on partial distance search," Electronics Letters, vol. 34, no. 21, pp. 2062–2063, 1998.
[15] G. Parmigiani, E. S. Garrett-Mayer, R. Anbazhagan, and E. Gabrielson, "A cross-study comparison of gene expression studies for the molecular classification of lung cancer," Clinical Cancer Research, vol. 10, no. 9, pp. 2922–2927, 2004.
[16] M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J. P. Mesirov, "GenePattern 2.0," Nature Genetics, vol. 38, no. 5, pp. 500–501, 2006.
[17] J. Hubble, J. Demeter, H. Jin, M. Mao, M. Nitzberg, T. Reddy, F. Wymore, Z. K. Zachariah, G. Sherlock, and C. A. Ball, "Implementation of GenePattern within the Stanford microarray database," Nucleic Acids Research, vol. 37, no. suppl 1, pp. D898–D901, 2009.
[18] J. Addeh and A. Ebrahimzadeh, "Breast cancer recognition using
2013, doi:10.1109/TMI.2013.2253117. [18] J. Addeh and A. Ebrahimzadeh, “Breast cancer recognition using Cross-Referencing Cancer Data Using GPU for R8 : 78 Multinomial Classification of Genetic Markers

[9] H. P. Huynh, A. Hagiescu, O. Z. Liang, W.-F. Wong, and R. S. M. Goh, “Mapping streaming applications onto GPU systems,” IEEE Trans. Parallel Distrib. Syst., to appear 2014, doi:10.1109/TPDS.2013.195.
[10] N. Whitehead and A. Fit-Florea, “Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs,” rn (A+B), vol. 21, pp. 1–7, 2011.
[11] H. Cheng, J. Shan, W. Ju, Y. Guo, and L. Zhang, “Automated breast cancer detection and classification using ultrasound images: A survey,” Pattern Recognition, vol. 43, no. 1, pp. 299–317, 2010.
[12] L. Wang, J. Zhang, L. H. Schwartz, H. Eisenberg, N. M. Ishill, C. S. Moskowitz, P. Scardino, and H. Hricak, “Incremental value of multiplanar cross-referencing for prostate cancer staging with endorectal MRI,” American Journal of Roentgenology, vol. 188, no. 1, pp. 99–104, 2007.
[13] J. Abonyi and F. Szeifert, “Supervised fuzzy clustering for the identification of fuzzy classifiers,” Pattern Recognition Letters, vol. 24, no. 14, pp. 2195–2207, 2003.
[14] W.-J. Hwang and K.-W. Wen, “Fast kNN classification algorithm based on partial distance search,” Electronics Letters, vol. 34, no. 21, pp. 2062–2063, 1998.
[15] G. Parmigiani, E. S. Garrett-Mayer, R. Anbazhagan, and E. Gabrielson, “A cross-study comparison of gene expression studies for the molecular classification of lung cancer,” Clinical Cancer Research, vol. 10, no. 9, pp. 2922–2927, 2004.
[16] M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J. P. Mesirov, “GenePattern 2.0,” Nature Genetics, vol. 38, no. 5, pp. 500–501, 2006.
[17] J. Hubble, J. Demeter, H. Jin, M. Mao, M. Nitzberg, T. Reddy, F. Wymore, Z. K. Zachariah, G. Sherlock, and C. A. Ball, “Implementation of GenePattern within the Stanford microarray database,” Nucleic Acids Research, vol. 37, no. suppl 1, pp. D898–D901, 2009.
[18] J. Addeh and A. Ebrahimzadeh, “Breast cancer recognition using a novel hybrid intelligent method,” Journal of Medical Signals and Sensors, vol. 2, no. 2, p. 95, 2012.
[19] A. Marcano-Cedeño, J. Quintanilla-Domínguez, and D. Andina, “WBCD breast cancer database classification applying artificial metaplasticity neural network,” Expert Systems with Applications, vol. 38, no. 8, pp. 9573–9579, 2011.
[20] H.-L. Chen, B. Yang, J. Liu, and D.-Y. Liu, “A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis,” Expert Systems with Applications, vol. 38, no. 7, pp. 9014–9022, 2011.
[21] P. Langley and S. Sage, “Induction of selective Bayesian classifiers,” in Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1994, pp. 399–406.
[22] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A unified graphics and computing architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[23] “Princeton University - colon cancer datasets.” [Online]. Available: http://tinyurl.com/nfk6jmb
[24] “Broad Institute - lung cancer datasets.” [Online]. Available: http://tinyurl.com/pgow379
[25] “Broad Institute - cancer datasets.” [Online]. Available: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
[26] W. A. Fry, H. R. Menck, and D. P. Winchester, “The national cancer data base report on lung cancer,” Cancer, vol. 77, no. 9, pp. 1947–1955, 1996.
[27] “Attribute-Relation File Format (ARFF),” Nov. 2008. [Online]. Available: http://tinyurl.com/bn8lxp
[28] C.-M. Teng, “Correcting noisy data,” in ICML ’99: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 239–248.
[29] K. Lakshminarayan, S. A. Harp, and T. Samad, “Imputation of missing data in industrial databases,” Applied Intelligence, vol. 11, no. 3, pp. 259–275, 1999.
[30] V. Ganti and A. D. Sarma, Data Cleaning: A Practical Perspective. Morgan & Claypool Publishers, 2013, vol. 5, no. 3.
[31] A. Jain and D. Zongker, “Feature selection: Evaluation, application, and small sample performance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153–158, 1997.
[32] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,” in ICML ’01: Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 74–81.
[33] B. Pfahringer, “Compression-based discretization of continuous attributes,” in ICML ’95: Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 456–463.
[34] M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.
[35] J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” in Feature Extraction, Construction and Selection. Springer, 1998, pp. 117–136.
[36] P. Somol, P. Pudil, J. Novovičová, and P. Paclík, “Adaptive floating search methods in feature selection,” Pattern Recognition Letters, vol. 20, no. 11, pp. 1157–1163, 1999.
[37] G. H. John, R. Kohavi, K. Pfleger et al., “Irrelevant features and the subset selection problem,” in ICML ’94: Proceedings of the 11th International Conference on Machine Learning, 1994, pp. 121–129.
[38] H. Liu, F. Hussain, C. L. Tan, and M. Dash, “Discretization: An enabling technique,” Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 393–423, 2002.
[39] M. Boulle, “Khiops: A statistical discretization method of continuous attributes,” Machine Learning, vol. 55, no. 1, pp. 53–69, 2004.
[40] J. M. Jerez-Aragonés, J. A. Gómez-Ruiz, G. Ramos-Jiménez, J. Muñoz-Pérez, and E. Alba-Conejo, “A combined neural network and decision trees model for prognosis of breast cancer relapse,” Artificial Intelligence in Medicine, vol. 27, no. 1, pp. 45–63, 2003.
[41] O. Okun and H. Priisalu, “Ensembles of nearest neighbors for cancer classification using gene expression data,” in Proc. of the Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (in conjunction with 3rd Iberian Conf. on Pattern Recognition and Image Analysis, IbPRIA 2007), Girona, 2007, pp. 57–71.
[42] Z. Zheng, “Constructing x-of-n attributes for decision tree learning,” Machine Learning, vol. 40, no. 1, pp. 35–75, 2000.


About the Authors

Abinaya Mahendiran received her Bachelor of Technology degree in information technology from Amrita School of Engineering, India, and her Master of Technology degree in information technology from IIIT-Bangalore. She has been with Cognizant Technology Solutions and is presently with Cisco Systems. Her research interest is in machine learning, with a particular focus on deep learning.

Srivatsan Sridharan received his Bachelor of Engineering degree in computer science and engineering from Sri SaiRam Engineering College, affiliated with Anna University, Chennai, India and his Master of Technology degree in information technology from the International Institute of Information Technology - Bangalore (IIIT-Bangalore), a graduate school of information technology in Bangalore, India. He is presently with Cisco Systems.

Sushanth Bhat received his Master of Technology degree in information technology from IIIT-Bangalore, and his Bachelor of Engineering degree in computer science from NMAM Institute of Technology, Nitte, India. He has been with Cisco Systems, and is now with e-commerce major Flipkart.com.

Shrisha Rao received his Ph.D. in computer science from the University of Iowa, and before that his M.S. in logic and computation from Carnegie Mellon University. He is an associate professor at IIIT-Bangalore. His research interests are in distributed computing, specifically algorithms and approaches for concurrent and distributed systems, and include solar energy and microgrids, cloud computing, energy-aware computing (“green IT”), and demand-side resource management. Dr. Rao is a senior member of the IEEE, and a member of the ACM, the American Mathematical Society, and the Computer Society of India.


Letter to the Editor

CORRECTIONS

My paper “Bidirectional Heuristic Search based on Error Estimate” (Article 7) was published in your CSI Journal of Computing Vol. 2, No. 1-2, 2013. I observe that the assumption “(iv) h1(s) = h2(t)” may be removed from Section 2. Removing this assumption will not materially change any examples, theorems, proofs, algorithm, results or conclusions of the paper. In the original paper, I assumed a constant h0 = h1(s) (= h2(t)). Now, I have taken two constants h01 and h02 such that h01 = h1(s) and h02 = h2(t). This change will enhance the algorithm and the paper. I am requesting you to accommodate such changes if possible. It will have great value to me.

The following changes are needed in the original paper.

In Section 2:

(Page S1:59, left column) In Definition 2.1: Replace “h0” with “h1(s)” in two places.

(Page S1:59, right column) In Definition 2.2: Replace “h0” with “h2(t)” in two places.

(Page S1:59, right column) Under “we assume the following”: Remove “(iv) h1(s) = h2(t)”.

(Page S1:60, left column) In Table 1: Replace “h0” with “½{h1(s) + h2(t)}”.

(Page S1:60, left column) In Theorem 2.1: Replace the line “h0 = h1(s) (= h2(t))” by two lines as follows:
“h01 = h1(s)”
“h02 = h2(t)”
Replace “2{c(P) – h0}” with “2c(P) – {h1(s) + h2(t)}”.

(Page S1:60, left column) In Corollary 2.1: Replace “h0” with “½{h1(s) + h2(t)}”.

(Page S1:60, right column) Below Fig. 3: Replace “h0” with “h1(s)” in three places, in three separate lines.

In Section 3:

(Page S1:61, left column) In the third line of the algorithm BAE*: Replace “h0 ← h1(s) (= h2(t))”, between “Set” and the comma, with “h01 ← h1(s), h02 ← h2(t)”.

In Step 3 of the algorithm, in the last but one line: Replace “h0” with “(h01 + h02)/2”.

In Step 5 of the algorithm, in the last but one line: Replace “h0” with “h0d”.

In Step 6 of the algorithm, in the last line: Replace “h0” with “h0d”.

(Page S1:61, right column) In the two lines above the last line: Replace “h0” with “h0d” in two places.

(Page S1:62, right column) In Definition 3.3: Replace “2g1(n) + h1(n) – h2(n) ≤ 2h1*(s) – h1(s)” with “TE1(n) ≤ TE1*(t)”.

(Page S1:62, right column) In Definition 3.4: Replace “2g2(n) + h2(n) – h1(n) ≤ 2h2*(t) – h2(t)” with “TE2(n) ≤ TE2*(s)”.

(Page S1:62, right column) In the last but two lines: Replace “h0” with “h0d”.

In the Glossary of Notation:

(Page S1:64, right column) In the Glossary of Notation, in the last but eight lines: Replace “FE1(t) = BE1(t) = FE2(s) = BE2(s)” by two lines as follows:
“FE1(t) = BE2(s)”
“FE2(s) = BE1(t)”

Samir K Sadhukhan
Indian Institute of Management Calcutta
Kolkata 700104, India, [email protected]
19/8/2014
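The substitutions requested above fit together arithmetically: with h01 = h1(s) and h02 = h2(t), replacing the single constant h0 by the average (h01 + h02)/2 turns the original Theorem 2.1 bound into the revised form, as the following one-line check shows:

```latex
h_0 = \tfrac{1}{2}\bigl\{h_1(s) + h_2(t)\bigr\}
\quad\Longrightarrow\quad
2\bigl\{c(P) - h_0\bigr\} = 2c(P) - \bigl\{h_1(s) + h_2(t)\bigr\}.
```

The same averaging appears in Step 3 of BAE*, where “h0” becomes “(h01 + h02)/2”.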


Journal of Computing Best Paper Award

The best paper in each volume of the CSI Journal of Computing will be given an award at one of the CSI-sponsored conferences. Details will follow.

CSI Journal of COMPUTING

Contents

Article R6   Impact, Team Size, Collaboration, and Epochs: Stories from Four Decades of Software Engineering Research   Manju V. C. and Dr. Sasi Kumar   8 Pages

Article R7   An Efficient Approach for Storing and Accessing Huge Number of Small Files in HDFS   Shyamli Rao & Amit Joshi   6 Pages

Article R8   Cross-Referencing Cancer Data Using GPU for Multinomial Classification of Genetic Markers   Srivatsan Sridharan, Abinaya Mahendiran, Sushanth Bhat and Shrisha Rao   13 Pages

Letter to the Editor

Published by Mr. Suchit Gogwekar for COMPUTER SOCIETY OF INDIA™ at Unit No. 3, 4th Floor, Samruddhi Venture Park, MIDC, Andheri (E), Mumbai-400 093. Tel.: 022-2926 1700 Fax: 022-2830 2133 Email: [email protected]