Cybersecurity Through Nimble Task Allocation: Semantic Workflow Reasoning for Mission-Centered Network Models

Yolanda Gil

Information Sciences Institute University of Southern California

http://www.isi.edu/~gil

USC Information Sciences Institute, Yolanda Gil

Knowledge Technologies at ISI: From Basic Research to Transition

[Timeline figure: sponsors, start years, and projects]

AFOSR (2000): TRELLIS: Basic research on information analysis capture
DARPA (2005): JIST: Capturing decisions and evidence in intelligence analysis
DOD: Transition to IC’s RDEC experimental platform

NSF (2002): Semantic workflows for large-scale scientific data analysis
AFRL (2008): Windward: Scalable Knowledge Discovery through Distributed Analysis
DOD: Transition to IC’s RDEC experimental platform

AFOSR (2005): MathTrust: Basic research on social trust of open information sources
W3C (2010): Provenance Incubator leading to standards for provenance on the web

AFOSR (2011): Mission-Centered Network Models

W3C Provenance Standard

http://www.w3.org/2005/Incubator/prov/wiki

▪ Need a standard representation for the origins of information
  • Trust in open systems (Web), computational science, process validation and compliance
▪ W3C Provenance Incubator launched in 2009 (Y. Gil, Chair)
  • Collected use cases for provenance
  • Developed provenance requirements for the Web
  • Created mappings for existing provenance vocabularies
  • State-of-the-art report
  • Provenance in Web architecture
▪ Final report in Dec 2010 proposed a charter for a Working Group on provenance
▪ WG started 4/2011 to develop the PROV standard, released final request for comments 8/2012

W3C PROV Standard (2012)

▪ PROV Primer
▪ PROV Ontology
▪ PROV Data Model
▪ PROV Notation
▪ PROV Constraints
▪ PROV Access and Query

New Award (2011-2015): “Cybersecurity Through Nimble Task Allocation: Semantic Workflow Reasoning for Mission-Centered Network Models”

▪ Problem: Existing cybersecurity measures for network intrusion follow a similar pattern
  • Discover intrusion -> Find vulnerability -> Fix problem -> Repeat
  • Unclear that these approaches can lead to secure systems
▪ Approach: Accomplish the mission assuming the network is compromised and subject to deception
  • Focus on protecting the mission, rather than the network
▪ Key Idea: Mission-Centered Network Models
  • Semantic workflows represent the mission
    – High-level tasks + goals + constraints
  • Reasoning to specialize and map the workflow onto the network
▪ Benefits: Nimble task allocation implies unpredictability to the opponent

Problem Addressed

▪ Existing cybersecurity measures for network intrusion follow a similar pattern
  • Discover intrusion -> Find vulnerability -> Fix problem -> Repeat
▪ Vast amounts of communications occur throughout the network
  • Network models focus on the physical and logical levels
  • No models describe high-level information flow

Key Idea: Semantic Workflows Represent Missions and their Task Decompositions

▪ Workflows are important to model because they represent task-driven information flow within the network
  • Workflows represent how information is derived by retrieving and processing data obtained from distributed sources
▪ Workflows arise from:
  • Predefined SOPs that reflect common information processing tasks by users
  • Dynamic creation by users using mashup construction tools
  • Analysis of actual network traffic data

Related Work: Reconstructing Flows

▪ Extracting application use from network flow [Labovitz et al 2010] [Maier et al 2009] [Fukuda et al 2005] [Bartlett et al 2007]
  • Classifying traffic based on:
    – Port use
    – Packet signatures based on packet payload
    – Blind techniques based on application communication patterns
  • Short-lived flows, low-rate flow periodicity, service discovery, etc.
▪ Program slicing to reconstruct causal difference graphs [Johnson et al 2011]

Mission-Centered Network Models

Mission: M = {G, W, R, π}
  • G: goals
  • W: semantic workflows
  • R: resources
  • π: policies

Network = {L, P, C}
  • L: logical resources
  • P: physical resources
  • C: connections

[Figure: layered diagram. Layers: Mission at time T (Tasks as Workflows; Results & Provenance); Task Layer (Workflow Fragments, Dynamic Policies, Workflow Provenance); Logical Layer (Software Components, Data Sources); Physical Layer. Arrow labels between layers: Composed of, Trusted because, Assembled from, Related through, Executed in, Populated by.]
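The two tuples on this slide can be written down as plain data structures. The following is a minimal sketch, assuming Python dataclasses; the field types, the reading of C as (logical, physical) pairs, the helper `hosts_for`, and all example values are illustrative assumptions and do not come from the original model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """Network = {L, P, C}; the contents below are illustrative assumptions."""
    logical: frozenset      # L: logical resources
    physical: frozenset     # P: physical resources
    connections: frozenset  # C: assumed here to be (logical, physical) pairs


@dataclass(frozen=True)
class Mission:
    """Mission M = {G, W, R, pi} at a given time T."""
    goals: tuple          # G: goals
    workflows: tuple      # W: semantic workflows (names only in this sketch)
    resources: frozenset  # R: resources the mission relies on
    policies: tuple       # pi: policies


def hosts_for(network, logical_resource):
    """Return the physical resources connected to one logical resource."""
    return sorted(
        physical
        for (logical, physical) in network.connections
        if logical == logical_resource
    )


# Hypothetical example values.
net = Network(
    logical=frozenset({"geo-service", "schedule-db"}),
    physical=frozenset({"host-a", "host-b", "host-c"}),
    connections=frozenset({
        ("geo-service", "host-a"),
        ("geo-service", "host-b"),
        ("schedule-db", "host-c"),
    }),
)
mission = Mission(
    goals=("Airlift all patients to specified hospitals",),
    workflows=("airlift",),
    resources=frozenset({"geo-service", "schedule-db"}),
    policies=("prefer-uncompromised-hosts",),
)

print(hosts_for(net, "geo-service"))
```

Keeping the mission and the network as separate objects mirrors the slide: the same mission can be re-mapped onto a different network state without being redefined.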

Semantic Workflows to Executable Workflows

▪ Four stages of workflow elaboration:
  1. Workflow template: M = {G, W, R, π}
  2. Workflow instance: WI = f(M, D)
  3. Execution-ready workflow (map to L): ERW = f(WI, π, L)
  4. Executable workflow (map to P): EW = g(ERW, π, P)

[Figure: the layered Mission-Centered Network Model diagram (Mission at time T, Task Layer, Logical Layer, Physical Layer)]

Example: Airlift Workflow (Adapted from [Burstein et al 2008])

Goal: Airlift all patients to specified hospitals

[Figure: airlift workflow diagram. Recoverable labels are listed below.]

Workflow: Specify Flight Parameters; Satisfy Flight Constraints
  • Find Airport POE (Port of Embarkation) near to passengers (PAX)
  • Find Airport POD (Port of Debarkation) near to hospitals
  • Find planes to carry passengers between POE and POD during the specified time period
  • Request landing time at POE
  • Request landing time at POD
  • Create Scheduled Task: Fly plane P to POE, pickup at POE
  • Create Scheduled Task: Fly plane P to delivery at POD
  • Control-flow labels: Repeat for each PAX destination; Repeat while more PAX destinations; Fail to obtain permission

Data flow: Identify passenger location (Lat/Lon); Identify nearest APT by incrementally expanding region

Logical Resources: Resource requests to airlift units; Requests to airport admins; Assert flights in schedule DB

Physical Resources: Query to geographic server; Queries to AIRPORTS server; Resource requests to each airlift unit; Permission requests to airport admin services; Assert flights to schedule DB

Protect Mission While Network is Under Attack

[Figure: the airlift workflow diagram repeated, with three attack markers (A) at the physical resources level]

Effects of Attack

[Figure: the logical and physical resource rows of the airlift workflow, annotated with attack points]

▪ A1: Assess trust in this already executed task
  • Check provenance records
▪ A2: Attack does not affect the physical resource selected
  • Proceed as planned
▪ A3: Attack affects the physical resource selected
  • Reassign task to an alternative physical resource
▪ A4: Has not occurred yet, but the resource is critical
  • Replicate resource, ensure protection of resource

Research Challenges (I): Robust Missions through MCNMs

▪ Ability to trust already executed tasks
  • Analyze provenance record, resubmit if needed
▪ Ability to determine which resources are critical for accomplishing a given task within a mission
  • Determine tasks that have limited mapping choices
▪ Ability to control how a task is mapped to physical and logical resources
  • Policy-based resource allocation algorithms
▪ Ability to manage a task in a network under threat and reassign it dynamically to a different set of resources
  • Adaptively change the set of mapping policies π
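The four attack cases (A1 to A4) and the challenges listed here suggest a simple per-task decision rule. The sketch below is one possible reading, not the project's algorithm: the function name `triage`, the action labels, and the rule that a task with a single candidate resource counts as critical are all assumptions made for illustration.

```python
def triage(task, selected, candidates, compromised):
    """Suggest a response for one task, loosely following cases A1 to A4.

    task        -- dict with 'name' and 'executed' (bool)
    selected    -- physical resource currently chosen for the task
    candidates  -- all physical resources the mapping policies allow
    compromised -- set of resources believed to be under attack
    """
    healthy = [r for r in candidates if r not in compromised]
    if task["executed"]:
        # A1: the task already ran, so its output has to be re-assessed.
        if selected in compromised:
            return ("check-provenance", selected)
        return ("no-action", selected)
    if selected in compromised:
        # A3: move the task if the policies leave any alternative.
        if healthy:
            return ("reassign", healthy[0])
        return ("blocked", None)
    if len(candidates) == 1:
        # A4: no attack yet, but a single mapping choice makes it critical.
        return ("replicate-and-protect", selected)
    # A2: the attack does not touch the selected resource.
    return ("proceed", selected)


# Hypothetical tasks and hosts.
pending = {"name": "request-landing-time", "executed": False}
done = {"name": "find-airport", "executed": True}

print(triage(done, "h1", ["h1", "h2"], {"h1"}))
print(triage(pending, "h1", ["h1", "h2"], {"h3"}))
print(triage(pending, "h1", ["h1", "h2"], {"h1"}))
print(triage(pending, "h1", ["h1"], set()))
```

Counting the candidates per task is a direct, if crude, way to "determine tasks that have limited mapping choices"; a fuller treatment would also weigh how important each task is to the mission.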

Research Challenges (II)

▪ Ability to prioritize the use of uncompromised resources based on the criticality of mission tasks
▪ Ability to detect intrusion and deception
  • Generate n mappings for a given workflow (with π and random assignments), compare results through provenance records
▪ Ability to proactively detect high-level patterns of deception
  • Orchestrate simulated tasks mirroring real missions and observe
▪ Ability to predict the practical impact on the mission of deception activities in specific (L and P) resources
  • Can the workflow be mapped given the mission requirements?
▪ Ability to measure mission accomplishment based on the level of trust in task outcomes when the network is compromised
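The idea of generating n mappings and comparing their results can be illustrated with a small redundancy check. This is a sketch under stated assumptions: the functions `run_redundantly`, `suspicious_hosts`, and `fake_execute` are hypothetical, the "provenance record" is reduced to a host-to-result dictionary, and majority agreement is used as the comparison rule, which the slides do not specify.

```python
import random
from collections import Counter


def run_redundantly(task, hosts, n, execute, rng):
    """Run `task` on n randomly chosen distinct hosts.

    Returns a minimal provenance record: host -> result produced there.
    """
    chosen = rng.sample(sorted(hosts), n)
    return {host: execute(task, host) for host in chosen}


def suspicious_hosts(records):
    """Return hosts whose result disagrees with the most common result."""
    majority, _count = Counter(records.values()).most_common(1)[0]
    return sorted(h for h, result in records.items() if result != majority)


def fake_execute(task, host):
    # Stand-in for real execution: host "h3" returns a tampered answer.
    return "tampered" if host == "h3" else "airport-1"


records = run_redundantly(
    "find-airport", {"h1", "h2", "h3", "h4"}, 4, fake_execute, random.Random(0)
)
print(suspicious_hosts(records))
```

Majority voting only helps while most of the sampled resources are uncompromised, and a tie is resolved arbitrarily here; both points would need a more careful policy in practice.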

Our Prior Work: From Workflow Templates to Executable Workflows [Kumar et al 09]

[Figure, two panels: workflow template for image processing; executable workflow for 2560x2400 pixels]

Our Prior Work: Wings/Pegasus/Condor Workflows for Seismic Hazard Analysis [Gil et al 07]

▪ Input data: a site and an earthquake forecast model
  • Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
  • ~110,000 rupture variations to be simulated for that site
▪ High-level template combines 11 application codes
▪ 8048 application nodes in the workflow instance generated by Wings
▪ Provenance records kept for 100,000 workflow data products
  • Generated more than 2M triples of metadata
▪ 24,135 nodes in the executable workflow generated by Pegasus, including:
  • Data stage-in jobs, data stage-out jobs, data registration jobs
▪ Executed in the USC HPCC cluster
  • Runtime is 1.9 CPU years

Semantic Workflows

▪ Workflow templates
▪ Dataflow diagram
  • Each constituent (node, link, component, dataset) has a corresponding variable
▪ Semantic properties
▪ Constraints on workflow variables
  • Examples shown: (TestData dcdom:isDiscrete false), (TrainingData dcdom:isDiscrete false)
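The two example triples can be read as constraints that any dataset bound to a workflow variable must satisfy. The sketch below shows that reading with plain Python data; the dataset names, the catalog contents, and the function `violations` are hypothetical, and a real system would use a semantic reasoner rather than dictionary lookups.

```python
# Constraints as (variable, property, value) triples, mirroring the slide.
constraints = [
    ("TestData", "dcdom:isDiscrete", False),
    ("TrainingData", "dcdom:isDiscrete", False),
]

# Hypothetical data catalog: dataset id -> metadata properties.
catalog = {
    "dataset-1": {"dcdom:isDiscrete": False},
    "dataset-2": {"dcdom:isDiscrete": True},
}


def violations(binding, constraints, catalog):
    """List the constraints not satisfied by a variable -> dataset binding."""
    failed = []
    for variable, prop, expected in constraints:
        dataset = binding.get(variable)
        if dataset is None:
            continue  # unbound variables are not checked in this sketch
        actual = catalog[dataset].get(prop)
        if actual != expected:
            failed.append((variable, prop, expected, actual))
    return failed


good = {"TestData": "dataset-1", "TrainingData": "dataset-1"}
bad = {"TestData": "dataset-2", "TrainingData": "dataset-1"}

print(violations(good, constraints, catalog))
print(violations(bad, constraints, catalog))
```

Checking constraints before execution is what lets a template reject an unsuitable dataset early, instead of failing inside a component at run time.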

Semantic Workflows [Kim et al CCPEJ 08; Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010; Gil et al JETAI 2011]

▪ Workflows augmented with semantic constraints
  • Each workflow constituent has a variable associated with it
    – Nodes, links, workflow components, datasets
    – Workflow variables can represent collections of data as well as classes of software components
  • Constraints are used to restrict variables, and include:
    – Metadata properties of datasets
    – Constraints across workflow variables
  • Incorporate the function of workflow components: how data is transformed
▪ Reasoning about semantic constraints in a workflow
  • Algorithms for semantic enrichment of workflow templates
  • Algorithms for matching queries against workflow catalogs
  • Algorithms for generating workflows from high-level user requests
  • Algorithms for generating metadata of new data products
  • Algorithms for assisting users with the creation of valid workflow templates

Mapping Domain Tasks to Logical Tasks

▪ Three separate catalogs:
  • Catalog of Domain Tasks (CDoT): conceptual tasks in the mission
  • Catalog of Logical Tasks (CAC): application codes, services, etc.
  • Catalog of Physical Tasks (CIC): installations of codes, services

▪ Workflow template: M = {G, W, R, π}
  • t in W, t ∈ CDoT
▪ Workflow instance: WI = f(M, D)
▪ Execution-ready workflow (map to L): ERW = f(WI, π, L)
  • t in WI, t ∈ CAC
▪ Executable workflow (map to P): EW = g(ERW, π, P)
  • t in ERW, t ∈ CIC
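The staged elaboration can be sketched as three small functions, each checking the catalog membership stated above (tasks in W belong to CDoT, tasks in WI to CAC, tasks in ERW to CIC). Everything concrete here is an assumption for illustration: the catalog contents, the linking tables `REALIZED_BY`, `INSTALLED_AS`, and `HOSTED_ON`, the reading of D as input data, and the use of a "first candidate not blocked" rule in place of the policies π.

```python
CDOT = {"find-airport"}                   # domain tasks
CAC = {"geo-query-v1", "geo-query-v2"}    # logical tasks (application codes)
CIC = {"geo-query-v1@site-a", "geo-query-v1@site-b", "geo-query-v2@site-c"}

# Hypothetical tables linking the catalogs and the physical resources P.
REALIZED_BY = {"find-airport": ["geo-query-v1", "geo-query-v2"]}
INSTALLED_AS = {
    "geo-query-v1": ["geo-query-v1@site-a", "geo-query-v1@site-b"],
    "geo-query-v2": ["geo-query-v2@site-c"],
}
HOSTED_ON = {
    "geo-query-v1@site-a": "host-1",
    "geo-query-v1@site-b": "host-2",
    "geo-query-v2@site-c": "host-3",
}


def first_allowed(candidates, blocked):
    """Stand-in for the policies pi: first candidate that is not blocked."""
    for candidate in candidates:
        if candidate not in blocked:
            return candidate
    raise LookupError("no candidate satisfies the policy")


def instantiate(template, data, blocked):
    """WI = f(M, D): bind data and pick a logical task per domain task."""
    assert all(task in CDOT for task in template)
    wi = [(first_allowed(REALIZED_BY[t], blocked), data[t]) for t in template]
    assert all(task in CAC for task, _ in wi)
    return wi


def to_execution_ready(wi, blocked):
    """ERW = f(WI, pi, L): pick an installation for each logical task."""
    erw = [(first_allowed(INSTALLED_AS[t], blocked), d) for t, d in wi]
    assert all(task in CIC for task, _ in erw)
    return erw


def to_executable(erw):
    """EW = g(ERW, pi, P): attach the physical resource of each installation."""
    return [(task, HOSTED_ON[task], d) for task, d in erw]


template = ["find-airport"]
wi = instantiate(template, {"find-airport": "passenger-locations"}, set())
erw = to_execution_ready(wi, blocked={"geo-query-v1@site-a"})
ew = to_executable(erw)
print(ew)
```

Because each stage only narrows the choice made by the previous one, blocking an installation or a code at any level changes the final mapping without touching the template, which is the flexibility the mission-centered approach relies on.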

Workflow Generation Algorithm [Gil et al 11]

Input: unified well-formed request
  1. Seed workflow from request -> seeded workflows
  2. Find input data requirements -> binding-ready workflows
  3. Data source selection -> bound workflows
  4. Parameter selection -> configured workflows
  5. Workflow instantiation -> workflow instances
  6. Workflow grounding -> ground workflows
  7. Workflow ranking -> top-k workflows
  8. Workflow mapping -> executable workflows

Implementing the Task Layer with Semantic Workflows

[Figure panels: workflows using mixed codes; workflows using WEKA codes]

Benefits

▪ Flexible mapping enables reassignment of tasks to logical and physical resources to protect the mission, but also:
  • Provides new failure recovery mechanisms
  • Enables portability of processes to new environments

[Figures: Drugome Workflow (with P. Bourne of UCSD); Social Media Analysis Workflow (with R. Sethi)]

Mapping Workflows to Provenance Traces [Garijo and Gil 11]

Building on the W3C PROV Standard (2012)

Towards Extending the W3C PROV Provenance Standard with a Workflow Language [Garijo & Gil 12]

Summary: “Cybersecurity Through Nimble Task Allocation: Semantic Workflow Reasoning for Mission-Centered Network Models”

▪ Problem: Existing cybersecurity measures for network intrusion follow a similar pattern
  • Discover intrusion -> Find vulnerability -> Fix problem -> Repeat
  • Unclear that these approaches can lead to secure systems
▪ Approach: Accomplish the mission assuming the network is compromised and subject to deception
  • Focus on protecting the mission, rather than the network
▪ Key Idea: Mission-Centered Network Models
  • Semantic workflows represent the mission
    – High-level tasks + goals + constraints
  • Reasoning to specialize and map the workflow onto the network
▪ Benefits: Nimble task allocation implies unpredictability to the opponent

Publications (available from http://wings-workflows.org)

▪ “From Application Codes to Domain Tasks in Workflows: Towards Extending the Portability of Workflows.” Gil, Y. Proceedings of the IEEE Conference on e-Science, Chicago, Illinois, October 2012.
▪ “Workflows of Domain Tasks: Extending the Flexibility and Portability of Workflow Approaches.” Gil, Y. Submitted to Seventh Workshop on Workflows in Support of Large-Scale Science (WORKS’12), held in conjunction with SC 2012, Salt Lake City, Utah, November 2012.
▪ “Common Motifs in Scientific Workflows: An Empirical Analysis.” Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Proceedings of the IEEE Conference on e-Science, Chicago, Illinois, October 2012.
▪ “A Primer for the PROV Provenance Model.” Gil, Y. and Miles, S. (Eds). World Wide Web Consortium (W3C), August 2012.
▪ “Automatic Metadata Annotation through Reconstructing Provenance.” Groth, P.; Gil, Y.; and Magliacane, S. Proceedings of the Third International Workshop on the role of Semantic Web in Provenance Management (SWPM), Heraklion, Greece, May 2012.
▪ “A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data.” Garijo, D., and Gil, Y. Proceedings of the Sixth Workshop on Workflows in Support of Large-Scale Science (WORKS'11), in conjunction with SC 2011, Seattle, Washington, Nov 2011.
▪ “A Semantic Framework for Automatic Generation of Computational Workflows Using Distributed Data and Component Catalogs.” Gil, Y.; Gonzalez-Calero, P. A.; Kim, J.; Moody, J.; and Ratnakar, V. 2011. Journal of Experimental and Theoretical Artificial Intelligence, 23(4):.
