Semantic Web 0 (0) 1 1 IOS Press

Semantic Modeling for Engineering Data Analytics Solutions

Madhushi Bandara a,*, Fethi A. Rabhi a a Department of and Engineering, University of New South Wales, NSW, Australia E-mails: [email protected], [email protected]

Abstract. Data Analytics Solution (DAS) engineering often involves multiple tasks from data exploration to result presentation which are applied in various contexts and on different datasets. Semantic modeling based on the open world assumption supports flexible modeling of linked knowledge. The objective of this paper is to review existing techniques that leverage technologies to tackle challenges such as heterogeneity and changing requirements in DAS engineering. We explore the application scope of those techniques, the different types of semantic concepts they use and the role these concepts play during DAS development process. To gather evidence for the study we performed a systematic mapping study by identifying and reviewing 82 papers that incorporate semantic models in engineering DASs. One of the paper’s findings is that existing models can be classified within four types of knowledge spheres: domain knowledge, analytics knowledge, services and user intentions. Another finding is to show how this knowledge is used in literature to enhance different tasks within the analytics process. We conclude our study by discussing the limitations of the existing body of research, showcasing the potential of semantic modeling to enhance DASs and presenting the possibility of leveraging ontologies for effective end-to-end DAS engineering.

Keywords: Semantic Modeling, Data Analytics, Development, Ontology

1. Introduction ple tasks such as identifying suitable datasets, devel- oping and validating analytics models and interpreting The business intelligence and analytics fields are final results. The software engineering aspect of DAS, rapidly expanding across all industry sectors and many which we refer to as "DAS engineering", involves de- organizations are trying to make analytics an inte- signing, developing and deploying DASs and includes gral part of everyday decision making [1, 2]. These tasks such as requirement elicitation, data integration fields include the techniques, technologies, systems, and process composition [6]. practices, methodologies and applications that are con- With the increasing popularity of big data as a re- cerned with analyzing critical business data to help an search area, the focus of the most research efforts have enterprise better understand its business and market been on developing specific analysis techniques (e.g. and make timely business decisions [2, 3]. machine learning algorithm design) but not on sup- There is no universally accepted definition for the porting the overall DAS engineering [7]. Within many process data analytics. CRISP-DM [4] and KD process organizations, analysts with limited programming ex- proposed by Fayyad et. al [5] are two examples of well perience are often required to manually establish re- developed and popular definitions. In the context of lationships between software components of the an- this paper, we identify a "data analytics process" (also alytics solution such as software services used for called an "analytics pipeline") as an end-to-end Data computation, and data elements or algo- Analytics Solution (DAS) that captures tasks related rithms [8, 9]. According to No-Free-Lunch theorem to data mining, knowledge discovery or business intel- [10], DAS engineering becomes further challenging as ligence. A DAS can be itself decomposed into multi- there is not a model that works best for every problem and depending on the application context and input *Corresponding author. E-mail: [email protected]. data, analysts have to try different techniques before

1570-0844/0-1900/$35.00 c 0 – IOS Press and the authors. All rights reserved 2 Semantic Modeling for Engineering Data Analytics Solutions getting optimal results. Most organizations are looking tions 5. The limitations of the study are discussed in for flexible solutions that align with their specific ob- section 6 and the paper concludes in section 7. jectives and IT infrastructures [11], usually resulting in the use of a mix of data sources and software frame- works. Understanding and managing these heteroge- 2. Background neous technologies need to be supported by a good infrastructure. In addition to The literature emphasizes the significance of knowl- incurring high software development costs, maintain- edge management in different fields such as enter- ing and evolving heterogeneous software infrastruc- prise data analytics [17] and scientific workflow [18], tures in the face of constant changes in both business and there have been many attempts at identifying requirements and technical specifications is very ex- knowledge specific to DAS. For example, the ADAGE pensive [12]. framework [19] proposes an approach that leverages Ontology is the formal foundation for semantic the capabilities of service-oriented architectures and modeling. The main roles of an ontology are to cap- scientific workflow management systems. The main ture domain knowledge, to evaluate constraints over idea is that the models used by analysts (i.e. work- domain data and to guide domain model engineering flow, service, and data models) contain concise infor- [13]. It is a powerful tool for modeling and reason- mation and instructions that can be viewed as an accu- ing [11]. As ontologies can produce a sound represen- rate record of the analytics process. These models can tation of concepts and the relationships between con- become useful artifacts for provenance tracking and cepts, they provide malleable models that are suitable can ensure reproducibility of such analytics processes. for tracking various kinds of software development ar- However, designing models that accurately represent tifacts ranging from requirements to implementation the complex business contexts and expertise associated code [14]. Although there have been many efforts in with a DAS still remains a challenge. Development methods such as CRISP-DM for enterprise-level data applying semantic modeling for DAS engineering, the mining [4] and Domain-oriented data mining [20] are overall picture of their capabilities is far from clear. advocating the necessity of using knowledge manage- Hence, this study systematically explores various re- ment techniques for capturing the business domain and search studies that are focusing on engineering appli- understanding of data in order to build better analytics cations that support DASs with the aid of semantic solutions. technology. Further, we identify unresolved challenges There are multiple known knowledge representa- and potential research directions in DAS engineering tion approaches related to different aspects of DAS space. We follow the systematic mapping study pro- such as UML diagrams [8, 21] and petri-nets [22] cess proposed by Petersen et. al. [15], collect evidence which are syntactically oriented and limited in the abil- from the publications in five prominent and ity to capture semantic details and reason from the extend the evidence further by snowballing [16] rele- knowledge [23]. The focus of this paper is on seman- vant references of identified studies. We conduct our tic knowledge representation with ontological models, study around the main research question of identify- a technology that originated from the Semantic Web ing the existing techniques that use semantic models in concept [24], where ontologies are expressed through DAS engineering and two sub-questions related to that. the RDF1, RDFS2 and OWL3 language notations. Al- We evaluate how different knowledge areas related to though has been part of the re- DASs such as mental models of the end-user, domain search landscape for a while, the industry is only just knowledge, of data, applicability of analyt- beginning to discover the power of linked open data, ics algorithms and tools for a particular task, compat- ontologies and semantic applications in assisting en- ibility between data and tools etc. are represented by hancements to the data analytics process. Whilst there semantic models and leveraged for conducting tasks are significant examples of leading companies related to DAS engineering. (e.g. Google, Amazon, and Facebook) beginning to ex- The rest of the paper is structured as follows. Sec- ploit the power of and domain ontolo- tion 2 presents background related to this study. Sec- tion 3 explains the review method that we followed. 1https://www.w3.org/RDF/ The results derived from the 82 identified studies are 2https://www.w3.org/TR/rdf-schema/ included in section 4 followed by a discussion in sec- 3https://www.w3.org/OWL/ Semantic Modeling for Engineering Data Analytics Solutions 3 gies (e.g. Schema.org, DBpedia) [25], many organiza- 1. What type of concepts are modeled/used by these tions are still largely unaware of the value that these techniques? approaches can bring [26–28]. 2. What tasks related to DAS engineering are en- To our knowledge, there is no formal study con- abled using the identified concepts? ducted on how semantic modeling has contributed to DAS engineering except the surveys conducted by 3.3. Search of Relevant Literature Abello et. al [11] and Ristoski and Paulheim [25]. In particular, Abello et. al [11] studied about using se- We adapted the work used in [15, 29–31] and iden- mantic web technologies for Exploratory OLAP, con- tified the following strategy to construct the search sidering the data extraction and integration aspects strings: while Ristoski and Paulheim [25] conducted a survey – Derive major terms used in the review questions in 2016 about the different stages of the knowledge – Search for synonyms and alternative words. discovery process that use semantic web data. In com- – Use the Boolean OR to incorporate alternative parison, our work is unique as it looks at the appli- spellings and synonyms cations of semantic models in DAS engineering from – Use the Boolean AND to link the major terms data analytics as well as a software engineering per- spective. To obtain a balance between sensitivity and speci- ficity as highlighted by Petticrew and Robert [32], we selected a search string that contains three ma- 3. Research Method jor terms related to the concepts: semantic technology, data analytics, and software engineering, connected by 3.1. Introduction a Boolean AND operation. Each term contains a set of keywords related to the respective concept, connected As our goal is to provide a holistic view of how by a Boolean OR operation. semantic modeling is used in DAS engineering land- The complete search string initially used for the scape, we conducted a systematic mapping study searching of the literature was as follows: (SMS). By the definition, SMS is used to provide an (("knowledge management" OR semantic OR overview of a research topic by showing the type of "" OR ontology OR "conceptual mod- research and results that have been published by cate- eling") AND ("big data analytics" OR "business gorizing them with the goal of answering a specific re- analytics" OR "data analytics" OR "scientific search question [15, 29]. We followed the process pro- workflow" OR "data mining") AND (requirement posed by Petersen et. al [15] to ensure the accuracy and OR "development process" OR "code genera- quality of the outcome. We conducted initial evidence tion")) search on five databases, two conference proceedings The primary search process involved the use of 5 and one journal related to semantic web technologies. online databases: Web of Science, Scopus, ACM Digi- Findings were extended through snowballing approach tal Library, IEEE Xplore, and ProQuest. The selection proposed by Wohlin [16]. of databases was based on our knowledge about those that index major publications related to computer sci- 3.2. Research Questions ence, software engineering and semantic technology. Based on the recommendations of domain experts, we The primary focus of our SMS is to identify and un- expanded the search space to the collection of proceed- derstand how semantic modeling approaches are used ings of International Semantic Web Conference and to represent and communicate knowledge of a data an- European Semantic Web Conference, their associated alyst as well as how existing DAS engineering tech- workshops, and the publications by the Semantic Web niques are leveraging this knowledge. The review was Journal accessible via DBLP search API. conducted on a primary research question and two sub- Upon completion of the primary search phase, we questions which are stated as follows: further expanded the identification of relevant liter- Primary Question: What are the existing tech- ature through snowballing. All the papers identified niques that use semantic modeling for DAS engineer- from the primary search phase were reviewed for rele- ing? vancy. If a paper satisfied the selection criteria, it was Sub-questions: added to the list of studies qualified for the synthesis. 4 Semantic Modeling for Engineering Data Analytics Solutions

3.4. Selection of Studies To avoid the inclusion of duplicate studies which would inevitably bias the result of the synthesis [35], Below are the exclusion criteria we adapted from we thoroughly checked if very similar studies were [33]: published in more than one paper. In total, 82 studies were included in the synthesis of evidence. 1. Books and news articles 2. Papers where semantic modeling was not applied 3.6. Constructing Classification Schemas directly to DAS engineering 3. Vision papers Our study requires two classification schemas to an- 4. Papers not written in English. swer sub research question 1 and 2, i.e. what type of se- 5. Application specific research that does not gener- mantic concepts are modeled/used by identified stud- alize (such as text extraction and web search ap- ies and what tasks related to DAS engineering are en- plications) abled using the identified concepts. 6. Infrastructure related or performance-oriented re- To construct each classification schema for our map- search. ping study we adapted the systematic process proposed 7. Full text that was not available for public access in [15]. We created the classification schema in a top- and not licensed by the University of New South down fashion, incorporating different classifications Wales proposed and used in literature. We used the abstract, We did not restrict the search to a span of time and in- introduction and conclusion of the selected 82 studies cluded all research available in the selected databases and aligned the studies with categories identified in the up to 27/06/2018. literature. When necessary, the classification schema was extended with keywords and categories defined in 3.5. Study Quality Assessment the selected literature to provide clarity and granular- ity. We designed a quality checklist to measure the qual- To answer sub-question 1, we distinguished four ity of the primary studies by reusing some of the ques- broad classes of concepts represented through ontolo- tions proposed in the literature [32, 34]. Our quality gies in identified studies, referred to as Domain, Ana- checklist comprised 4 general questions stated below: lytic, Service and Intent. This classification was guided by the proposal of Nigro [36] to use three ontology 1. Is the study related to DAS engineering? types in data mining. The first two: "Domain Ontolo- 2. Does the study leverage semantic models for in- gies" and "Ontologies for Data Mining Process" were formation modeling? included in our schema as the domain and analytics 3. Does the study provide sound evaluation? concepts respectively. The third one, called " 4. Are the findings credible? Ontologies", defines how variables are constructed. Because this definition is high-level and incomplete, Initially, one author went through the title, abstract we introduced two new concept types: Intent concepts and keywords of search results and divided papers into and Service concepts to capture knowledge that sup- 3 categories by relevancy: "Yes", "No" and "Maybe". ports requirements management and implementation Then the second author went through the full text management within DAS engineering. Using the evi- of the papers under the "Maybe" category to identify dence of identified literature we extended this classifi- whether they were compliant or not with our quality cation to a more granular level with different subtypes. checklist. The details of this classification are discussed in sec- Through the initial search, we identified tion 4.2. 1414 empirical studies as candidates. Among those re- When we look at the problem of classifying differ- sults, 63 (4.46% of 1414 studies) were identified as rel- ent DAS-related tasks for sub-question 2, there was evant studies, based on the study quality assessment no unique definition in the literature regarding what and exclusion criteria. The same steps were applied to constitutes a task in relation to DAS engineering. the literature identified through snowballing at the sec- Fayyad et. al. [5] propose a five-step process model for ond stage. We iterated through the references of 63 pa- knowledge discovery - selection, preprocessing, trans- pers selected during the initial search and identified 19 formation, data mining, evaluation and interpretation. additional relevant papers for our study. CRISP-DM proposed by Chapman et. al. [4] is more Semantic Modeling for Engineering Data Analytics Solutions 5 enterprise-oriented and breaks down the life-cycle into five steps: domain understanding, data understanding, data preparation, modeling, evaluation, and deploy- Software Code Engineering Generation ment. All identified studies do not follow any specific Related Tasks 9 model, some of them focus more on high-level tasks Analytics Solution Validation 8 Result Presentation such as domain understanding and process composi- 5 and Interpretaton tion while others support more granular tasks such as Service model selection. Composition 7 4 Model Building Hence, we combined key categories extracted from Data 3 Model Selection the identified 82 studies with the tasks proposed in dif- Integration 6 Data 2 Data Extraction & ferent literature and came up with 9 tasks under two Analytics Transformation categories as illustrated in Fig. 1. We initially identi- Related 1 Business Tasks fied five data analytics related tasks aligning CRISP- Understanding DM process model with the tasks proposed in the 82 studies - Business Understanding, Data Extraction and Transformation, Model Selection, Model Build- Fig. 1. Tasks associated with DAS engineering identified from the ing, Result Presentation and Interpretation. We further literature identified four tasks from the studies that were rooted in the software engineering literature: Data Integra- 4. Results tion, Service Composition, Analytics Solution Valida- tion and Code Generation. Software engineering re- 4.1. Primary Question: What are the existing lated tasks are important because they contribute to en- techniques that use semantic modeling for DAS hancing the overall quality of the DAS implementa- engineering? tion. Data analytics related tasks are conducted in the The complete list of identified studies is included order shown by the arrow, while software engineering in Appendix A. RDF, RDFS and OWL are the en- related tasks can be used to support different data an- coding languages widely accepted by the semantic alytics related tasks without a particular order. Each web community to represent ontologies. All the iden- DAS engineering project does not need to perform all tified studies, except S35, are relying on this nota- these tasks. The selection of tasks depends on the na- tion for their semantic model representation. S35 de- ture of the analysis being performed (e.g. whether we viates from that common practice and uses the Predic- need to choose amongst multiple competing data min- tive Model Markup Language (PMML) [37] and Back- ing algorithms or use a specific algorithm) and the con- ground Knowledge Exchange Format (BKEF) [38] to text in which it is being developed (e.g. whether we represent knowledge associated with DASs. need to support automatic code generation to save soft- By assessing the identified studies, we observed that ware development cost or not). these efforts vary in the application context they are addressing and in the way the analytics knowledge 3.7. Data Extraction and Mapping of Studies is modeled. Through the sub-questions in section 4.2 and 4.3, we explore the different semantic concepts During the data extraction phase, we read and sorted and tasks used by these 82 identified studies accord- papers in accordance with the classification schemas ing to the classification schemas mentioned in section and then reviewed them in detail. One author read and 3.6. Section 4.2 classifies different ontological con- classified the 82 papers according to the two schemas, cepts into four types and describes the characteristics noting down the rationale of why each paper belongs to of the knowledge they capture. In section 4.3, we relate the selected category. The second author reviewed the identified semantic concepts to their role in realizing table, discussed and resolved disagreements and com- and facilitating various DAS engineering tasks. piled the final mapping. The classification schemas de- veloped initially evolved through this phase, resulting 4.2. Sub-question 1. What type of concepts are in adding new subcategories and splitting in certain modeled/used by these techniques? scenarios. The finalized mappings and the associated details The mapping results according to the classification are discussed in the next section. schema described in section 3.6 are illustrated in Table 6 Semantic Modeling for Engineering Data Analytics Solutions

1. We identified that out of 82 studies, the majority (54 Under the first subtype, data preprocessing and in- studies) model domain concepts related to various ap- tegration, S14 proposes an analytics ontology for link- plication domains and 11 of them reuse publicly avail- ing different temporal and geographical datasets us- able ontologies. There are 30 studies that capture and ing concepts from different predefined ontologies. S20 use analytics concepts and 30 studies that capture ser- proposes a Rules ontology to store concepts related to vice concepts. Smallest representation was in the in- rules that transform one data schema to another. S27, tent category with 14 studies. A detailed analysis is S37, S41 and S54 propose concepts to model different presented in the following subsections. data preprocessing tasks such as null value removal, 4.2.1. Domain Concepts data format conversion, sampling and feature selec- These are context specific and high-level concepts tion. S80 proposes the Analytics-Aware Ontology to which represent domain-specific knowledge and ob- represent data aggregation and comparison functions. jects. We identified 54 studies that rely on different There are 25 studies under the subtype data ana- subtypes of domain concepts to support DAS engineer- lytics and mining. The majority of studies [S27, S36, ing. S37, S38, S41, S53, S54, S60, S62, S66, S75, S76, First subtype, application specific concepts repre- S78, S79] model analytics tasks and algorithms such as sent objects and relationships in a variety of niche ar- classification, clustering and regression. Analytic On- eas such as gene and protein analysis, health-care and tology [S15] is dedicated to statistical and machine transport as shown in Table 1. The solutions they pro- learning models. The Actor Ontology [S24] is lim- vide are highly coupled with a single application con- ited to image processing algorithms such as image text and provide less opportunity for adoption by other classification and feature extraction. The Data Min- applications. The majority of studies in this subtype ing Ontology in S17, the Simple Data Mining On- (32 studies) do not propose any specific domain con- tology in S22 and Expose ontology in S56 try to cepts but provide users with the ability to customize capture non-functional attributes, performance assess- concepts in any application context. ments such as sensitivity, specificity, accuracy and The second subtype represent studies that reused user satisfaction related to data mining algorithms. concepts from different publicly available ontologies. S63 and S74 propose Feature Ontologies to enhance We identified five publicly available domain-specific feature set descriptions in data analytics. S67 cap- ontologies used for designing analytics solutions- SSN tures event-condition-action rules for event process- Ontology [39], GeoVocab [43], Gene Ontology [40], ing with Rule Management Ontology. S69 proposes an GALEN ontology [41] and TOVE ontology [42]. extended event ontology to capture event processing 4.2.2. Analytics Concepts queries. In addition to the ontologies, three of the iden- Analytics concepts are closely aligned with the tified studies [S22, S27, S35, S66] capture details of knowledge that reflects different algorithms, computa- the analytics models using Predictive Model Markup tional models and the data-flow nature of the analyt- Language [37]. ics process in terms of inputs, outputs, and their com- Six studies focus on control and data flow related patibility. Analytic concepts provide a vocabulary for concepts. The Analytic Ontology [S1], DM Workflow defining and communicating analytics operations and Ontology [S75], SemNExt Ontology [S76] and DM related attributes. Further, they can help in describ- OPtimization Ontology [S78] include concepts that ing dependency relationships between variables. These capture the data flow nature of any analytics operation concepts are not coupled to a specific application do- with respect to an array of characteristics such as input main or context. requirements and preferences, input and output data There are 30 studies that leverage analytics con- types and accuracy. The Task-Method Ontology [S3] cepts. We classified them into three subtypes accord- represents concepts that are useful in controlling and ing to their role in representing different analytics tasks conducting MapReduce4 type of analytics. S62 pro- or methods: concepts related to data preprocessing and poses an ontology derived from CRISP-DM terminol- integration activities, concepts that capture data ana- ogy to represent control flow elements of a DAS. lytics and mining techniques and concepts that repre- sent the control and data flow nature of an analytics process. 4https://research.google.com/archive/mapreduce.html Semantic Modeling for Engineering Data Analytics Solutions 7

Table 1 Classification of semantic concepts modeled and used in identified literature. Classification Criteria Study Domain Application Gene and Protein Analysis S11, S27, S29, S76 Concepts Specific Health Care S2, S50 Sensor & Event Information S30, S31, S33, S55, S69, S70, S74 Spatiotemporal Information S40 Traffic Information S18, S31, S52 Enterprise Quality Management S7 Agriculture S32 Hyper-spectral Image Data S24 Power Grid S58,S64 S5, S8, S10, S12, S13, S19, S21, S23, S34, S35, S39, S42, S43, S44, S45, S47, S48, S49, S51, S57, S59, Custom Built S61, S62, S63, S65, S67, S68, S72, S73, S77, S80, S81, S82 Publicly SSN Ontology [39] S30, S31, S33, S40, S69, S74 Available Gene Ontology [40] S11, S44, S76 Ontology GALEN Ontology [41] S44 TOVE Ontologies [42] S7 GeoVocab Ontology [43] S40, S70 Analytic Data Preprocessing & Integration S14, S20, S27, S37, S41, S54, S80 Concepts S15, S17, S22, S24, S27, S35, S36, S37, S38, S41, Data Analytics & Mining S53, S54, S56, S60, S62, S63, S66, S67, S69, S74, S75, S76, S78, S79 Control & Data Flow S1, S3, S62, S75, S76, S78 Service Software Web OWL-S Based [44] S4, S11, S30, S31, S36 Component Concepts Services WSDL-S Based [45] S11 Management SAWSDL based [46] S20 WSMO Based [47] S4, S43 Hydra Vocabulary S40 Based [48] Custom Models S32, S37, S52 Library Specific APIs S27, S75, S79 Generic APIs S6 Data Multidimensional Data Schema S10, S62, S71 Management Graph Data Schema S9, S46 Composition Workflow Templates S4, S16, S20, S26, S53, S75, S78 Management Provenance Related S4, S25, S72 Quality Related S30, S31 Implementation Deployment Concepts S3, S4, S8, S69 Management Data Source S3, S8, S10, S38, S43, S71 Intent Analytic Query Expressions S2, S9, S21, S48 Concepts Analytics Requirements S3, S7, S28, S44, S53, S68 User Goals S4, S39, S66, S75 8 Semantic Modeling for Engineering Data Analytics Solutions

4.2.3. Service Concepts linking and executing multiple software services and Service concepts capture knowledge related to DAS data management operations together. Under that cate- architectures and platforms. We identified 30 stud- gory, some studies propose concepts (e.g.- Kepler on- ies that model different aspects, namely web services, tology S4, Data Mining Workflow Ontology S78) to software APIs, data schemas, workflow design, knowl- model scientific workflow templates and store their in- edge related to provenance or data quality, deploy- stances. Other composition management concepts are ment information and data sources. We classified ser- associated with provenance and quality. S4, S25 and vice concepts into four subtypes which are software S72 use ontologies to describe workflow provenance component management, data management, compo- concepts. Quality related concepts are modeled in S30 sition management and implementation management and S31 to describe the quality and accuracy of dif- (see Table 2). ferent event-based services. We observed that S25 and There are 10 studies under the software compo- S30 reuse concepts extended from the standard prove- nent management subtype, with the majority focusing nance ontology PROV-O [49]. on modeling web services that realize different oper- The last subtype is related to the implementation ations in the analytics process. Many of those stud- management of a DAS. S3 uses Deployment Ontol- ies adapt or extend publicly available semantic web ogy to describe the necessary deployment details for a service annotation ontologies: OWL-S [44], WSDL-S MapReduce based system such as configuration vari- [45], SAWSDL [46], WSMO [47] and Hydra vocabu- ables, initial inputs, variables for profiling and perfor- lary [48]. Particularly, S31 extends OWL-S services to mance measurements. S4 uses a Simulation Ontology support event processing via the Complex Event Ser- to capture runtime metadata related to workflows. S69 vice Ontology. There are three studies [S32, S37, S57] models event processing framework components such that define custom concepts to represent web services as alert streams via an event ontology. S3, S10, S38, used in DASs. S27 and S79 model Weka library spe- S43 and S71 propose concepts that describe the imple- cific software components. S6 provides the capability mentation of different data sources and how to access to model any generic API that can be used to imple- them. S8 proposes concepts to capture data source ac- ment DAS, through an ontology called Processing El- cess details as well as system development details that ement (PE) . This ontology describes represent the mappings between the database imple- software components based on their input and output mentation and domain concepts. data types as well as relevant implementation details 4.2.4. Intent Concepts such as a URL for an HTTP request or a related JAVA Intent concepts capture knowledge with respect to class. the data analyst’s requirements or goals. This knowl- The second subtype represents data management edge can be in the form of low-level queries performed concepts that support in representing and managing on data or high-level goals and intentions of users. data structures and schemas. S10 proposes the BI On- The first subtype represents concepts related to ana- tology to describe concepts such as dimension, hi- lytics query expression. The Analytical Queries (AnQ) erarchy, level, property and measures that can cap- model in S9 facilitates expressing user queries that ture data cubes related to Online Analytical Processing need to be performed on data. S21 and S48 propose (OLAP) operations of a data warehouse. S62 proposes to maintain a global ontology based on user-defined or Corporate Data Model Ontology to capture metadata standard concepts and use it to express user queries. about data schemas useful in data access. S71 uses 5 S2 proposes the i2b2 Information Ontology, an intent R2RML to map relational databases into RDF data. related model that helps analysts to describe various Under the graph data schema category, S46 proposes dimensions of interest in data that are related to a par- OpenCube ontology to describe concepts surrounding ticular task. OLAP cubes in a data warehouse and similarly, S9 pro- Under analytics requirements subtype, S28, S44, poses to use an analytical schema to create RDF data S53 and S68 propose ontologies that capture user warehouses. needs and constraints at a higher level. As an example, The third subtype has concepts related to composi- S44 uses an intent ontology called MIO (Multidimen- tion management, which model knowledge related to sional Integrated Ontology) which is auto-generated based on the topics, measures, and dimensions pro- 5https://www.w3.org/TR/r2rml/ vided by the user. The KnowledgeDiscoveryTask class Semantic Modeling for Engineering Data Analytics Solutions 9 in Knowledge Discovery Ontology [S53] and the Prob- searchers in applying semantic technology to support lem component of the Decision Support Ontology DAS engineering in 2014 and 2015 . [S68] are used as templates for instantiating analytics 4.3.1. Domain Understanding requirements. The Task-method Ontology proposed in The domain understanding task focuses on ana- S3 enables users to model desired methods that can lyzing the domain, context of the problem and un- be used to realize a particular task or to define the ex- derstanding available datasets. This helps to establish pected role of a variable within a task. S7 uses the solid definitions and facilitate the communication be- Measurement Ontology to capture concepts regarding tween different stakeholders. Moreover, the ontologies product inspection and testing requirements based on and concepts related to this task are inferable and the ISO 9001 standards. resulting knowledge has the flexibility of expanding Under the user goals subtype, the Scientist’s Intent over time. Ontology in S4, Goal Oriented Model in S39, Pur- Five methods [S5, S12, S18, S49, S59] use do- pose and Goal classes in DM3 Ontology [S66] and main concepts for domain understanding. The plat- Goal component of the Base Ontology [S75] provide forms proposed in S5 and S59 use these concepts the capability to express a set of high-level user goals to capture semantic and interpretive aspects of data such as the desired outcomes of analytics tasks and the whereas S18 uses them to provide a standard spec- decision-making processes around them. ification of data for analysts. In contrast, S49 uses these concepts to model expert knowledge related to 4.3. Sub-question 2: What tasks related to DAS an analytics problem which is helpful in understanding engineering are enabled using the identified the constraints and expected behavior. S12 proposes a concepts? feature-rich framework that can use custom-built do- main concepts to understand the context thorough data In this section, we analyze the association of 82 browsing and visualization. identified studies with the different tasks related to Two methods use intent concepts for domain un- DAS engineering and how the semantic concepts dis- derstanding. S39 uses a Goal Model to define require- cussed in section 4.2 are used to realize these tasks. ments that can help in understanding and designing a The classification schema for analytics tasks was de- data warehouse model. S28 extracts analytics require- scribed in section 3.6. ments expressed in natural language through an ontol- Table 2 shows the mapping of 82 studies among 9 ogy and proposes to refine them and identify specific tasks and 4 types of concepts. One study can be fo- data requirements by interviewing the stakeholders. cused on more than one task, using one or more con- 4.3.2. Data Extraction and Transformation cept types. Business understanding is the focus of 6 This task focuses on retrieving data from one or studies that leverage domain or intent concepts. Data more sources and preparing it for the subsequent anal- extraction and transformation approaches that use do- ysis. It involves transforming data into desired formats main, analytics or service concepts are proposed in 15 and annotating with additional metadata as well. studies. 31 papers propose data integration approaches There are 8 studies that apply domain concepts for mostly using domain concepts, but some use analytics, this task. S30, S33 and S70 annotate streaming in- service and intent concepts as well. Model selection put data using domain concepts to make data queri- (17 studies) and model building (15 studies) were con- able when necessary. In S18, data transformation is as- ducted using domain, analytics or intent concepts. All sisted by standard models built using domain concepts. four concept types were used to realize service com- S2, S29 and S40 propose approaches for on-demand position (20 studies) and solution validation (8 stud- data extraction based on domain concepts that describe ies). Code generation was supported by domain, ana- data sources. S34 focuses on the extraction of JSON lytics or service concepts in 4 studies. 9 studies that data from web resources, generating semantic con- proposed approaches for result presentation and inter- cepts around them and using those to convert data into pretation used one or two concept types out of four. ontology instances. S50 conducts data label correction Different applications of those concepts are de- using a domain ontology of medical entities. S59 en- scribed in more detail in the rest of this section. The ables analysts to understand datasets to help data ex- trend of publications related to each task is illustrated traction by querying knowledge represented in domain in Fig. 2. We observe a special interest among re- ontologies. S80 uses domain and analytics concepts to- 10 Semantic Modeling for Engineering Data Analytics Solutions

Table 2 Classification of literature by their application to different DAS engineering tasks and concepts used. Concept Classification Related Task Domain Concept Analytic Concept Service Concept Intent Concept Domain S5, S12, S18, S49 - - S28, S39 Understanding S2, S8, S18, S29, S30, Data Extraction S33, S34, S40, S50, S52, S80 S8, S10, S38, S52 - and Transformation S59, S70, S80 S2, S10, S12, S13, S19, S21, S23, S29, S34, S39, S40, S42, S43, S44, S45, Data Integration S14 S9, S46, S71 S2, S9, S21, S39, S44, S48 S47, S48, S49, S51, S55, S59, S64, S65, S70, S76, S77, S81 S15, S17, S22, S27, S37, Model Selection S27, S49, S61, S62, S38, S41, S53, S56, S60, - S53, S66, S75 S62, S66, S75, S78, S79 S7, S58, S63, S64, S67, S3, S54, S56, S63, S67, Model Building S69, S70, S73, S74, S77, - S3, S7 S69, S74, S80 S80, S82 Results Presentation S35, S48, S49, S57, S61, S76 S35 S4, S25 ,S26 S4, S48 and Interpretation S6, S11, S16, S20, S27, S11, S24, S27, S30, S31, S1, S20, S24, S27, S36, S30, S31, S32, S36, S37, Service Composition S53, S75 S32, S43, S52, S62 S41, S53, S62, S75, S78 S40, S43, S52, S53, S62, S75, S78 Analytics Solution S24, S61, S72, S76 S24, S75, S76 S25, S26, S72, S75 S4, S75 Validation Code Generation S47 S3 S3, S6, S38 -

14

12

10

8

6 NUMBEROF STUDIES 4

2

0 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 YEAR OF PUBLICATION

Domain Understanding (6) Data Extraction and Transofrmation (15) Data Integration (31) Model Selection (17) Model Building (15) Service Composition (20) Analytics Solution Validation (8) code Generation (4) Result Presentation and Interpretation (9) Total (82)

Fig. 2. Trend of publications over time, related to different DAS engineering tasks. Semantic Modeling for Engineering Data Analytics Solutions 11 gether to access and transform static as well as stream- sources so that relevant data can be acquired at query ing data. time from multiple sources. Three studies use data source related service con- The fifth one, used in S9, S43, S49, S55, S65, S71 cepts for data extraction. In S10, these concepts are and S81, is a query or requirement driven approach for used to define the organization of data and how to ac- data integration where formal rules or program logic cess it on demand. S38 uses a set of concepts that can are used to represent user queries or analytics ques- model any data source for model-driven code genera- tions. Different datasets are mapped into those rules to tion for data extraction. S52 uses service concepts to derive answers. S9 is unique in that it captures analyt- identify implementations of required data transforma- ics queries as intent concepts. tions in workflows. 4.3.4. Model Selection S8 uses both domain and service concepts (data Model selection is crucial for users who do not have source and implementation management) to extract intuitive knowledge about the performance of different data from heterogeneous sources such as event data models in different contexts or when there are many streams and map the extracted data into a database competing analytics models and techniques that can be schema. S52 uses domain concepts and service con- used for a single purpose. This task facilitates compar- cepts related to data processing and automatically gen- ison of algorithms or makes recommendations of tools erate new datasets on demand. and models suitable for users. 4.3.3. Data Integration There are 4 studies that use domain concepts for Data integration implies combining heterogeneous this purpose. S49 stores expert domain knowledge in datasets in order to obtain a high-level and coherent a domain ontology and uses that to evaluate possi- view of the data. Most studies use domain concepts to ble analytics models generated by associate rule min- aid in this task. Some studies use intent concepts to ing, incorporating an "interestingness" measure. S61 conduct integration based on user needs. S9, S14, S46 proposes a recommendation engine that maintains a and S71 are special cases that leverage analytics and repository of meta-data related to historical analytic service concepts to support the integration task. solutions using domain concepts and when a user pro- We identified five different strategies in which in- vides a new dataset, a matching solution is recom- termixed concepts from different classes contribute to mended to the user. In S27, suitable analytics tech- data integration. The first one, observed in S12, S13, nique for a particular dataset is identified by matching S19, S39, S42, S46 and S47, transforms data from het- dataset features represented in domain concepts with erogeneous sources into instances of a global ontol- the requirements of the analytics techniques captured ogy. S12, S13, S39 and S42 conduct ETL processes through analytics concepts. S62 supports model selec- incorporating this global ontologies to create semantic tion by allowing users to define the context of analytics aware data warehouses. problem via domain concept and using analytics con- The second one, identified in S2, S21, S23, S44, cepts to recommend appropriate analytics techniques. S48 and S77, uses a global ontology that represents the There are 15 methods that apply data analytics and user’s perspective and a set of local ontologies to repre- mining focused concepts for model selection. The sim- sent datasets. Then the data is matched accordingly by plest method proposed in S17 uses an ontology to de- aligning or converting local ontologies into the global scribe data analytics algorithms and creates a knowl- ontologies. Users can refer to the global ontology to edge repository to provide a querying capability for the query the data. users. The third approach describes each dataset through a The concepts defined in S15, S27, S37, S41, S53, local ontology and achieves data integration by merg- S60, S66, S75 and S78 assist users in matching ana- ing local ontologies together. S34 is an example where lytics components that suit their goal and constraints. each local ontology is constructed by extracting data S53, S66 and S75 are unique among those studies as provided in JSON format, generating suitable seman- they propose intent concepts to capture user goals and tic concepts and converting the extracted data into on- requirements which are then matched with suitable tology instances using generated concepts. models represented as analytics concepts. The fourth method used in 10 studies, S10, S14, In S22, analytics concepts are used to model data S29, S40, S45, S51, S59, S64, S70, S76, maintains mining algorithms, which are then linked to web ser- linked meta-data about datasets from different data vices, providing the means for service composition. 12 Semantic Modeling for Engineering Data Analytics Solutions

S38 and S79 use analytics concepts to capture existing ations and generate reports. Further, S48 proposes a analytics workflows to enable novice users to search method to automatically match the OLAP reports with and learn from them. other documents in a related repository. When considering the use of service concepts for 4.3.5. Model Building presenting results, S4, S25 and S26 capture knowledge Model building is a core data analytics task, where a about the different aspects of scientific workflows, es- selected model needs to be applied and customized for pecially aspects can help to describe and present the the problem at hand. outputs/results. S58, S64, S70, S73, S77 and S82 use domain con- In S4, intent concepts are used capture initial goals cepts to write analytic queries or event patterns exe- of the analysts and annotate workflows with them, in cuted on data to get descriptive analytics insights. S67 and S69 follows a similar approach, but use analytics order to identify different decisions that have led to the concepts to support users in writing queries or rules. outcome and to explain the results from the perspective S63 and S74 use domain concepts and analytics con- of the analyst. cepts to describe and customize feature space for an- 4.3.7. Service Composition alytics model construction. S7 leverages domain con- Service composition means identifying and putting cepts and intent concepts for model building. The gen- together different analytics related services to provide erated analytics model consists of axioms based on for- a complete or partial DAS from data acquisition and mal competency questions to evaluate whether an en- extraction to results generation. Scientific workflow terprise model represented through domain concepts planning and service composition are largely incorpo- adhere to the compliance standards represented as in- rated into this task. tent concepts. Identifying a suitable service or tool to include in S56 uses analytics concepts to define analytics ex- a DAS is a major activity within service composition. periments in detail for supervised classification of Some studies [S32, S36, S37, S40] use software com- propositional datasets. S54 proposes a case-based sys- ponent management concepts to guide component se- tem for model selection and uses analytics concepts to lection, but leave the responsibility of process compo- adapt the solutions suggested by case based reasoning sition to the analyst. S32 proposes a model to repre- to fit user’s interest. S3 applies analytics concepts to- sent and recommend web services using pre-defined gether with intent concepts for analytics model gener- rules, based on a context expressed through domain ation in a MapReduce based analytics solution. concepts. S37 and S40 use service concepts to model 4.3.6. Result Presentation and Interpretation a wide array of software components to be selected Storing knowledge related to different aspects of the from, including pre-processing capabilities such as analytics process through a semantic model inherently null value removal. Users can query the ontologies to provides a certain inference and interpretation capabil- identify suitable components. S36 proposes a method- ity that helps in result representation. There have been ology to facilitate the selection of suitable service im- some studies that specify particular methods that use plementations based on the input data. In S43 and S52, semantic concepts to explicitly facilitate result presen- suitable service implementations for datasets are iden- tation and interpretation. tified by matching characteristics represented through Six studies use domain concepts for result presen- domain and service concepts. tation and interpretation. S49 uses domain concepts to In contrast, S30 and S31 select matching data store expert knowledge and use it to validate data min- sources for a software component according to its abil- ing results and evaluate their interestingness. S57 uses ity to extract and process related data. Data provision domain concepts to interpret hypotheses and related at- services are annotated using OWL-S and SSN ontol- tributes on statistical datasets. S61 uses domain con- ogy concepts so that users can query them and iden- cepts to annotate data tables via ontology alignment, tify suitable services. Moreover, S30 use quality re- enabling easy interpretation. S35 incorporates domain lated service concepts to incorporate important qual- concepts with analytics concepts to generate reports ity attributes that need to be considered in component on the conducted data analytics tasks. S76 generates selection. metadata about numerical analysis using domain and Other studies extend service selection to facilitate analytics concepts. S48 uses domain concepts with in- composition planning and execution by incorporate tent concepts to extract meta-data about OLAP oper- different domain, service and analytics concepts. Semantic Modeling for Engineering Data Analytics Solutions 13

S24 uses domain concepts that describe datasets, as 4.3.8. Analytics Solution Validation well as data analytics and mining concepts to support Validation of the analytics solutions involves captur- workflow composition in Kepler tool, through match- ing provenance data, validating solution workflows for ing the analytics operations with data properties. S27 service compatibility or data consistency and confirm- uses domain, analytics and service concepts to sup- ing that the solution addresses the analyst’s goals. S61 port the selection of suitable data sources, analytics uses a domain ontology to generate metadata about in- techniques and software components respectively. S41 put data and output results, enabling provenance. S24 uses analytics concepts to represent components of the uses domain concepts that describe data and analytics Weka analytics tool and link these components as a concepts to validate the structural and semantic cor- process. It does not provide executable workflows but rectness of a workflow before execution. S76 captures recommends an analytics plan to be manually executed provenance data related to datasets, data sources as by the user. S1 proposes a similar approach for the well as operations around numerical analysis through MIT Lincoln Laboratory’s Composable Analytic En- domain and analytics concepts. S25 uses provenance- vironment that including executable workflow defini- related concepts to model workflow and data, and al- tions. S20 uses service concepts for software compo- low users to query them in order to validate workflows, nent modeling and link them by modeling data trans- identify defects or extract further information. S26 de- formation rules with analytics concepts. S62 uses con- fines Research Object concept as an instance of a sci- trol and data flow concepts to guide the knowledge dis- entific workflow, for provenance purposes. S75 pro- covery process and supports decision making at each poses a solution validation approach that uses analytic, stage using a knowledge base that encompasses do- service and intent concepts to annotate data, operators, main, analytic, and service concepts. S78 uses analytics and workflow template concepts models, data mining tasks and KDD workflows as an to generate analytics processes, with a major focus on extension to RapidMiner tool. The Scientist’s Intent performance optimization. S16 uses concepts related Ontology in S4 uses goal focused intent concepts to to workflow templates to store pre-composed analytics describe user goals that are used for workflow valida- processes which can be queried by users in order to tion. S72 uses domain and service concepts to capture select a suitable implementation. provenance of ETL workflows. S53 captures analytics, service and intent concepts 4.3.9. Code Generation through Knowledge Discovery Ontology and offers Methods that support code generation rely on on- support for planning abstract analytics processes. S6 tologies to convert abstract models into executable an- supports software composition through components alytics software. This is an important task as it reduces modeled as generic APIs by matching respective in- the burden of software programming for data analysts. puts and outputs. It assists users in planning and Code generation can be used to support multiple stages matching analytics components with a comprehensive of an analytics process (workflow) execution. goal-based planning algorithm by considering the in- Most techniques use service concepts (e.g. S3, S6 put and output conditions. Similarly, S75 uses ana- lytic, service and intent concepts to capture the user and S38) to drive a code generation process in a Model goals and KDD workflows implemented in Rapid- Driven Engineering (MDE) fashion. For example, data Miner. These concepts are used to identify optimal an- sources modeled as service concepts in S38 are used alytics process for a new analytics problem based on to generate data extraction software modules. In S6 Hierarchical Task Network planning. generic API modeling service concepts are used to S11 supports comprehensive scientific workflow generate an executable analytics process. In S3, ana- composition with a graphical user interface based on lytics concepts and service concepts related to imple- the Taverna workflow engine, SADI/BioMoby plug- mentation management contributes to capture the im- ins and SADI-compliant web services. It uses web ser- plementation details of analytics task. This enables a vice concepts to model components that implement semi-automated code generation scheme for selected data analytics algorithms and domain concepts to de- analytics techniques. The method proposed in S47 is scribe input and output data. Those concepts are used the only one that uses domain concepts to model data to recommend services that match analytics require- sources, which are then used for generating code that ments or data constraints. extracts data from data sources into linked datasets. 14 Semantic Modeling for Engineering Data Analytics Solutions

5. Discussion example, S37 models both analytics and service con- cepts in one ontology, and S3 models both analytics 5.1. Limitations of Existing Work and intent concepts in the Task-methods ontology. S27 contains two ontologies (WekaOntology and ProtOn- This section summarizes the limitations we identi- tology) that cut across concepts of all classes without fied by studying what type of semantic concepts were proper separation. used in DASs and how those different concepts types 5.1.3. Little Support for End-to-End Development were applied in different analytics tasks. Process 5.1.1. Limited Usage of Intent Concepts Though there is an array of research on adapting se- Although there are 14 studies that propose intent mantic models for different development tasks such as concepts, (see Table 2), we observe that only a few data integration or model selection, only a few stud- of them use intent concepts at different stages of the ies seem to go beyond addressing one or two tasks in analytics process, with the exception of data integra- the DAS engineering lifecycle. In many cases, knowl- tion. The survey did not identify any studies that use edge from previous tasks would have been very use- intent concepts for data extraction and transformation, ful if carried over to the next tasks. Hence there is a although this is a computationally expensive and time- lack of studies that propose semantic modeling based consuming task that may waste resources if not per- solutions to support the end-to-end DAS engineering formed aptly. Existing techniques use intent concepts lifecycle. mostly on facilitating the search of algorithms, data We identified 4 studies that use semantic models for providers, web services, and computational software code generation (section 4.4.9), related to data trans- modules. Hence to a large extent, analytics require- formation and analytics process execution. They are ments such as what business decisions will be sup- also limited to a specific domain or a tool and do not ported by this analysis or what level of accuracy is re- provide sufficient flexibility to be used for a wider quired, are still a part of the mental model of the devel- class of DASs. oper or the analyst who performs these tasks. In prac- tice, several iterations of data cleansing, reformatting, 5.2. Recommendations for Future Research model selection and process composition may be re- quired in order to optimally address the analytics prob- Upon the findings of this study, we propose a set lem at hand. This can result in less effective DASs with of recommendations for future research regarding the degrading performance over time. In addition, modifi- application of semantic models for DAS engineering. cations to the process can only be conducted by some- 5.2.1. Developing Intent Concepts for Analytics one with a sound understanding of the original analyt- As discussed in section 5.1.1, service composition ics requirements. Moreover, as discussed by Canhoto techniques among the identified studies do not lever- [50], cognitive and context information, which can be age intent concepts adequately. One reason could be captured through intent concepts, is crucial for accu- that intent concepts are too high-level (e.g. a business rate interpretation and validation of a DAS. We believe goal) or too low level (a query). As the data analytics that incorporating suitable intent concepts further can community is extending wider into different industries enhance the efficiency and effectiveness of DAS engi- and organizations and as analytics contexts and re- neering. quirements are changing rapidly, it is necessary to ex- 5.1.2. Lack of Proper Concept Classification plore techniques that consider all dimensions such as Semantic concepts can be classified in many differ- business requirements, contexts and constraints. Hence ent ways. For an example, S43 separates domain, an- a potential research area is to study how high-level user alytics and service knowledge in three ontologies and goals and contexts can be represented and incorporated S79 uses a class hierarchy to separate data mining re- in DAS engineering through data integration, process lated entities as the process, information content and construction, and result interpretation. Initial work in realizable entities. Yet in some studies, this separa- this direction can be found in Bandara et. al. [51]. tion of concept types is not clearly visible. One ontol- Such approaches that promote the linking of user in- ogy with a unique prefix may contain concepts related tentions and contexts into analytics models have the to one or more categories without attempting to fol- potential of changing static analytics models deployed low modular approaches such as class hierarchies. For today into dynamic and adaptable analytics models, re- Semantic Modeling for Engineering Data Analytics Solutions 15 sponding to changes in user goals or the operational knowledge of whether the data is time-series or not context. can be used to reduce the set of algorithms should con- 5.2.2. Decoupling Concept Classes and Encouraging sider) and the relationship between the data and other Concept Reuse Across Development Tasks concepts. Representation of existing knowledge and In section 5.1.2 we discussed that it is more effective enabling efficient reuse of accumulated knowledge and to decouple different concepts as separate ontologies resources can reduce the effort spent on expert consul- as it leads to better and modular knowledge manage- tations or employee training. ment. Then each concept type can be reused or evolve independently of the others, enabling users to change 5.2.3. Semantic Model-Driven DAS Engineering the application domain, implementation, data source or Finally, our evidence reveals the opportunities of us- the analytics requirements without altering other mod- ing semantic models for code generation in the light els. The integration between those different knowl- of model-driven engineering methods. This needs to edge areas has to be done separately within the DAS be explored and experimented further as it has the po- environment, considering the context as well. Some tential of lifting the burden of software programming studies achieve concept integration through program expertise from data analysts. There are already some logic or annotation schemes, but it would be useful to examples of applying MDE for data analytics applica- have standard, platform-independent ways of model- tions [52], such as creating Hadoop MapReduce analy- ing the relationships between different types of ana- lytics knowledge to match the context of a particular sis through conceptual models [53]. A promising find- analytics process. ing is that four of the identified studies related to code To promote the usage of semantic models among generation (section 4.3.9) use semantic models that are the research community and to enhance the value and well aligned with the four ontologies proposed by Pan the reusability of the research, it is essential to advo- et.al. in their book Ontology-Driven Software Engi- cate the reuse of ontologies. This enables the creation neering [14]. In this book, the authors align a Require- of a common vocabulary and the resulting data/models ment ontology (intent concepts) to a Computational become interoperable among a variety of systems. We Independent Model (CIM), an Infrastructure Ontol- observed certain ontologies like SSN [39] and Gene ogy (service concepts) to a Platform Specific Model Ontology[40] are being used in multiple research stud- (PSM). They propose to use a domain ontology for ies. There are ongoing efforts such as OBO Foundry converting a CIM to a Platform Independent Model 6 that recommend re-using classes already defined in other ontologies classes. Yet we believe there is room (PIM) and a business process ontology (analytics con- for the system development community to adopt more cepts) to convert a PIM to a PSM. Hence we believe concepts from well-developed ontologies, particularly that studying the use of ontologies to develop analytics analytics concepts proposed in ontologies such as On- solutions in a model-driven fashion, particularly adapt- toKDD [S60] and OntoDM [S79], to improve user sup- ing the framework proposed by Pan et.al [14], is timely port for analytics process design. and significant. As knowledge represented through ontologies can Semantic-based service orchestration plays a signif- enhance each task of the DAS engineering process, a icant role in realizing a semantic model driven analyt- standard framework for designing and extending on- ics environment, as all operations from data exporting, tologies that is usable in all analytics process stages is integration into model building, execution and result necessary. The ontologies should incorporate knowl- publication can be done as independent services mod- edge related to domain concepts and business goals as well as the concepts useful for the execution level. For ules represented through semantic models. Yet there is example, an ontological representation of a data source a lack of applications that integrate the existing body may contain information necessary to retrieve data, but of research related to semantic based service orches- also information about the data quality, the latency of tration such as [54–56] with semantic data analytics data acquisition, metadata that can be used to decide process construction research. Such a combination can which algorithm is suitable to process the data (e.g.the contribute to a paradigm of service-based DASs and establish the basis for semantic model-driven data an- 6www.obofoundry.org/ alytics systems. 16 Semantic Modeling for Engineering Data Analytics Solutions

6. Limitations of the Study 7. Conclusion

There are limitations to our study, mainly due to Capturing knowledge using models to drive the soft- the literature selection process, including the selec- ware development life cycle is at the heart of the soft- tion of keywords and the construction of inclusion and ware engineering discipline. Traditional models have exclusion criteria. Firstly, our study focuses on peer- serious limitations in the area of building DASs, which reviewed publications in academic literature only. So are characterized by the need to represent rich knowl- gray literature such as technical reports, white papers edge encompassing specialized application domains, and unpublished work was not included. Secondly, the complex computing infrastructures and changing an- study might be missing some relevant work due to the alytics requirements. This has triggered our interest search string failing to match other relevant publica- in the use of semantic modelling and ontologies as a tions within the digital libraries. snowballing helped to way of underpinning new software development prac- tices in this area. In this paper, we presented 82 stud- eliminate this limitation to a certain level. These limi- ies identified through a systematic mapping study, that tations are in-line with our exclusion criteria, yet they leverage semantic modeling for engineering DASs. We pose a risk for the completeness and validity of the re- adopted a broad approach encompassing distinct re- sults. search areas such as data mining and service comput- One other limitation is that the selected databases ing. may not contain all related literature, especially ap- The results of our study reveal the diversity of plications published in domain-specific venues such knowledge representation in existing studies. Through as medical journals. We attempted to reduce the im- sub-question 1 we identified what type of seman- 7 pact of such limitations by using the Web of Science tic concepts are modelled and used in the literature. database, which exposes our search query into diverse They were falling under four main categories: domain, disciplines. Both authors have a software engineering analytic, service and intent concepts. Through sub- and modeling background and hence the paper selec- question 2 we identified the different categories of ana- tion and mapping may be biased to that point of view in lytics or software engineering related tasks mentioned spite of the efforts made to conduct an unbiased litera- in the identified literature. Different types of concepts ture filtering process, especially in the cases where the were observed to play different roles in improving inspected papers do not provide a clear-cut definitions each task and supporting various stages of DAS engi- of their research problem and proposed approach. neering. Semantic modeling was highly used for tasks As we conduct a systematic process to identify and such as data integration, model selection, process com- map literature, this paper may contain some outdated position and data extraction, which shows the abil- work or not reflect the most recent achievements in the ity of semantic models to represent heterogeneous re- discipline. Recent ontologies published related to an- sources. Studies that focus on model selection and pro- alytics and data modeling such as OntoDT [57] were cess composition tasks highlight the capability of se- not observed in any identified work. This may be due mantic models to provide end-user support for analyt- to the limitation of the search approach and authors be- ics solution engineering. We identified and discussed lieve there will be future research that utilizes existing some limitations in existing work such as the limited analytics related knowledge in DAS engineering. usage of intent concepts and the lack of end-to-end As the goal of this study is to present a holistic support for DAS engineering. overview on how semantic modeling has been used Recommended future work, discussed in section in engineering DAS, summarizing two decades of re- 5.2, emphasizes the importance of moving semantic search, it is beyond the scope of this paper to drill technology out of certain research silos and aims at de- down certain specific characteristics of analytics solu- veloping new research agendas around capturing high- tions that support particular tasks and conduct a thor- level intents and goals of data analysts and translating ough evaluation. We limit our contribution to a map- them to executable analytics processes, incorporating a ping study which can be used by researchers to study multitude of well-defined semantic knowledge reposi- certain aspects extensively in the future. tories that have the capacity to be developed, expanded and maintained independently from each other. This can be achieved within established software engineer- 7http://apps.webofknowledge.com ing frameworks, but they need to be specifically tai- Semantic Modeling for Engineering Data Analytics Solutions 17 lored to the particular characteristics of the DAS engi- S10 D. Sell, D. C. d. Silva, F. D. Beppler, M. Napoli, neering life-cycle as presented in [6]. As the next stage F. B. Ghisi, R. Pacheco, et al., SBI: a seman- of this research effort, authors are working on design- tic framework to support business intelligence, ing a requirement driven platform that provides sup- In Proceedings of the first international work- port for end-to-end analytics process engineering, in- shop on Ontology-supported business intelli- corporating semantic concept types identified through gence, 2008, p. 11. this mapping study [51, 56]. S11 D. Withers, E. Kawas, L. McCarthy, B. Van- dervalk, and M. Wilkinson, Semantically-guided workflow construction in Taverna: the SADI and Appendix A. List of Included Studies BioMoby plug-ins, Proceedings of the 4th inter- national conference on Leveraging applications S1 K. K. Nam, R. T. Locke, S. Yenson, K. Chang, of formal methods, verification, and validation and L. H. Fiedler, Advisory services for user com- 6415, 2010, pp. 301-312. position tools, In (ICSC), S12 S. Andrews, The CUBIST Project: Combining 2015 IEEE International Conference on, 2015, and Uniting Business Intelligence with Semantic pp. 9-16. Technologies, International Journal of Intelligent S2 J. G. Klann, A. Abend, V. A. Raghavan, K. D. Information Technologies 9, 2013, pp. 1-15. Mandl, and S. N. Murphy, Data interchange using S13 R. P. Deb Nath, K. Hose, and T. B. Peder- i2b2, Journal of the American Medical Informat- sen, Towards a Programmable Semantic Extract- ics Association 23, 2016, pp. 909-15. Transform-Load Framework for Semantic Data S3 T. W. Wlodarczyk, C. Rong, B. Jia, L. Cocanu, Warehouses, Proceedings of the ACM Eighteenth C. I. Nyulas, and M. A. Musen, DataStorm An International Workshop on Data Warehousing Ontology-Driven Framework for Cloud-Based and OLAP, 2015, pp. 15-24. Data Analytic Systems, 2010 6th World Congress S14 B.L. Do, T.D. Trinh, P. R. Aryan, P. Wetz, E. on Services (SERVICES-1), 2010, pp. 123-127. Kiesling, and A. M. Tjoa, Toward a statistical data S4 E. Pignotti, P. Edwards, N. Gotts, and G. Polhill, integration environment, Proceedings of the 11th Enhancing workflow with a semantic description International Conference on Semantic Systems, of scientific intent, Web Semantics: Scinece, Ser- 2015, pp. 25-32. vices and Agents on the 9, 2011, S15 M. V. Nural, M. E. Cotterell, and J. A. Miller, Us- pp. 222-244. ing Semantics in Predictive Big Data Analytics., S5 E. W. Kuiler, From big data to knowledge: an on- In Big Data (BigData Congress), 2015 IEEE In- tological approach to big data analytics, Review ternational Congress, 2015, pp. 254-261. of Policy Research 3, 2014, pp. 311-318. S16 Y. Gil, V. Ratnakar, E. Deelman, M. Sprara- S6 P. Coetzee and J. Stephen, Goal-based analyt- gen, and J. Kim, Wings for Pegasus: A Seman- ics composition for on-and off-line execution at tic Approach to Creating Very Large Scientific scale, Trustcom/BigDataSE/ISPA 2015 2, IEEE, Workflows, OWL: Experiences and Directions 2015, pp. 56-65. (OWLED), 2006. S7 H. M. Kim, M. S. Fox, and A. Sengupta, How to S17 M.S. Lin, H. Zhang, and Z.G. Yu., An Ontol- build enterprise data models to achieve compli- ogy for Supporting Data Mining Process, Com- ance to standards or regulatory requirements (and putational Engineering in Systems Applications, share data), Journal of the Association for Infor- IMACS Multiconference on 2, 2006, pp. 2074– mation Systems 8, 2007, p. 105. 2077. S8 C. Pautasso and O. Zimmermann, Linked Enter- S18 R. G. Wang, W. D. Dai, and J. R. Cheng, An prise Data Model and its use in Real Time Ana- Ontology-Based Data Mining Framework in Traf- lytics and Context-Driven Data Discovery, IEEE fic Domain, Applied Mechanics and Materials Software 32, 2015, pp. 7-9. 121-126, 2011, pp. 55-59. S9 D. Colazzo, F. Goasdoue, I. Manolescu, and A. S19 A. Bouchra, A. A. Wakrime, A. Sekkaki, and K. Roatis, RDF Analytics: Lenses over Semantic Larbi, Automating Data warehouse design using Graph, Proceedings of the 23rd international con- ontology, International Conference on Electrical ference on World wide web, 2014, pp. 467-478. and Information Technologies, 2016. 18 Semantic Modeling for Engineering Data Analytics Solutions

S20 S. Shumilov, Y. Leng, M. El-Gayyar, and A. B. S30 D. Puiu, P. Barnaghi, R. Tonjes, D. Kumper, M. I. Cremers, Distributed Scientific Workflow Man- Ali, A. Mileo, et al., CityPulse: Large Scale Data agement for Data-Intensive Applications, 12th Analytics Framework for Smart Cities, IEEE Ac- IEEE International Workshop on Future Trends of cess 4, 2016, pp. 1086-1108. Distributed Computing Systems, FTDCS’08 12, S31 F. Gao, M. I. Ali, and A. Mileo, Semantic discov- 2008, pp. 65-73. ery and integration of urban data streams, Pro- S21 J. A. R. Castillo, A. Silvescu, D. Caragea, J. ceedings of the Fifth International Conference on Pathak, and V. G. Honavar, Information extrac- Semantics for Smarter Cities 1280, 2014, pp. 15- tion and integration from heterogeneous, dis- 30. tributed, autonomous information sources - a fed- S32 Z. Laliwala, V. Sorathia, and S. Chaudhary, Dis- erated ontology-driven query-centric approach, tributed, Semantic and Rule Based Event-driven IEEE International Conference on Information Services-Oriented Agricultural Recommendation Reuse and Integration, 2003, pp. 183-191. System, 26th IEEE International Conference S22 X. Gong, T. Zhang, F. Zhao, L. Dong, and H. Yu, on Distributed Computing Systems Workshops, On Service Discovery for Online Data Mining 2006, pp. 24-24. Trails, Second International Workshop on service S33 S. Qanbari, N. Behinaein, R. Rahimzadeh, and S. discovery for online data mining trails 1, 2009, Dustdar, Gatica: Linked Sensed Data Enrichment pp. 478-482. and Analytics Middleware for IoT Gateways, S23 M. Gagnon, Ontology-based integration of data 2015 3rd International Conference on Future In- sources, 10th International Conference On Infor- ternet of Things and Cloud (FiCloud), 2015, pp. mation Fusion , 2007, pp. 1-8. 38-43. S24 J. Zhang, Ontology-Driven Composition and Val- S34 Y. Yao, H. Liu, J. Yi, H. Chen, X. Zhao, and X. idation of Scientific Grid Workflows in Kepler- Ma, An automatic semantic extraction method for a Case Study of Hyperspectral Image Process- web data interchange, 6th International Confer- ing, Fifth International Conference on Grid and ence on Computer Science and Information Tech- Cooperative Computing Workshops (GCCW’06.), nology (CSIT), 2014, pp. 148-152. 2006. S35 T. Kliegr, V. Svatek, M. Ralbovsky and M. S25 A. Chebotko, S. Lu, X. Fei, and F. Fotouhi, RDF- Simunek, SEWEBAR-CMS: semantic analytical Prov: A relational RDF store for querying and report authoring for data mining results, Journal managing scientific workflow provenance, Data of Intelligent Information Systems & Knowledge Engineering 69, 2010, pp. 836-865. 37(3), 2010, S26 K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, pp. 371-395. K. Hettne, R. Palma, et al., Using a suite of on- S36 T. Marinho, E. B. Costa, D. Dermeval, R. Fer- tologies for preserving workflow-centric research reira, L. M. Braz, I. I. Bittencourt, et al., An objects, Web Semantics: Science, Services and ontology-based software framework to provide Agents on the World Wide Web 32, 2015, pp. 16- educational data mining, in 2010 ACM Sympo- 42. sium on Applied Computing, 2010, pp. 1433- S27 M. Cannataro, P. H. Guzzi, T. Mazza, G. Tradigo, 1437. and P. Veltri, Using ontologies for preprocessing S37 B. T. G. S. Kumara, I. Paik, J. Zhang, T. H. A. and mining spectra data on the Grid, Future Gen- S. Siriweera, and K. R. C. Koswatte, Ontology- eration Computer Systems 23, 2007, pp. 55-60. Based Workflow Generation for Intelligent Big S28 N. Kushiro, A Method for Generating Ontolo- Data Analytics, In Web Services (ICWS), 2015 gies in Requirements Domain for Searching Data IEEE International Conference on, 2015, pp. Sets in Marketplace, 2013 IEEE 13th Interna- 495-502. tional Conference on Data Mining Workshops S38 J.N. Mazon, J. Jacobo Zubcoff, L. Garriga, and R. (ICDMW), 2013, pp. 688-693. Espinosa, Knowledge Spring Process - Towards S29 P. Gong, W. Qu, and D. Feng, An Ontology for Discovering and Reusing Knowledge within Linked the Integration of Multiple Genetic Disorder Data Open Data Foundations, In Proceedings of 3rd Sources, 2005 IEEE Engineering in Medicine International Conference on Data Management and Biology 27th Annual Conference, 2006, pp. Technologies and Applications, 2014, pp. 291- 2824–2827. 296. Semantic Modeling for Engineering Data Analytics Solutions 19

S39 L. Bellatreche, K. Selma and N. Berkani, Se- beled multi-class big social data, In European Se- mantic data warehouse design: From ETL to de- mantic Web Conference, Springer, Cham, 2014, ployment a la carte., International Conference pp. 349-363. on Database Systems for Advanced Applications, S51 S. K. Ramnandan, A. Mittal, C. A. Knoblock, Springer Berlin Heidelberg, 2013. and P. Szekely, Assigning semantic labels to data S40 M. Frank and S. Zander, Smart Web Services for sources, In European Semantic Web Conference, Big Spatio-Temporal Data in Geographical Infor- Springer, Cham, 2015, pp. 403-417. mation Systems, ESWC workshop: Services and S52 J. L. Ambite and D. Kapoor, Automatically com- Applications over Linked APIs and Data (SALAD posing data workflows with relational descrip- 2016), 2016. tions and shim services, In The Semantic Web, S41 F. P. Abraham Bernstein, Shawndra Hill, Toward Springer, Berlin, Heidelberg, 2007, pp. 15-29. intelligent assistance for a data mining process: S53 M. Zakova, Monika, P. Kremen, F. Zelezny, An ontology-based approach for cost-sensitive and N. Lavrac, Automating knowledge discovery classification, IEEE Transactions on knowledge workflow composition through ontology-based and data engineering 17, 2005, pp. 503-518. planning, IEEE Transactions on Automation Sci- S42 S. K. Bansal, Towards a Semantic Extract-Transform- ence and Engineering 8(2), 2011, pp. 253-264. Load (ETL) framework for Big Data Integration, S54 M. Charest, S. Delisle, O. Cervantes, and Y. Shen, 2014 IEEE International Congress on Big Data Bridging the gap between data mining and de- (Bigdata Congress), 2014, pp. 521-528. cision support: A case-based reasoning and on- S43 D. Sell, L. Cabral, E. Motta, J. Domingue and tology approach, Intelligent Data Analysis 12(2), R. Pacheco, Adding semantics to business intelli- 2008, pp. 211-236. gence., Proceedings of the Sixteenth International S55 J. P. Calbimonte, O. Corcho, and A. J. Gray, En- Workshop on Database and Expert Systems Ap- abling ontology-based access to streaming data plications, 2005, pp. 543-547. sources, In International Semantic Web Confer- S44 V. Nebot, R. Berlanga, J. M. Perez, M. J. Aram- ence, Springer, Berlin, Heidelberg, 2010, pp. 96- buru and T. B. Pedersen, Multidimensional inte- 111. grated ontologies: A framework for designing se- S56 J. Vanschoren, and L. Soldatova, Expose: An on- mantic data warehouses, In Journal on Data Se- tology for data mining experiments, In Interna- mantics 8, 2009, pp. 1-36. tional workshop on third generation data min- S45 D. Skoutas and A. Simitsis, Designing ETL pro- ing: Towards service-oriented knowledge discov- cesses using semantic web technologies, Pro- ceedings of the 9th ACM international workshop ery (SoKD-2010), 2010, pp. 31-46. on Data warehousing and OLAP, ACM, 2006. S57 H. Paulheim, Generating possible interpretations S46 L. Etcheverry and A. Vaisman, Enhancing OLAP for statistics from linked open data, In Extended analysis with web cubes, The Semantic Web: Re- Semantic Web Conference, Springer, Berlin, Hei- search and Applications, 2012, pp. 469-483. delberg, 2012, pp. 560-574. S47 A. Harth, C. A. Knoblock, S. Stadtmuller, R. S58 Q. Zhou, Y. Simmhan, and V. Prasanna, Incor- Studer and P. Szekely, On-the-fly integration of porating semantic knowledge into dynamic data static and dynamic linked data, In Proceedings of processing for smart power grids, In International the Fourth International Conference on Consum- Semantic Web Conference, Springer, Berlin, Hei- ing Linked Data 1034, 2013, pp. 1-12. delberg, 2012, pp. 257-273. S48 T. Priebe and G. Pernul, Ontology-based integra- S59 D. A. Dimitrov, J. Heflin, A. Qasem, and N. tion of OLAP and information retrieval, Proceed- Wang, Information integration via an end-to-end ings of 14th International Workshop on Database distributed semantic web system, In International and Expert Systems Applications, IEEE, 2003. Semantic Web Conference, Springer, Berlin, Hei- S49 L. Brisson and M. Collard, An ontology driven delberg, 2006, pp. 764-777. data mining process, In International Conference S60 C. Diamantini, D. Potena, and E. Storti, KD- on Enterprise Information Systems, 2008, pp. 54- DONTO: An ontology for discovery and com- 61. position of KDD algorithms. Third Generation S50 M. Guo, Y. Liu, J. Li, H. Li, and B. Xu, A Data Mining: Towards Service-Oriented Knowl- knowledge based approach for tackling misla- edge Discovery (SoKD’09) ,2009, pp. 13-24. 20 Semantic Modeling for Engineering Data Analytics Solutions

S61 Y. Gil, P. Szekely, S. Villamizar, T. C. Harmon, V. S72 A. Freitas, B. Kampgen, J. G. Oliveira, S. O’Riain, Ratnakar, S. Gupta, M. Muslea, F. Silva, and C. and E. Curry, Representing interoperable prove- A. Knoblock, Mind your metadata: Exploiting se- nance descriptions for ETL workflows, In Ex- mantics for configuration, adaptation, and prove- tended Semantic Web Conference, Springer, Berlin, nance in scientific workflows. In International Se- Heidelberg, 2012, pp. 43-57. mantic Web Conference, Springer, Berlin, Heidel- S73 K. Teymourian and A. Paschke, Semantic rule- berg, 2011, pp. 65-80. based complex event processing, In International S62 M. Choinski, and J. A. Chudziak, Ontological Workshop on Rules and Rule Markup Languages learning assistant for knowledge discovery and for the Semantic Web, Springer, Berlin, Heidel- data mining. In Computer Science and Informa- berg, 2009, pp. 82-92. tion Technology, 2009. IMCSIT’09. International S74 M. Ringsquandl, S. Lamparter, S. Brandt, T. Multiconference on, IEEE, 2009, pp. 147-155. Hubauer, and R. Lepratti, Semantic-guided fea- S63 A. Tsymbal, S. Zillner, and M. Huber, Ontology- ture selection for industrial automation systems, Supported Machine Learning and Decision Sup- In International Semantic Web Conference, Springer, port in Biomedicine, In International Conference Cham, 2015, pp. 225-240. on Data Integration in the Life Sciences, Springer, S75 J. U. Kietz, F. Serban, S. Fischer, and A. Bern- Berlin, Heidelberg, 2007, pp. 156-171. stein, Semantics Inside! But let’s not tell the Data S64 A. Albalushi, R. Khan, K. McLaughlin, and S. Miners: Intelligent Support for Data Mining, In Sezer, Ontology-based approach for malicious European Semantic Web Conference, Springer, behaviour detection in synchrophasor networks, Cham, 2014, pp. 706-720. In Power & Energy Society General Meeting, S76 E. W. Patton, E. Brown, M. Poegel, H. De Los IEEE, 2017, pp. 1-5. Santos, C. Fasano, K. P. Bennett, and D. L. S65 M. Rodriguez-Muro, R. Kontchakov, and M. Za- McGuinness, SemNExT: A Framework for Se- kharyaschev, Ontology-based data access: Ontop mantically Integrating and Exploring Numeric of databases, In International Semantic Web Con- Analyses, In SemStats@ ISWC, 2015. ference, Springer, Berlin, Heidelberg, 2013, pp. S77 M. Spahn, J. Kleb, S. Grimm, and S. Scheidl, 558-573. Supporting business intelligence by providing S66 Y. Li, M. A. Thomas, and K. M. Osei-Bryson, ontology-based end-user information self-service, Ontology-based data mining model management In Proceedings of the First international Work- for self-service knowledge discovery, Informa- shop on ontology-Supported Business intelli- tion Systems Frontiers 19(4), 2017, pp. 925-943. gence, ACM, 2008, p. 10. S67 J. Debattista, S. Scerri, I. Rivera, and S. Hand- S78 C. M. Keet, A. Lawrynowicz, C. dAmato, A. schuh, Ontology-based Rules for Recommender Kalousis, P. Nguyen, R. Palma, R. Stevens, and Systems, In SeRSy, 2012, pp. 49-60. M. Hilario, The data mining OPtimization on- S68 M. Rospocher and L. Serafini, Ontology-centric tology, Web Semantics: Science, Services and Decision Support, In SeRSy, 2012, pp. 61-72. Agents on the World Wide Web 32, 2015, pp. 43- S69 K. Taylor and L. Leidinger, Ontology-driven 53. complex event processing in heterogeneous sen- S79 P. Panov, L. N. Soldatova, and S. Dzeroski, To- sor networks, In Extended Semantic Web Con- wards an ontology of data mining investigations, ference, Springer, Berlin, Heidelberg, 2011, pp. In International Conference on Discovery Sci- 285-299. ence, Springer, Berlin, Heidelberg, 2009, pp. 257- S70 F. Lecue, R. Tucker, V. Bicer, P. Tommasi, S. 271. Tallevi-Diotallevi, and M. Sbodio, Predicting S80 E. Kharlamov, Y. Kotidis, T. Mailis, C. Neuen- severity of road traffic congestion using seman- stadt, C. Nikolaou, O. Ozcep, C. Svingos et al., tic web technologies. In European Semantic Web Towards analytics aware ontology based access Conference, Springer, Cham, 2014, pp. 611-627. to static and streaming data, In International Se- S71 Y. Khan, A. Zimmermann, A. Jha, D. Rebholz- mantic Web Conference, Springer, Cham, 2016, Schuhmann, and R. Sahay Querying web poly- pp. 344-362. stores, In Big Data (Big Data), 2017 IEEE Inter- S81 D. Calvanese, G. De Giacomo, D. Lembo, M. national Conference on, IEEE, 2017, pp. 3190- Lenzerini, A. Poggi, M. Rodriguez-Muro, R. 3195. Rosati, M. Ruzzi, and D. F. Savo, The MASTRO Semantic Modeling for Engineering Data Analytics Solutions 21

system for ontology-based data access, Semantic [18] S. Shumilov, Y. Leng, M. El-Gayyar and A.B. Cremers, Dis- Web 2(1), 2011, pp. 43-53. tributed scientific workflow management for data-intensive ap- S82 D. Anicic, S. Rudolph, P. Fodor, and N. Sto- plications, IEEE, 2008, pp. 65–73. janovic, Stream reasoning and complex event [19] L. Yao and F.A. Rabhi, Building architectures for data- intensive science using the ADAGE framework, Wiley Online Semantic Web processing in ETALIS, 3(4), 2012, Library, 2015, pp. 1188–1206. pp. 397-407. [20] G. Wang and Y. Wang, 3DM: domain-oriented data-driven data mining, IOS Press, 2009, pp. 395–426. [21] S. Luján-Mora, J. Trujillo and I.-Y. Song, A UML profile References for multidimensional modeling in data warehouses, Elsevier, 2006, pp. 725–769. [1] Gartner Research, Gartner says advanced analytics is a top [22] H. Macià, V. Valero, G. Díaz, J. Boubeta-Puig and G. Ortiz, business priority, Gartner business intelligence & analytics Complex Event Processing Modeling by Prioritized Colored summit 2014 (2015). Petri Nets, IEEE, 2016, pp. 7425–7439. [2] H. Chen, R.H. Chiang and V.C. Storey, Business intelligence [23] H. Cheng and Z. Ma, A literature overview of knowledge shar- and analytics: from big data to big impact, JSTOR, 2012, ing between Petri nets and ontologies, Cambridge University pp. 1165–1188. Press, 2016, pp. 239–260. [3] O. Marjanovic, Improvement of Knowledge-Intensive Busi- [24] T. Berners-Lee, J. Hendler, O. Lassila et al., The semantic web, ness Processes Through Analytics and Knowledge Sharing, New York, NY, USA:, 2001, pp. 28–37. 2016. [25] P. Ristoski and H. Paulheim, Semantic Web in data mining and [4] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, knowledge discovery: A comprehensive survey, Elsevier, 2016, C. Shearer and R. Wirth, CRISP-DM 1.0 Step-by-step data pp. 1–22. mining guide (2000). [26] J. Cardoso, M. Hepp and M.D. Lytras, The semantic web: real- [5] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthu- world applications from industry, Vol. 6, Springer Science & rusamy, Advances in knowledge discovery and data mining, Business Media, 2007. Vol. 21, AAAI press Menlo Park, 1996. [27] M.D. Lytras and R. García, Semantic Web applications: a [6] F. Rabhi, M. Bandara, A. Namvar and O. Demirors, Big Data framework for industry and business exploitation–What is Analytics Has Little to Do with Analytics, in: Service Research needed for the adoption of the Semantic Web from the market and Innovation, Springer, 2017, pp. 3–17. and industry, Inderscience Publishers, 2008, pp. 93–108. [7] D. Crankshaw, J. Gonzalez and P. Bailis, Research for practice: prediction-serving systems, ACM, 2018, pp. 45–49. [28] V.R. Benjamins, J. Davies, R. Baeza-Yates, P. Mika, [8] R. Espinosa, D. García-Saiz, M. Zorrilla, J.J. Zubcoff and J.- H. Zaragoza, M. Greaves, J.M. Gomez-Perez, J. Contreras, N. Mazón, Enabling non-expert users to apply data mining for J. Domingue and D. Fensel, Near-term prospects for semantic bridging the big data divide, in: International Symposium on technologies, IEEE, 2008. Data-Driven Process Discovery and Analysis, Springer, 2013, [29] B.A. Kitchenham and S. Charters, Procedures for Performing pp. 65–86. Systematic Literature Reviews in Software Engineering, 2007. [9] S.T. March and A.R. Hevner, Integrated decision support [30] E. Mendes, A systematic review of research, systems: A data warehousing perspective, Elsevier, 2007, IEEE, 2005, p. 10. pp. 1031–1043. [31] N. Salleh, E. Mendes and J. Grundy, Empirical studies of pair [10] M. Magdon-Ismail, No free lunch for noise prediction, MIT programming for CS/SE teaching in higher education: A sys- Press, 2000, pp. 547–564. tematic literature review, IEEE, 2011, pp. 509–525. [11] A. Abelló, O. Romero, T.B. Pedersen, R. Berlanga, V. Nebot, [32] M. Petticrew and H. Roberts, Systematic reviews in the social M.J. Aramburu and A. Simitsis, Using semantic web technolo- sciences: A practical guide, John Wiley & Sons, 2008. gies for exploratory OLAP: a survey, IEEE, 2015, pp. 571– [33] K. Khan, R. Kunz, J. Kleijnen and G. Antes, Systematic reviews 588. to support evidence-based medicine, Crc Press, 2011. [12] P. Russom, Big data analytics, 2011, p. 40. [34] L. Spencer, J. Ritchie, J. Lewis and L. Dillon, Quality in quali- [13] F. Baader, The handbook: Theory, implemen- tative evaluation: a framework for assessing research evidence, tation and applications, Cambridge university press, 2003. Cabinet Office, 2003. [14] J.Z. Pan, S. Staab, U. Aßmann, J. Ebert and Y. Zhao, Ontology- [35] B. Elsevier, Scopus overview: What is it (2008). driven software development, Springer Science & Business [36] H.O. Nigro, Data Mining with Ontologies: Implementations, Media, 2012. [15] K. Petersen, R. Feldt, S. Mujtaba and M. Mattsson, Systematic Findings, and Frameworks: Implementations, Findings, and Mapping Studies in Software Engineering., in: EASE, Vol. 8, Frameworks, IGI Global, 2007. 2008, pp. 68–77. [37] A. Guazzelli, M. Zeller, W. Chen and W. G., PMML: An Open [16] C. Wohlin, Guidelines for snowballing in systematic literature Standard for Sharing Models, 2009. studies and a replication in software engineering, in: Proceed- [38] T. Kliegr, V. Svátek, M. Šimunek, D. Stastny` and A. Hazucha, ings of the 18th international conference on evaluation and as- An XML schema and a ontology for formalization of sessment in software engineering, ACM, 2014, p. 38. background knowledge in data mining, in: IRMLeS-2010, 2nd [17] J. Taylor, Framing Requirements for Predictive Analytic ESWC Workshop on Inductive Reasoning and Machine Learn- Projects with Decision Modeling (2015). ing for the Semantic Web, Heraklion, Crete, Greece, 2010. 22 Semantic Modeling for Engineering Data Analytics Solutions

[39] M. Compton, P. Barnaghi, L. Bermudez, R. GarcíA-Castro, [50] A.I. Canhoto, Ontology-Based Interpretation and Validation of O. Corcho, S. Cox, J. Graybeal, M. Hauswirth, C. Henson, Mined Knowledge: Normative and Cognitive Factors in, IGI A. Herzog et al., The SSN ontology of the W3C semantic sen- Global, 2007, p. 84. sor network incubator group, Elsevier, 2012, pp. 25–32. [51] M. Bandara, A. Behnaz, F.A. Rabhi and O. Demirors, From [40] Gene Ontology Consortium et. al., The Gene Ontology (GO) requirements to data analytics process: An ontology-based ap- database and informatics resource, Oxford Univ Press, 2004, proach (In Press), in: Proceedings of the 5th International pp. 258–261. Workshop on the Interrelations between Requirements Engi- [41] J. Rogers and A. Rector, The GALEN ontology, IOS Press, neering and Business Process Management at BPM 2018, 1995, pp. 174–178. Springer, 2018. [42] H.M. Kim, M.S. Fox and M. Grüninger, An ontology for qual- [52] A. Rajbhoj, V. Kulkarni and N. Bellarykar, Early experience ity management- enabling quality problem identification and with model-driven development of mapreduce based big data tracing, Springer, 1999, pp. 131–140. application, in: Software Engineering Conference (APSEC), [43] GeoVocab.org, GeoVocab, 2012 (accessed 04-Dec-2018), http: 2014 21st Asia-Pacific, Vol. 1, IEEE, 2014, pp. 94–97. //geovocab.org/. [53] S. Ceri, E. Della Valle, D. Pedreschi and R. Trasarti, Mega- [44] D. Martin, M. Burstein, J. Hobbs, O. Lassila, D. McDermott, modeling for big data analytics, Springer, 2012, pp. 1–15. S. McIlraith, S. Narayanan, M. Paolucci, B. Parsia, T. Payne [54] A.-L. Lamprecht, S. Naujokat, T. Margaria and B. Steffen, et al., OWL-S: Semantic markup for web services, 2004, Semantics-based composition of EMBOSS services, BioMed pp. 2007–04. [45] R. Akkiraju, J. Farrell, J.A. Miller, M. Nagarajan, A.P. Sheth Central, 2011, p. 5. and K. Verma, Web service semantics-WSDL-S (2005). [55] N. Mehandjiev, F. Lécué, M. Carpenter and F.A. Rabhi, Co- [46] J. Kopecky,` T. Vitvar, C. Bournez and J. Farrell, Sawsdl: Se- operative service composition, in: International Conference on mantic annotations for wsdl and schema, IEEE, 2007. Advanced Information Systems Engineering, Springer, 2012, [47] J. Domingue, D. Roman and M. Stollberg, Web service model- pp. 111–126. ing ontology (WSMO)-An ontology for semantic web services [56] M. Bandara, F.A. Rabhi and R. Meymandpour, Semantic (2005). Model Based Approach for Knowledge Intensive Processes [48] Markus Lanthaler, Hydra Core Vocabulary, 2017 (accessed 04- (In Press), in: In Proceedings of International Conference on Dec-2018), https://www.hydra-cg.com/spec/latest/core/. Software Process Improvement and Capabity Determination [49] T. Lebo, S. Sahoo, D. McGuinness, K. Belhajjame, J. Cheney, (SPICE), Springer, 2018. D. Corsar, D. Garijo, S. Soiland-Reyes, S. Zednik and J. Zhao, [57] P. Panov, L.N. Soldatova and S. Džeroski, Generic ontology of Prov-o: The prov ontology, 2013. datatypes, Elsevier, 2016, pp. 900–920.