Using the Lexicon from Source Code to Determine Application Domains

Total Page:16

File Type:pdf, Size:1020Kb

Using the Lexicon from Source Code to Determine Application Domains Using the Lexicon from Source Code to Determine Application Domains Andrea Capiluppi∗, Nemitari Ajienkay, Nour Ali∗, Mahir Arzoky∗, Steve Counsell∗, Giuseppe Destefanis∗ Alina Miron∗, Bhaveet Nagaria∗, Rumyana Neykova∗, Martin Shepperd∗, Stephen Swift∗, Allan Tucker∗ ∗Department of Computer Science, Brunel University London, UK yDepartment of Computer Science, Edge Hill University, UK Abstract—Context: The vast majority of software engineering As in the example reported in [21], the extensive study of research is independent of the application domain: techniques all JSON parsers available would find similarities between and tools usage is reported without any domain context. This has them or common patterns. That type of study would focus not always been the case - early in the computing era, there were totally separate application domains (for example, scientific and on one particular language (JSON), one specific domain data processing) and the research focus was often application- (parsers) and inevitably draw limited conclusions. On the specific. other hand, considering the “parsers” domain (but without Objective: This paper claims that software systems should be focusing on one single language) would show the common separated and analysed into domain clusters. We propose a code- characteristics of developing that type of systems, irrespective based approach to identify the application domain of a software system, via its lexicon. We compare its precision with the plain of their language. The thrust of this paper stems from the work textual description attached to the same system. of several prominent researchers who called the empirical Method: Using a sample of 50 Java projects, we obtained i) software engineering community to ‘go deeper, not wider’ [24] the description of each project (e.g., its ReadMe file), ii) the and ‘minding the mine, mining the mind’ [18]. As a matter lexicon extracted from its source code, and iii) a list of its main of fact, there is increasing evidence that empirical research topics extracted with the LDA information retrieval technique. We assigned a random subset of these data items to different on software systems might yield domain-specific results, for researchers (i.e., ‘experts’), and asked them to assign each item instance when clustering the studied systems by the domains to one (or more) application domain. We then evaluated the that they implement ([11], [27], [23]). precision and accuracy of the three techniques. There are two main challenges to domain-based empirical Results: Using the agreement levels between experts, We software engineering: first, there is currently no commonly observed that the ‘baseline’ dataset (i.e., the ReadMe files) agreed taxonomy of application domains [15]. Several at- obtained the highest average in terms of agreement between experts, but we also observed that the three techniques had the tempts have been performed with varying success: past re- same mode and median agreement levels. Additionally, in the search on domains has focused on creating a domain tax- cases where no agreement was reached for the baseline dataset, onomy in a top-down fashion [1], i.e., starting with a seed the two other techniques provided sufficient support. taxonomy and refining it with various techniques (e.g., via Conclusions: We conclude that using the corpora or the topics expert judgement) [12]. The second challenge is assigning a from source code can be an adequate substitution to plain description when assigning a software system to an application software system to a domain: again, expert judgements have domain. been applied more often than automatic assignments in past Index Terms—Application Domains, Source Code, Java, Latent literature [4], but it has become clear that the chosen domains Dirichlet Allocation, Expert Judgement, Machine Learning were selected as either too fine-grained or too large, thus defying the purpose of the categorisation. I. INTRODUCTION In this paper, we argue that the extraction of topics that Albeit the diversity and context of software systems have emerge from the source code can help the assignment to received some attention in the past [26], [10], contemporary domains by experts, and instead of (or aside of) reading research in the computing field is almost entirely application- the documentation that accompanies a software system [25]. independent. This has not always been the case - early in the Researchers and practitioners can assign a system to a domain computing era, there were totally separate application domains by only focusing on a limited number of core topics that (for example, scientific and data processing) and the research emerge more strongly (e.g., with larger weight) from the focus was often application-specific [15]. lexicon of source code. In the context of empirical software engineering research, For this purpose we collected the description of 50 Java while the main goal of empirical papers is to achieve the software systems, alongside the lexicon of their source code generality of the results, the domain, context and uniqueness and a list of their most relevant topics. We asked 10 researchers of a software system have not been considered very often to assign a unique subset of these data files to the most by researchers. The most common rationale for doing so is appropriate domain, and triangulated their expert opinions to to analyze projects having different application domains to discuss the following research questions: decrease threats due to the generalizability of the results. 1) RQ1: is the lexicon an acceptable substitute for the Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected] description of a software system, for the purpose of exceptions in catch blocks. Concretely, content management assigning it to a domain? systems consistently have more exception handling code and 2) RQ2: similarly, are the topics (as extracted by LDA) throw more custom exceptions, as opposed to editors/viewers, acceptable substitutes? which have less error handling code and mainly use standard This paper is organised as follows: section II deals with exceptions instead. related work and illustrates how our paper advances the state Fayad and Smidt [11] explored software frameworks and of the art. Section III describes the empirical approach that classified frameworks based on related application domains, we adopted to extract the data sources and to assign them e.g., operating system and communication frameworks and in subsets to researchers for expert judgement. Section IV user interface frameworks. The authors emphasised that in summarises the results of the expert judgement task, while contrast to earlier OO reuse techniques based on class libraries, section V discusses the findings and their relevance for practi- frameworks are targeted for particular business units (such as tioners. Section VI details the threats to valisity and section VII data processing or cellular communications) and application finally concludes. domains (such as user interfaces or real-time avionics). They also highlight the fact that the next generation of OO applica- II. RELATED WORK tion frameworks target application domains but on the other This work is an extension of a previously published pa- hand, application developers in more complex domains such per [6] where we posited that keywords, corpora and topics as telecommunications and distributed medical imaging have could significantly help in establishing the provenance of traditionally lacked standard “off-the-shelf” frameworks; as a a software system, given a list of pre-defined application result these developers in such domains largely implement, test domains. Now we plan to evaluate the plain description of and maintain software systems from scratch. These findings a system (e.g., in the form of a ReadMe file) against the further show the need to treat project by domains for more corpora and the topics extracted from the source code. The distinct empirical results or observations. The need for a means current paper should therefore be considered as the enactment of identifying which domain a project belongs to is also of the future work proposed in [6]. Prior research has shown highlighted as OSS developers for example can contribute that the number and size of open-source projects are growing frameworks for projects in the telecommunications domain exponentially and open-source projects are becoming more upon identifying such projects and their required functionality. diverse by expanding into different domains [9], [17]. In In past research, software projects have been assigned to view of this and to reduce the effort required in manual application domains by glancing at the source code, or its categorisation of software projects, Tian et al. [25] proposed general description (e.g., the ReadMe file, or the project a new technique based on text mining to categorise software documentation); creating categories and finally assigning a projects irrespective of the programming language used in their project to a category. The research by Borges et al., [4] development. Their contribution is based on the analysis of the follows that approach: the dataset contains 5,000 GitHub documentation that accompanies the software project, rather project (including 520 Java projects). There are two main than the source code, as we propose in our paper. issues with this approach: the first is that there is hardly any In general, distinct results have been observed when more consistency in how a project might get documented by its attention is paid to the categorisation of analysed software developers, meaning that the approach in [4] becomes non- projects.
Recommended publications
  • Security to Android Mobile Applications: Camera Based
    ISSN 2319-8885 Vol.04,Issue.29, August-2015, Pages:5534-5539 www.ijsetr.com Security to Android Mobile Applications: Camera Based Security and Theft Alert Mobile Phones 1 2 3 SANA SULTANA , G.VEERANJANEYULU , SALEHA FARHA 1PG Scholar, Dept of CSE, Shadan Women’s College of Engineering and Technology, Hyderabad, TS, India, E-mail: [email protected]. 2Professor, Dept of CSE, Shadan Women’s College of Engineering and Technology, Hyderabad, TS, India, E-mail: [email protected]. 3HOD, Dept of CSE, Shadan Women’s College of Engineering and Technology, Hyderabad, TS, India, E-mail: [email protected]. Abstract: Now Smartphone’s are simply small computers with added services such as GSM, radio. Thus it is true to say that next generations of operating system will be on these Smartphone and the likes of windows, IOS and android are showing us the glimpse of this future. Android has already gained significant advantages on its counterparts in terms of market share. One of the reason behind this is the most important feature of Android is that it is open-source which makes it free and allows any one person could develop their own applications and publish them freely. This openness of android brings the developers and users a wide range of convenience but simultaneously it increases the security issues. The major threat of Android users is Malware infection via Android Application Market which is targeting some loopholes in the architecture mainly on the end-users part. Attackers can implement spy cameras in malicious apps such that the phone camera is launched automatically without the device owner’s notice, and the captured photos and videos are sent out to these remote attackers.
    [Show full text]
  • Building Vision and Voice Based Robots Using Android
    1st Annual International Interdisciplinary Conference, AIIC 2013, 24-26 April, Azores, Portugal - Proceedings- BUILDING VISION AND VOICE BASED ROBOTS USING ANDROID Pradeep N. , Assistant Professor Dr. Mohammed Sharief, Professor Dr. M. Siddappa, Professor Dept. of Computer Science and Engineering, SSIT, Tumkur, Karnataka, India Abstract: This project uses Android, along with cloud based APIs and the Arduino microcontroller, to create a software-hardware system that attempts to address several current problems in robotics, including, among other things, line-following, obstacle-avoidance, voice control, voice synthesis, remote surveillance, motion & obstacle detection, face detection, remote live image streaming and remote photography in panorama & time-lapse mode. The rover, built upon a portable five-layered architecture, will use several mature and a few experimental algorithms in AI, Computer Vision, Robot Kinetics and Dynamics behind-the-scene to provide its prospective user in defense, entertainment or industrial sector a positive user experience. A minimal amount of security is achieved via. a handshake using MD5, SHA1 and AES. It even has a Twitter account for autonomous social networking. Key Words: Voice Synthesis, Voice Control, Robot-User Interaction, Face Detection, Motion and Object Detection, Remote Surveillance, Panoramic and Time-lapse Photography, Line-following and Obstacle-Avoiding Robotics, AI, Computer Vision, Image Processing, Network Security Introduction Android, which is actually a software stack, running atop the Linux kernel, is one of the most successful operating systems available to mobile users today. This success can be attributed to the fact that Android makes development of application softwares ('apps') really easy by allowing developers to code in Java (among many other languages) and thus enabling them to take advantage of the innumerable packages and modules made available by the Java community.
    [Show full text]
  • Quick Install for AWS EMR
    Quick Install for AWS EMR Version: 6.8 Doc Build Date: 01/21/2020 Copyright © Trifacta Inc. 2020 - All Rights Reserved. CONFIDENTIAL These materials (the “Documentation”) are the confidential and proprietary information of Trifacta Inc. and may not be reproduced, modified, or distributed without the prior written permission of Trifacta Inc. EXCEPT AS OTHERWISE PROVIDED IN AN EXPRESS WRITTEN AGREEMENT, TRIFACTA INC. PROVIDES THIS DOCUMENTATION AS-IS AND WITHOUT WARRANTY AND TRIFACTA INC. DISCLAIMS ALL EXPRESS AND IMPLIED WARRANTIES TO THE EXTENT PERMITTED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE AND UNDER NO CIRCUMSTANCES WILL TRIFACTA INC. BE LIABLE FOR ANY AMOUNT GREATER THAN ONE HUNDRED DOLLARS ($100) BASED ON ANY USE OF THE DOCUMENTATION. For third-party license information, please select About Trifacta from the Help menu. 1. Release Notes . 4 1.1 Changes to System Behavior . 4 1.1.1 Changes to the Language . 4 1.1.2 Changes to the APIs . 18 1.1.3 Changes to Configuration 23 1.1.4 Changes to the Object Model . 26 1.2 Release Notes 6.8 . 30 1.3 Release Notes 6.4 . 36 1.4 Release Notes 6.0 . 42 1.5 Release Notes 5.1 . 49 2. Quick Start 55 2.1 Install from AWS Marketplace with EMR . 55 2.2 Upgrade for AWS Marketplace with EMR . 62 3. Configure 62 3.1 Configure for AWS . 62 3.1.1 Configure for EC2 Role-Based Authentication . 68 3.1.2 Enable S3 Access . 70 3.1.2.1 Create Redshift Connections 81 3.1.3 Configure for EMR .
    [Show full text]
  • Delft University of Technology Arrowsam In-Memory Genomics
    Delft University of Technology ArrowSAM In-Memory Genomics Data Processing Using Apache Arrow Ahmad, Tanveer; Ahmed, Nauman; Peltenburg, Johan; Al-Ars, Zaid DOI 10.1109/ICCAIS48893.2020.9096725 Publication date 2020 Document Version Accepted author manuscript Published in 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS) Citation (APA) Ahmad, T., Ahmed, N., Peltenburg, J., & Al-Ars, Z. (2020). ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow. In 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS): Proceedings (pp. 1-6). [9096725] IEEE . https://doi.org/10.1109/ICCAIS48893.2020.9096725 Important note To cite this publication, please use the final published version (if applicable). Please check the document version above. Copyright Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons. Takedown policy Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim. This work is downloaded from Delft University of Technology. For technical reasons the number of authors shown on this cover page is limited to a maximum of 10. © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    [Show full text]
  • The Platform Inside and out Release 0.8
    The Platform Inside and Out Release 0.8 Joshua Patterson – GM, Data Science RAPIDS End-to-End Accelerated GPU Data Science Data Preparation Model Training Visualization Dask cuDF cuIO cuML cuGraph PyTorch Chainer MxNet cuXfilter <> pyViz Analytics Machine Learning Graph Analytics Deep Learning Visualization GPU Memory 2 Data Processing Evolution Faster data access, less data movement Hadoop Processing, Reading from disk HDFS HDFS HDFS HDFS HDFS Read Query Write Read ETL Write Read ML Train Spark In-Memory Processing 25-100x Improvement Less code HDFS Language flexible Read Query ETL ML Train Primarily In-Memory Traditional GPU Processing 5-10x Improvement More code HDFS GPU CPU GPU CPU GPU ML Language rigid Query ETL Read Read Write Read Write Read Train Substantially on GPU 3 Data Movement and Transformation The bane of productivity and performance APP B Read Data APP B GPU APP B Copy & Convert Data CPU GPU Copy & Convert Copy & Convert APP A GPU Data APP A Load Data APP A 4 Data Movement and Transformation What if we could keep data on the GPU? APP B Read Data APP B GPU APP B Copy & Convert Data CPU GPU Copy & Convert Copy & Convert APP A GPU Data APP A Load Data APP A 5 Learning from Apache Arrow ● Each system has its own internal memory format ● All systems utilize the same memory format ● 70-80% computation wasted on serialization and deserialization ● No overhead for cross-system communication ● Similar functionality implemented in multiple projects ● Projects can share functionality (eg, Parquet-to-Arrow reader) From Apache Arrow
    [Show full text]
  • When Program Analysis Meets Bytecode Search: Targeted and Efficient Inter-Procedural Analysis of Modern Android Apps in Backdroid
    When Program Analysis Meets Bytecode Search: Targeted and Efficient Inter-procedural Analysis of Modern Android Apps in BackDroid Daoyuan Wu1, Debin Gao2, Robert H. Deng2, and Rocky K. C. Chang3 1The Chinese University of Hong Kong 2Singapore Management University 3The Hong Kong Polytechnic University Contact: [email protected] Abstract—Widely-used Android static program analysis tools, be successfully analyzed by its underlying FlowDroid tool. e.g., Amandroid and FlowDroid, perform the whole-app inter- Although HSOMiner [53] increased the size of the analyzed procedural analysis that is comprehensive but fundamentally apps, the average app size is still only 8.4MB. Even for difficult to handle modern (large) apps. The average app size has these small apps, AppContext timed out for 16.1% of the increased three to four times over five years. In this paper, we 1,002 apps tested and HSOMiner similarly failed in 8.4% explore a new paradigm of targeted inter-procedural analysis that can skip irrelevant code and focus only on the flows of security- of 3,000 apps, causing a relatively high failure rate. Hence, sensitive sink APIs. To this end, we propose a technique called this is not only a performance issue, but also the detection on-the-fly bytecode search, which searches the disassembled app burden. Additionally, third-party libraries were often ignored. bytecode text just in time when a caller needs to be located. In this For example, Amandroid by default skipped the analysis of way, it guides targeted (and backward) inter-procedural analysis 139 popular libraries, such as AdMob, Flurry, and Facebook.
    [Show full text]
  • Arrow: Integration to 'Apache' 'Arrow'
    Package ‘arrow’ September 5, 2021 Title Integration to 'Apache' 'Arrow' Version 5.0.0.2 Description 'Apache' 'Arrow' <https://arrow.apache.org/> is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library. Depends R (>= 3.3) License Apache License (>= 2.0) URL https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/ BugReports https://issues.apache.org/jira/projects/ARROW/issues Encoding UTF-8 Language en-US SystemRequirements C++11; for AWS S3 support on Linux, libcurl and openssl (optional) Biarch true Imports assertthat, bit64 (>= 0.9-7), methods, purrr, R6, rlang, stats, tidyselect, utils, vctrs RoxygenNote 7.1.1.9001 VignetteBuilder knitr Suggests decor, distro, dplyr, hms, knitr, lubridate, pkgload, reticulate, rmarkdown, stringi, stringr, testthat, tibble, withr Collate 'arrowExports.R' 'enums.R' 'arrow-package.R' 'type.R' 'array-data.R' 'arrow-datum.R' 'array.R' 'arrow-tabular.R' 'buffer.R' 'chunked-array.R' 'io.R' 'compression.R' 'scalar.R' 'compute.R' 'config.R' 'csv.R' 'dataset.R' 'dataset-factory.R' 'dataset-format.R' 'dataset-partition.R' 'dataset-scan.R' 'dataset-write.R' 'deprecated.R' 'dictionary.R' 'dplyr-arrange.R' 'dplyr-collect.R' 'dplyr-eval.R' 'dplyr-filter.R' 'expression.R' 'dplyr-functions.R' 1 2 R topics documented: 'dplyr-group-by.R' 'dplyr-mutate.R' 'dplyr-select.R' 'dplyr-summarize.R'
    [Show full text]
  • Självständigt Arbete På Grundnivå Independent Degree Project - First Cycle
    Självständigt arbete på grundnivå Independent degree project - first cycle Computer Engineering A framework for communicating with Android apps from the browser Karl Lindström A framework for communicating with Android apps from the browser Karl Lindström 2014-05-29 MID SWEDEN UNIVERSITY Avdelningen för Informations- och kommunikationssystem (IKS) Examiner: Ulf Jennehag, [email protected] Supervisor: Magnus Eriksson, [email protected] Author: Karl Lindström, [email protected] Degree programme: International Bachelor's Programme in Computer Engi- neering, 180 credits Main field of study: Computer Engineering Semester, year: 1, 2014 ii A framework for communicating with Android apps from the browser Karl Lindström 2014-05-29 Abstract With the recent growth of the mobile market, companies want to target mobile devices while at the same time keeping product development costs low. One way to do this is to develop web applications, which are accessed from a mobile de- vice’s web browser, instead of native applications. The same web application can then be used on different platforms such as Android and iOS. However, devices such as smart phones and tablets often include cameras and sensors that a web ap- plication may want to access, but which are only accessible from native applica- tions. A framework was developed that enables web applications to communicate with native Android applications. Native applications are launched by clicking a link in the browser, and the result produced is made available to the web applica- tion through a HTTP POST request or a local web server running on the device. Key characteristics of the framework include ease of extension and the ability to enable secure (SSL) communication if desired.
    [Show full text]
  • In Reference to RPC: It's Time to Add Distributed Memory
    In Reference to RPC: It’s Time to Add Distributed Memory Stephanie Wang, Benjamin Hindman, Ion Stoica UC Berkeley ACM Reference Format: D (e.g., Apache Spark) F (e.g., Distributed TF) Stephanie Wang, Benjamin Hindman, Ion Stoica. 2021. In Refer- A B C i ii 1 2 3 ence to RPC: It’s Time to Add Distributed Memory. In Workshop RPC E on Hot Topics in Operating Systems (HotOS ’21), May 31-June 2, application 2021, Ann Arbor, MI, USA. ACM, New York, NY, USA, 8 pages. Figure 1: A single “application” actually consists of many https://doi.org/10.1145/3458336.3465302 components and distinct frameworks. With no shared address space, data (squares) must be copied between 1 Introduction different components. RPC has been remarkably successful. Most distributed ap- Load balancer Scheduler + Mem Mgt plications built today use an RPC runtime such as gRPC [3] Executors or Apache Thrift [2]. The key behind RPC’s success is the Servers request simple but powerful semantics of its programming model. client data request cache In particular, RPC has no shared state: arguments and return Distributed object store values are passed by value between processes, meaning that (a) (b) they must be copied into the request or reply. Thus, argu- ments and return values are inherently immutable. These Figure 2: Logical RPC architecture: (a) today, and (b) with a shared address space and automatic memory management. simple semantics facilitate highly efficient and reliable imple- mentations, as no distributed coordination is required, while then return a reference, i.e., some metadata that acts as a remaining useful for a general set of distributed applications.
    [Show full text]
  • Ray: a Distributed Framework for Emerging AI Applications
    Ray: A Distributed Framework for Emerging AI Applications Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley https://www.usenix.org/conference/osdi18/presentation/nishihara This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18). October 8–10, 2018 • Carlsbad, CA, USA ISBN 978-1-939133-08-3 Open access to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX. Ray: A Distributed Framework for Emerging AI Applications Philipp Moritz,∗ Robert Nishihara,∗ Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica University of California, Berkeley Abstract and their use in prediction. These frameworks often lever- age specialized hardware (e.g., GPUs and TPUs), with the The next generation of AI applications will continuously goal of reducing training time in a batch setting. Examples interact with the environment and learn from these inter- include TensorFlow [7], MXNet [18], and PyTorch [46]. actions. These applications impose new and demanding The promise of AI is, however, far broader than classi- systems requirements, both in terms of performance and cal supervised learning. Emerging AI applications must flexibility. In this paper, we consider these requirements increasingly operate in dynamic environments, react to and present Ray—a distributed system to address them. changes in the environment, and take sequences of ac- Ray implements a unified interface that can express both tions to accomplish long-term goals [8, 43].
    [Show full text]
  • Zero-Cost, Arrow-Enabled Data Interface for Apache Spark
    Zero-Cost, Arrow-Enabled Data Interface for Apache Spark Sebastiaan Jayjeet Aaron Chu Ivo Jimenez Jeff LeFevre Alvarez Chakraborty UC Santa Cruz UC Santa Cruz UC Santa Cruz Rodriguez UC Santa Cruz [email protected] [email protected] [email protected] Leiden University [email protected] [email protected] Carlos Alexandru Uta Maltzahn Leiden University UC Santa Cruz [email protected] [email protected] ABSTRACT which is a very expensive operation; (2) data processing sys- Distributed data processing ecosystems are widespread and tems require new adapters or readers for each new data type their components are highly specialized, such that efficient to support and for each new system to integrate with. interoperability is urgent. Recently, Apache Arrow was cho- A common example where these two issues occur is the sen by the community to serve as a format mediator, provid- de-facto standard data processing engine, Apache Spark. In ing efficient in-memory data representation. Arrow enables Spark, the common data representation passed between op- efficient data movement between data processing and storage erators is row-based [5]. Connecting Spark to other systems engines, significantly improving interoperability and overall such as MongoDB [8], Azure SQL [7], Snowflake [11], or performance. In this work, we design a new zero-cost data in- data sources such as Parquet [30] or ORC [3], entails build- teroperability layer between Apache Spark and Arrow-based ing connectors and converting data. Although Spark was data sources through the Arrow Dataset API. Our novel data initially designed as a computation engine, this data adapter interface helps separate the computation (Spark) and data ecosystem was necessary to enable new types of workloads.
    [Show full text]
  • ECSEL2017-1-737451 Fitoptivis from the Cloud to the Edge
    Ref. Ares(2019)3751938 - 12/06/2019 ECSEL2017-1-737451 FitOptiVis From the cloud to the edge - smart IntegraTion and OPtimisation Technologies for highly efficient Image and VIdeo processing Systems Deliverable: D5.1 - Components Analysis and Specification Due date of deliverable: (31-5-2019) Actual submission date: (11-06-2019) Start date of Project: 01 June 2018 Duration: 36 months Responsible: Zaid Al-Ars (Delft University of Technology) Revision: draft Dissemination level PU Public Restricted to other programme participants (including the Commission PP Service Restricted to a group specified by the consortium (including the RE Commission Services) Confidential, only for members of the consortium (excluding the CO Commission Services) WP5 D5.1, version 1.0 FitOptiVis ECSEL2017-1-737451 DOCUMENT INFO Author Author Company E-mail Francesca Palumbo UNISS [email protected] Carlo Sau, Tiziana UNICA [email protected]; Fanni, Luigi Raffo [email protected]; [email protected] Zaid Al-Ars TUDelft [email protected] Pablo Chaves SCHN [email protected] David Pampliega [email protected] Pablo Sánchez UC [email protected] Roman Čečil UWB [email protected] Document history Document Date Change version # V0.1 25/01/2019 Added table of content and initial inputs from survey V0.2 07/05/2019 Integrated partner contributions from Feb to Apr V0.3 14/05/2019 Integrated partner contributions till May 14 V0.4 22/05/2019 Integrated partner contributions till Brno meeting V0.5 29/05/2019 Integrated partner contributions till May 29
    [Show full text]