FITS Data Source for Apache Spark


Julien Peloton,∗ Christian Arnault, Stéphane Plaszczynski
LAL, Univ. Paris-Sud, CNRS/IN2P3, Université Paris-Saclay, Orsay, France

October 17, 2018

arXiv:1804.07501v2 [astro-ph.IM] 15 Oct 2018

∗ [email protected]

Abstract

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in the industry, but its use in the astronomical community still remains limited. We show how to manage complex binary data structures handled in astrophysics experiments, such as binary tables stored in FITS files, within a distributed environment. To this purpose, we first designed and implemented a Spark connector to handle sets of arbitrarily large FITS files, called spark-fits. The user interface is such that a simple file "drag-and-drop" to a cluster gives full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present the methodologies for measuring and tuning the performance bottlenecks for the workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData, located at Université Paris Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC and widely used in the scientific community.

1 Introduction

The volume of data recorded by current and future High Energy Physics & Astrophysics experiments, and their complexity, require a broad panel of knowledge in computer science, signal processing, statistics, and physics. Precise analysis of those data sets is a serious computational challenge, which cannot be done without the help of state-of-the-art tools. This requires sophisticated and robust analysis performed on many machines, as we need to process or simulate data sets several times. Among the future experiments, the Large Synoptic Survey Telescope (LSST [12]) will collect terabytes of data per observation night, and their efficient processing and analysis remains a major challenge.

In early 2010, a new tool entered the landscape of cluster computing: Apache Spark [13, 1, 2]. After starting as a research project at the University of California, Berkeley in 2009, it is now maintained by the Apache Software Foundation [14] and widely used in industry to deal with big data problems. Apache Spark is an open-source framework for data analysis mostly written in Scala, with application programming interfaces (API) for Scala, Python, R, and Java. It is based on the so-called MapReduce cluster computing paradigm [4], popularized by the Hadoop framework [15, 3], using implicit data parallelism and fault tolerance. In addition, Spark optimizes data transfers, memory usage, and communications between processes based on a graph analysis (DAG) of the tasks to perform, relying largely on functional programming. To fully exploit the capability of cluster computing, Apache Spark also relies on a cluster manager and a distributed storage system. Spark can interface with a wide variety of distributed storage systems, including the Hadoop Distributed File System (HDFS), Apache Cassandra [16], Amazon S3 [17], or even Lustre [18], traditionally installed on High-Performance Computing (HPC) systems. Although Spark is not an in-memory technology per se¹, it allows users to efficiently use an in-memory Least Recently Used (LRU) cache, relying on disk only when the allocated memory is not sufficient, and its popularity largely comes from the fact that it can be used at interactive speeds for data mining on clusters.

¹ Spark has pluggable connectors for different persistent storage systems but it does not have native persistence code.
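As a minimal illustration of this execution model (our sketch, not code from the paper), the Scala snippet below assembles lazy transformations into a DAG and pins the result in the in-memory cache; in the spark-shell the SparkSession named spark is already provided, so the builder lines can be skipped.

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a local Spark installation.
val spark = SparkSession.builder()
  .appName("lazy-dag-sketch")
  .master("local[*]")
  .getOrCreate()

// Transformations are lazy: nothing runs here, Spark only records
// the operations in a directed acyclic graph (DAG).
val squares = spark.range(0, 1000000L)
  .selectExpr("id * id AS sq")
  .filter("sq % 3 = 0")

squares.cache()              // mark the result for the in-memory (LRU) cache

// Actions trigger the actual execution of the DAG.
println(squares.count())     // first pass: computed, then cached
println(squares.count())     // second pass: served from memory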
When it started, usages of Apache Spark were often limited to naively structured data formats such as Comma Separated Values (CSV) or JavaScript Object Notation (JSON), driven by the need to analyze huge data sets from monitoring applications, log files from devices, or web data extraction, for example. Such simple data structures allow a relatively easy and efficient distribution of the data across machines. However, these file formats are usually not suitable for describing complex data structures, and often lead to poor performance. Over the past years, a handful of efficient data compression and encoding schemes have been developed by different companies, such as Avro [19] or Parquet [20], and interfaced with Spark. Each requires a specific implementation within the framework.

High Energy Physics & Astrophysics experiments typically describe their data with complex data structures that are stored in structured file formats such as, for example, Hierarchical Data Format 5 (HDF5) [21], ROOT files [22], or the Flexible Image Transport System (FITS) [23]. As far as connecting these formats directly to Spark is concerned, some efforts are being made to offer the possibility to load ROOT files [24] or HDF5 files (e.g. [25, 26]). For the FITS format, indirect efforts have been made (e.g. [7, 6, 8]) to interface the data with Spark, but to the best of our knowledge no native Spark connector to distribute the data across machines is available.

In this work, we investigate how to leverage Apache Spark to process and analyse future data sets in astronomy. We study the question in the context of the FITS file format [5, 9], introduced more than thirty years ago with the requirement that developments to the format must be backward compatible. This data format has been used successfully to store and manipulate the data of a wide range of astrophysical experiments over the years, hence it is considered as one of the possible data formats for future surveys like LSST. More specifically, we address a number of challenges in this paper, such as:

• How do we read a FITS file in Scala (Sec. 2.1)?
• How do we access the data of a distributed FITS file across machines (Sec. 2.2)?
• What is the effect of caching the data on IO performance in Spark (Secs. 3 & 5)?
• What is the behavior if there is insufficient memory across the cluster to cache the data (Sec. 3)?
• What is the impact of the underlying distributed file system (Sec. 5)?

To this purpose, we designed and implemented a native Scala-based Spark connector, called spark-fits, to handle sets of arbitrarily large FITS files. spark-fits is open source software released under an Apache-2.0 license and available as of April 2018 from https://github.com/astrolabsoftware/spark-fits.
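As a preview of the interface discussed in Sec. 2, reading a FITS extension with spark-fits goes through the standard Spark data source API. The sketch below follows the usage documented in the repository above to our understanding; the HDU index and file path are illustrative placeholders.

// Sketch of the Scala interface, assuming spark-fits is on the classpath
// and `spark` is an existing SparkSession; path and HDU index are illustrative.
val df = spark.read
  .format("fits")                        // spark-fits data source
  .option("hdu", 1)                      // index of the HDU holding the binary table
  .load("hdfs://.../galaxy_catalog.fits")

df.printSchema()                         // schema inferred from the FITS header keywords
df.show(5)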
The paper is organized as follows. We start with a review of the architecture of spark-fits in Sec. 2, and in particular the Scala FITSIO library to interpret FITS data in Scala, the Spark connector to manipulate FITS data in a distributed environment, and the different APIs (Scala, Java, Python, and R) allowing users to easily access the set of tools. In Sec. 3 we evaluate the performance of spark-fits on a medium-size cluster with various data set sizes, and we detail a few use cases using LSST-like simulated galaxy catalogs in Sec. 4. We discuss the impact of using spark-fits on a HPC system in Sec. 5, and conclude in Sec. 6.

2 FITS data source for Apache Spark

This section reviews the different building blocks of spark-fits. We first introduce a new library to interpret FITS data in Scala, and then detail the mechanisms of the Spark connector to manipulate FITS data in a distributed environment. Finally, we briefly give an example using the Scala API.

2.1 Scala FITSIO library

In this section we recall the main FITS format features of interest for this work. A FITS file is logically split into independent FITS structures called Header Data Units (HDU), as shown in the upper panel of Fig. 2. By convention, the first HDU is called primary, while the others (if any) are called extensions. All HDUs consist of one or more 2880-byte header blocks immediately followed by an optional sequence of associated 2880-byte data blocks. The header is used to define the types of data and protocols, and then the data is serialized into a more compact binary format. The header contains a set of ASCII text characters (decimal 32 through 126) which describes the following data blocks: type and structure (1D, 2D, or 3D+), size, physical information or comments about the data, and so on. The entire array of data values is then represented by a continuous stream of bytes starting at the beginning of the first data block.

spark-fits is written in Scala, like Spark. Scala is a relatively recent general-purpose programming language (2004) providing many features of functional programming, such as type inference, immutability, lazy evaluation, and pattern matching. While several packages are available to read and write FITS files in about 20 computer languages, none is written in Scala. There are Java packages that could have been used in this work, as Scala provides language interoperability with Java, but the structures of the different available Java packages are not meant to be used in a distributed environment and they lack functional programming features. Therefore we released within spark-fits a new Scala library to manipulate FITS files, compatible with Scala versions 2.10 and 2.11². The library is still in its infancy, and only a handful of the features included in the popular and sophisticated C library CFITSIO [27] is available.

² Scala versions are mainly not backwards-compatible.
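As an illustration of the kind of low-level work such a library performs, the standalone Scala sketch below (our illustration, not code from the spark-fits library; the object name and command-line path are hypothetical) walks through the primary header of a local FITS file, decoding successive 2880-byte blocks into 80-character ASCII cards until the mandatory END keyword.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Standalone sketch: print the primary header cards of a FITS file.
object FitsHeaderSketch {
  val BlockSize = 2880          // size of a FITS header (or data) block, in bytes
  val CardSize  = 80            // each header card is 80 ASCII characters

  def main(args: Array[String]): Unit = {
    val bytes = Files.readAllBytes(Paths.get(args(0)))   // path given on the command line

    // Decode successive blocks into cards until the mandatory END keyword.
    val cards = Iterator
      .from(0)
      .map(i => new String(bytes, i * BlockSize, BlockSize, StandardCharsets.US_ASCII))
      .flatMap(_.grouped(CardSize))
      .takeWhile(!_.startsWith("END"))
      .toList

    cards.filter(_.trim.nonEmpty).foreach(println)       // e.g. SIMPLE, BITPIX, NAXIS, ...
  }
}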