Flexible Network Analytics in the Cloud


Jon Dugan & Peter Murphy
ESnet Software Engineering Group
October 18, 2017, TechEx 2017, San Francisco

Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned

Architecture
[Architecture diagram; an annotated walkthrough appears under "Architecture Diagram" below.]

The Harsh Realities of Network Analytics
1. It's a mess
● Your data isn't neat and tidy
2. Things change
● What you need today may not be what you need tomorrow
3. There's always more
● More devices & more telemetry
4. It's never really done
● Time and money are limited

Coping strategies
1. It's a mess
● Design knowing things won't be tidy
2. Things change
● Keep raw data to keep your options open
3. There's always more
● Rely on the cloud for scaling
4. It's never really done
● "What" not "How"

netbeam: Network Analytics in Google Cloud
Three pillars:
1. Real time analytics
○ Low latency, incomplete
2. Offline analytics
○ High latency, complete
3. Flexible data model
○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam

What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local runner for testing
(Slide courtesy of the Apache Beam Project)

The Evolution of Apache Beam
[Lineage diagram: Google-internal systems (MapReduce, Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Flume, Millwheel) led to Google Cloud Dataflow, which in turn led to Apache Beam.]
(Slide courtesy of the Apache Beam Project)

Architecture Diagram
An SNMP collection system outside Google Cloud publishes measurements into the pipeline. An Apache Beam stream-processing job aligns samples and computes rates, writing raw data to BigQuery (immutable) and real-time series to Bigtable. The old SNMP system's history is imported into BigQuery (historical) via avro files. An Apache Beam batch-processing job computes rollups (5m, 1h, 1d averages) and percentiles into Bigtable. An API serves data from Bigtable to clients.

Ingest: Google Pub/Sub
● Uses Python outside of Google Cloud to poll devices and write to a Pub/Sub topic (a sketch follows below)
● Code within Google Cloud subscribes to the topic to process the data
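The deck doesn't include the poller code; the following is a minimal sketch of the publish side, assuming the google-cloud-pubsub client library. The project and topic names, the message fields, and the poll_counters() helper are all hypothetical, not netbeam's actual code.

# Sketch of the Python poller's publish side. Names and message layout
# are hypothetical; poll_counters() stands in for the real SNMP polling.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "snmp-raw")

def poll_counters(device):
    """Hypothetical helper: yields (ifName, inOctets, outOctets) per interface."""
    raise NotImplementedError

def publish_device(device):
    ts = int(time.time())
    for if_name, in_octets, out_octets in poll_counters(device):
        payload = json.dumps({
            "device": device,
            "ifName": if_name,
            "timestamp": ts,
            "inOctets": in_octets,
            "outOctets": out_octets,
        }).encode("utf-8")
        # publish() is asynchronous and returns a future.
        publisher.publish(topic_path, data=payload)

The stream-processing job then subscribes to the same topic, as described next.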
Stream processing: Apache Beam / Google Dataflow
● Subscribes to the Pub/Sub topic
● Raw data is written to BigQuery
● Real-time transformed data (e.g. aligned data rates) is written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)

Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write one day of data to a single row (the columns are the time of day; the key is metric and day)
● Fast access to a row by key; we can serve data from here
● We store one year

BigQuery
● Data warehousing solution
● Cheap storage and SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● We also store historical data (7 years), imported via avro files

Batch processing: Apache Beam / Google Dataflow
● Run with a cron job
● Recalculates the Bigtable data each night from the source of truth in BigQuery
● Processes Bigtable rows into new rows of 5 min, 1 hr, and 1 day aggregations (a sketch of the row layout follows below)
● Additional pre-computed views, e.g. percentiles for traffic distribution over a month
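To make the one-day-per-row layout concrete, here is a minimal sketch of writing a day of rates into a single Bigtable row with the google-cloud-bigtable Python client. The instance, table, column-family, and key names are assumptions patterned on the examples in this deck, not the production schema.

# Sketch: one day of 5-minute rates in one row. The column qualifier is the
# time of day (seconds since midnight); the row key is metric and day.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("netbeam")
table = instance.table("timeseries")

def write_day(metric, day, samples):
    """samples: dict of seconds-since-midnight -> rate in bits per second."""
    row_key = "{}::{}".format(metric, day).encode("utf-8")
    row = table.direct_row(row_key)
    for offset, rate in sorted(samples.items()):
        row.set_cell("rates",
                     str(offset).encode("utf-8"),
                     str(rate).encode("utf-8"))
    row.commit()  # one commit per row; 288 five-minute cells per day

write_day("rollup-5m::$interface::in", "2017-08-01",
          {0: 6.0e9, 300: 5.2e9, 600: 4.8e9})

Serving a day of data is then a single-row read by key, which is what makes Bigtable fast enough to back the API.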
API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as 'tiles'; each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC API service
● Looking forward to a grpc-web solution

Use case example: Historical Trends
The SNMP collection system streams to BigQuery, per-day interface totals and per-month totals are rolled up into Bigtable, and the old SNMP system's history arrives in BigQuery (historical) via avro files. Example Bigtable rows served by the data server (node.js):

snmp-daily::2017-08::$interface    Jan 1: 1.8 Pb | Jan 2: 1.9 Pb | ... | Dec 31: 3.1 Pb
snmp-monthly-totals                Jan 1991: 28 Gb | Feb 1991: 29 Gb | ... | Sep 2017: 56 Pb

Use case: real time anomaly detection
Baseline generation computes the average for each interface over the past 3 months for that hour and day. Anomaly detection compares the baseline to real-time values to generate the current deviation from normal. Example Bigtable rows:

baseline::5m::avg::$interface    Mon 12am: 2.1 | Mon 1am: 1.9 | Mon 2am: 0.3 | ... | Sun 11pm: 0.5
anomaly::5m::avg                 iface-1: +0.1 | iface-2: +2.0 | ... | iface-n: -1.5

Use case example: Percentiles
Daily rollups of 5-minute averages are streamed to Bigtable, and a percentiles job summarizes each month. Example Bigtable rows:

rollup-month-5m::2017-08::$interface::in    1: 6 Gbps | 2: 5 Gbps | ... | 8640: 2 Gbps
percentiles::2017-08::$interface::in        1 pct: 0.1 Gbps | 2 pct: 0.3 Gbps | ... | 99 pct: 22.1 Gbps

Demo

Example: Computing Total Traffic

# Python Beam SDK
import apache_beam as beam
from apache_beam.io import ReadFromText

pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())                 # parse CSV lines into dicts
 | 'ifName key' >> beam.Map(group_by_device_interface)  # key by device + interface
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)         # counters to rates
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()

Full code, including FormatCSVDoFn, group_by_device_interface, and compute_rate, is available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/

Our Stack
● Apache Beam using Scio
● Google Cloud Platform
○ Dataflow
○ Bigtable
○ BigQuery
○ Pub/Sub
○ App Engine
○ Cloud Endpoints
● Languages
○ Scala
○ Javascript / Typescript
○ Python

Current Status & Future Plans
Current: alpha version for SNMP data
● Ingest to BigQuery is working
● Migration of historical data is implemented; awaiting final details before full conversion
● Streaming ingest to Bigtable is still in process
● Early version of utilization visualization
● A simple data server can provide data to clients, but a gRPC API is coming
● Interface timeseries charts are functional
Future
● More types of data: flow data, perfSONAR
● Machine learning
● Anomaly detection
● "Mash up" of various data sources

Why not InfluxDB, Elastic, or ${FAVORITE_DB}?
● We have a data processing problem, not a data storage problem per se
○ Beam and the ecosystem around it give a huge amount of flexibility; we can try new ideas as they occur to us
○ Ability to move to different
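Much of that flexibility comes from the runner list under "What is Apache Beam?": the same pipeline code can target another processing backend by changing launch options. A minimal sketch of that mechanism follows; the project, bucket, and region flags are hypothetical, and this illustrates Beam's standard PipelineOptions, not netbeam's actual deployment configuration.

# Sketch: one pipeline, runner chosen at launch time.
#   Local test:  python pipeline.py --runner=DirectRunner
#   Dataflow:    python pipeline.py --runner=DataflowRunner \
#                  --project=my-project --region=us-west1 \
#                  --temp_location=gs://my-bucket/tmp
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    options = PipelineOptions(argv)
    # The pipeline runs on whichever backend the options select.
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create([('iface-1', 10), ('iface-1', 5), ('iface-2', 7)])
         | beam.CombinePerKey(sum)   # total per interface
         | beam.Map(print))

if __name__ == '__main__':
    run()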