Topic: Visual Analytics of IoT Data and Data Traffic
Paolo Nesi, Gianni Pantaleo

DISIT Lab, Distributed Data Intelligence and Technologies Distributed Systems and Internet Technologies Department of Information Engineering (DINFO) http://www.disit.dinfo.unifi.it

Course: Big Data Architecture, School of Engineering, Florence (Scuola di Ingegneria di Firenze)

DISIT Lab
Dipartimento di Ingegneria dell’Informazione, DINFO
Università degli Studi di Firenze
Via S. Marta 3, 50139, Firenze, Italy
Tel: +39-055-2758517, fax: +39-055-2758570
http://www.disit.org


SUMMARY

1. Introduction
   • IoT/IoE in evolving Smart City environments
2. Solutions for IoT data traffic and flow Visual Analytics
   • The Snap4City architecture
   • AMMA and DevDash Tools
3. IoT Data Flows Management
   • IoT Brokers & Communication Protocols
   • Apache NiFi
4. Distributed Storage and Indexing
   • Configure Zookeeper
   • HBase storing & Solr indexing
5. Producing Visual Analytic Tools
   • Banana Dashboards


1. Introduction – IoT/IoE in evolving Smart City Environments

• Current State of the Art: Goals, Issues & Solutions

Goals:
- Users may easily access, read and monitor ingested data, build applications and dashboards, and visualize, process and perform different kinds of analytics on data.
- Detect potential problems and anomalies in personal data traffic (Quality of Service).

Issues:
- Problems due to discontinuity and loss of data by data-driven applications.
- Lack of tools which efficiently monitor data traffic from devices and applications.
- High costs.
- Requiring users with programming skills.

Solutions:
- Continuously storing the last values from all connected devices (IoT, Smart City sensors, etc.) and collecting historical trends (Data Shadow).
- Production of visual, easy-to-create tools which quantitatively monitor message/data flows and traffic (real-time and historical traffic trends).
- Providing different kinds of data analytics and visual dashboards.

[Figure: Context Broker pattern. A Context Producer in the field updates or publishes context to the Context Broker; a Context Consumer subscribes to or queries the broker and acts on the field.]

1. Introduction – IoT Data Trend & Issues

• Typical use case: personal device, data collection and visualization

[Figure: typical data flow. IoT devices (sensors, actuators, etc.) and IoT apps & services register, send queries and actions, and publish/subscribe/update through IoT Brokers (AMQP, NGSI, etc.). Data flows are optimized and enriched, then stored and indexed (last data and historical data) so that data analytics, visual tools, dashboards and user apps can visualize them.]

• Each read or act operation produces several calls and messages within the IoT infrastructure.

[Figure: calls and messages exchanged between IoT Devices and IoT Brokers]

• Large-scale deployment: rapid and huge growth of connections, data flows, messages, etc.

• The number of IoT devices exceeds the number of people (world population ≈ 7 billion).

Reference source: iot-analytics.com

1. Introduction – Major IoT Platforms

• Azure IoT
• Google IoT
• Amazon AWS

1. Introduction – Big Data Flow Ingestion, Collection & Management

Requirements:
• Ingestion of big IoT data and logs, and data flow management among many different kinds of devices, broker protocols (MQTT, NGSI, etc.) and user applications.
• Support for different communication modalities: push (event-driven messages), pull, polling (periodically scheduled requests), HTTP listening, etc.
• Buffering and queuing management of IoT data flows, fault tolerance, data provenance and replication.
• Persistent storage (data shadow) and indexing of IoT data and logs.
• Production of visual tools to display charts about data analytics, temporal trends, etc.

2. Solutions for IoT Data Traffic and Flow Visual Analytics – The Snap4City Architecture

https://www.snap4city.org

2. Solutions for IoT Data Traffic and Flow Visual Analytics – AMMA and DevDash Tools

DevDash: Developer Dashboard
http://devdash.snap4city.org
• Data value control: collection, enrichment and indexing of data from IoT devices.
• Drill down on data, time, time trends, facet filtering, geo-spatial faceting.
• Apply filters until the desired data view is reached.
• View data on a map, down to single-device resolution.
• Download data details.

AMMA: Application & Microservice Monitor & Analyzer service
http://amma.snap4city.org
• Data flow control tool for real-time monitoring and analysis of traffic and communication flows (IoT devices and applications).
• Many different kinds of data analytics and visualization functionalities.
• Drill down on data, time, time trends, facet filtering, geo-spatial faceting.
• Origin/destination from/to external/local IP.

3. IoT Data Flows Management – IoT Brokers and Communications Protocols

MQTT Protocol

MQTT (Message Queuing Telemetry Transport) is a lightweight messaging protocol designed for M2M (machine-to-machine) communication, released in 2010. It implements a publish-subscribe messaging mechanism involving three main actors:
• Publishers, which produce data and send it to a broker.
• Subscribers, which subscribe to a topic of interest and receive a notification when a new message for that topic is available.
• The Broker, which filters data by topic and distributes it to subscribers.

[Figure: MQTT Broker. Publishers send messages to a topic queue on the broker, which delivers them to the Subscribers of that topic.]
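The publish-subscribe roles described above can be sketched in a few lines of Python. This is an in-memory illustration only, not an MQTT client: there is no network, QoS level or wildcard matching, and the class name `TinyBroker` and the topic names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, str], None]


class TinyBroker:
    """In-memory stand-in for a broker: routes messages by exact topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        # A subscriber registers interest in one topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        # The broker filters by topic and notifies only matching subscribers.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


received = []
broker = TinyBroker()
broker.subscribe("city/temperature", lambda topic, payload: received.append((topic, payload)))

delivered = broker.publish("city/temperature", "21.5")  # one subscriber is notified
ignored = broker.publish("city/humidity", "40")         # nobody subscribed to this topic
```

Publishers and subscribers never reference each other directly; they only share a topic name, which is what lets devices and applications be added or removed independently.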

AMQP Protocol

The Advanced Message Queuing Protocol (AMQP) is an open-standard protocol for message-oriented applications. Like MQTT, it provides a publish/subscribe mechanism. It also supports system interoperability in distributed environments thanks to an Exchange module, which receives publisher messages and distributes them to queues based on predefined rules and conditions.

[Figure: AMQP Broker. Publishers send messages to the Exchange, which routes them to Queues consumed by Subscribers.]
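The role of the Exchange can be illustrated with a minimal in-memory sketch. It assumes the simplest routing rule (a message is copied to every queue bound with exactly the same routing key); real AMQP brokers offer further exchange types. The class `TinyExchange`, the queue names and the routing keys are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class TinyExchange:
    """Copies each message to every queue bound with a matching routing key."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {}
        self._bindings: List[Tuple[str, str]] = []  # (routing_key, queue_name)

    def bind(self, queue_name: str, routing_key: str) -> None:
        # Create the queue on first use, then record the routing rule.
        self.queues.setdefault(queue_name, deque())
        self._bindings.append((routing_key, queue_name))

    def publish(self, routing_key: str, body: str) -> None:
        # Publishers talk only to the exchange, never to a queue directly.
        for key, queue_name in self._bindings:
            if key == routing_key:
                self.queues[queue_name].append(body)


exchange = TinyExchange()
exchange.bind("alerts", routing_key="sensor.alarm")
exchange.bind("archive", routing_key="sensor.alarm")
exchange.bind("archive", routing_key="sensor.reading")

exchange.publish("sensor.alarm", "smoke detected")
exchange.publish("sensor.reading", "21.5")
```

Here the alarm message lands in both queues, while the plain reading is only archived; subscribers then consume from their own queue at their own pace.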

3. IoT Data Flows Management – Apache NiFi

What is NiFi?
• NiFi (short for “Niagara Files”) is an open-source dataflow tool that can collect, route, enrich, transform and process data in a scalable manner.
• It is a processing engine based on the concepts of flow-based programming (FBP), designed to manage the flow of information in an ecosystem.

Why NiFi?
• Open source
• Scalable, extensible platform
• Visual web interface
• Provides data provenance
• Highly configurable
• Data-source agnostic
× No data replication

NiFi is NOT:
• A distributed computation engine.
• A complete ETL tool.
• A persistent data storage tool. It only holds data temporarily for re-run / data provenance purposes.
• A document indexer. Its indexing capabilities are only meant to help in troubleshooting / debugging.

FBP terms and the corresponding NiFi terms:

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows the creation of an entirely new component simply by composition of its components.

NiFi Features
• Highly configurable and extensible
  - Trade-off between low latency and high throughput
  - Modify the dataflow at runtime
  - Build custom processors
  - Develop single components that can be reused and combined to make more complex flows

• Data buffering and queueing
  - Back-pressure management
  - Buffering of queued data
  - Custom prioritization schemes for queues
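Back-pressure on a buffered connection can be sketched with a bounded queue from the Python standard library. This only illustrates the idea: in this sketch a full connection refuses new items, whereas NiFi itself reacts by no longer scheduling the upstream component. The class name `Connection` and the threshold value are hypothetical.

```python
import queue


class Connection:
    """Bounded buffer between two processing steps."""

    def __init__(self, backpressure_threshold: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=backpressure_threshold)

    def offer(self, item: str) -> bool:
        # Returns False instead of blocking when the threshold is reached,
        # which tells the producer to slow down.
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def poll(self) -> str:
        # The consumer drains the buffer in arrival order.
        return self._queue.get_nowait()


conn = Connection(backpressure_threshold=2)
accepted = [conn.offer(f"msg-{i}") for i in range(3)]  # third offer hits the threshold
first = conn.poll()                                    # consumer frees one slot
accepted_after_drain = conn.offer("msg-3")             # producer may continue
```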

• Security and data recovery
  - Content encryption, communication over secure protocols (SSL, SSH, HTTPS)
  - Role-based authentication/authorization mechanism for both data transfer and user management

• Data provenance
  - NiFi records and indexes fine-grained data provenance details as objects flow through the system, making them accessible for display.

• Web-based interface
  - Drag and drop processors to build a flow
  - Start, stop, and configure components in real time
  - View errors and corresponding error messages
  - View statistics and health of the dataflow
  - Create templates of common processors and connections

NiFi Architecture
• NiFi is a Java-based system that executes within a JVM on a host operating system.
• Primary components:
  - Web Server: hosts NiFi's HTTP-based control API.
  - Flow Controller: provides and schedules threads for execution.
  - Extensions: Processors, Controller Services, etc.
  - Repositories: FlowFile (state of a given FlowFile), Content (actual content bytes), Provenance.

[Figure: NiFi architecture. Inside the OS/host JVM, the Web Server and the Flow Controller (Processor 1 ... Extension N) sit on top of the FlowFile, Content and Provenance Repositories, which are kept on local storage.]

NiFi Distributed Architecture

[Figure: NiFi distributed architecture]

NiFi General Installation
• Download page: http://nifi.apache.org/download.html
• Two packages are available: Linux (tarball) and Windows (zip).
• Download the appropriate package and extract it to the location from which you want to run the application.
• Mac OS X users may also use the tarball, or install via Homebrew by running:
    $ brew install nifi

Install NiFi as a Service (Linux)
• Navigate to the NiFi installation folder and run:
    $ bin/nifi.sh install
  This installs the service (the default service name is nifi).

NiFi Windows Installation Tips
• Install NiFi and the (required) Java packages into, for instance, C:/nifi and C:/java respectively, instead of C:/Program Files, to avoid read-only restrictions.

NiFi Execution (Linux/MacOS)
• Navigate to the NiFi installation folder. To run NiFi in the foreground, run:
    $ bin/nifi.sh run
• Use Ctrl-C to stop the application.
• To run NiFi in the background, run:
    $ bin/nifi.sh start
• To stop the application, run:
    $ bin/nifi.sh stop
• Start NiFi as a service:
    $ sudo service nifi start
• Stop the NiFi service:
    $ sudo service nifi stop
• Check the running status of the NiFi service:
    $ sudo service nifi status

NiFi Execution (Windows)
• Navigate to the NiFi installation folder.
• Execute the bin/run-nifi.bat file.
• To stop the application, select the window that was launched and press Ctrl-C.

Use NiFi Web-based Interface (All platforms)
• Open a web browser and navigate to http://localhost:8080/nifi
• Port 8080 is the default port and can be changed by editing the nifi.properties file in the NiFi configuration directory.
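As a small illustration of where the port comes from, the sketch below parses properties-style text and builds the web interface URL. It assumes the port is stored under the key `nifi.web.http.port`; the sample text is a hand-written excerpt rather than a full nifi.properties file, and `parse_properties` is a hypothetical helper.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hand-written excerpt; a real installation would read conf/nifi.properties instead.
sample = """
# web properties #
nifi.web.http.host=
nifi.web.http.port=8080
"""

props = parse_properties(sample)
port = int(props.get("nifi.web.http.port") or 8080)  # fall back to the default port
ui_url = f"http://localhost:{port}/nifi"
```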

NiFi Main Components: FlowFiles
• A NiFi FlowFile consists of two parts: a header containing the attributes, and the content (payload).
• Attributes can be referenced via the NiFi Expression Language.
• The payload is typically the actual data being routed through the dataflow, and can also be referenced by specific processors.

[Figure: an HTTP document mapped to a NiFi FlowFile, with Header (Attributes) and Content (Payload)]

• FlowFiles can be created, copied, cloned, merged, split, modified, deleted, etc.
• FlowFile attributes are a map of key/value string pairs.
• Every FlowFile carries a set of default attributes; custom attributes can then be added.
• Default attributes:
  - filename: a filename that can be used when storing data locally or on a remote system.
  - path: the directory that can be used when storing data.
  - uuid: a Universally Unique Identifier for each single FlowFile.
  - entryDate: the date and time at which the FlowFile entered the system.
  - lineageStartDate: the date and time at which the oldest ancestor of the FlowFile entered the system.
  - fileSize: the size, in bytes, of the FlowFile’s content.
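The attribute/content split can be mirrored by a small Python data structure. This is a conceptual model only, not NiFi's internal representation: the class `FlowFile` below, the placeholder default values and the custom attribute `deviceType` are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlowFile:
    content: bytes                                              # the payload
    attributes: Dict[str, str] = field(default_factory=dict)   # the header

    def __post_init__(self) -> None:
        now_ms = str(int(time.time() * 1000))
        defaults = {
            "filename": "unnamed",      # placeholder default, an assumption
            "path": "./",
            "uuid": str(uuid.uuid4()),
            "entryDate": now_ms,
            "lineageStartDate": now_ms,
        }
        # Custom attributes are added on top of the default ones.
        self.attributes = {**defaults, **self.attributes}
        # fileSize is always derived from the content itself.
        self.attributes["fileSize"] = str(len(self.content))


ff = FlowFile(
    content=b'{"temp": 21.5}',
    attributes={"filename": "reading.json", "deviceType": "sensor"},
)
```

All attribute values are strings, matching the key/value string map described above, while the content stays an opaque sequence of bytes.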

NiFi Main Components: Processors
• The FlowFile Processor is the actual working NiFi component. It can perform a large variety of tasks and actions, such as: listen for incoming data; pull data from external sources; publish data to external sources; route, transform, or extract information from FlowFiles, etc.
• Examples of built-in FlowFile Processors:
  - Data Ingress (Ingestion)
    GetFile: pulls content from the local disk and deletes the original file.
    GetSFTP: pulls content from a remote system.
  - Routing
    RouteOnAttribute: routes FlowFiles based on the values of specific FlowFile attributes.
    RouteOnContent: routes FlowFiles based on their content.
  - Data Transformation
    CompressContent: compresses or decompresses content.
    ReplaceText: uses regular expressions to modify textual content.
  - Data Egress
    PutFile: writes the FlowFile content to a directory on the local disk.
    PutSFTP: copies the FlowFile content to a remote server.
  - Attribute Extraction
    UpdateAttribute: adds or updates attributes using statically defined values, or values derived dynamically using NiFi’s Expression Language.
    ExtractText: creates attributes based on user-defined regular expressions.
  - Splitting and Aggregation
    UnpackContent: unpacks archive formats such as TAR and ZIP, and sends each file within the archive as a separate FlowFile through the dataflow.
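To make attribute-based routing more concrete, here is a rough Python analogue of what a routing step does. It is not the RouteOnAttribute implementation: FlowFiles are reduced to plain attribute dictionaries, rules are ordinary Python predicates rather than Expression Language, and the rule names and the `deviceType` attribute are hypothetical.

```python
from typing import Callable, Dict, List

Attributes = Dict[str, str]
Rule = Callable[[Attributes], bool]


def route_on_attribute(
    flowfiles: List[Attributes], rules: Dict[str, Rule]
) -> Dict[str, List[Attributes]]:
    """Send each item to every route whose rule matches, else to 'unmatched'."""
    routed: Dict[str, List[Attributes]] = {name: [] for name in rules}
    routed["unmatched"] = []
    for attrs in flowfiles:
        matched = [name for name, rule in rules.items() if rule(attrs)]
        for name in matched:
            routed[name].append(attrs)
        if not matched:
            routed["unmatched"].append(attrs)
    return routed


flowfiles = [
    {"filename": "a.json", "deviceType": "sensor"},
    {"filename": "b.csv", "deviceType": "actuator"},
    {"filename": "c.txt"},
]
rules = {
    "sensors": lambda a: a.get("deviceType") == "sensor",
    "json": lambda a: a.get("filename", "").endswith(".json"),
}
routed = route_on_attribute(flowfiles, rules)
```

Note that the decision uses only the attributes (the header); the content is never inspected, which is what distinguishes this kind of routing from content-based routing.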

Apache NiFi User Guide: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html

NiFi Main Components: Processors Creation
• Open the NiFi web-based user interface. Elements can be added to the dataflow through the Components Toolbar.
• To add a FlowFile Processor, drag and drop the processor icon onto the dataflow canvas. A dialog then lets the user choose which type of Processor to use.

NiFi Main Components: Processors Configuration
• Contextual menu for Processor configuration:
  - Configure: opens the configuration tabs.
  - Start or Stop: starts or stops the Processor, depending on its current state.
  - Enable or Disable: enables or disables the Processor, depending on its current state.
  - View data provenance: displays the NiFi Data Provenance table, with information about data provenance events for the FlowFiles routed through that Processor.
  - View status history: shows a graphical representation of the Processor’s statistical information over time.
  - View usage: shows the Processor’s usage documentation.
  - Center in view: centers the view of the canvas on the given Processor.
  - View connections (Upstream/Downstream): lets the user see and "jump to" the connections coming into (upstream) or going out of (downstream) the Processor. This is particularly useful when processors connect into and out of other Process Groups.
  - Change color: changes the color of the Processor.
  - Create template: creates a template from the selected Processor.
  - Copy: places a copy of the selected Processor on the clipboard, so that it can be pasted elsewhere on the canvas by right-clicking on the canvas and selecting Paste. Copy/Paste can also be done with the keystrokes Ctrl-C (Command-C) and Ctrl-V (Command-V).
  - Delete: lets the DataFlow Manager (DFM) delete the Processor from the canvas.

NiFi Main Components: Processor Configuration - Settings
• Processors are configurable. Right-clicking on the Processor and choosing the «Configure» option presents four configuration tabs: Settings, Scheduling, Properties and Comments.

Settings - This tab allows you to:
• Manage Penalty and Yield durations
• Rename the processor
• Enable/Disable the processor
• Set the Bulletin Level for error and warning notifications

NiFi Main Components: Processor Configuration - Scheduling

Scheduling - This tab allows you to set different scheduling strategies:
• Timer driven: this is the default mode. The Processor is scheduled to run at a regular interval, set by the “Run Schedule” parameter. The “Run Duration” parameter defines how long the Processor should run each time it is triggered (trading low latency against high throughput).
• Event driven: the Processor is triggered to run by an event, which occurs when FlowFiles enter the Connections feeding this Processor. This mode is experimental and is not supported by all Processors.
• CRON driven: the Processor is scheduled to run periodically, similarly to the Timer driven mode, but with more flexible configuration options.

NiFi Main Components: Processor Configuration ‐ Comments

Comments ‐ This tab simply allows users to add comments.


NiFi Main Components: Processor Configuration ‐ Properties

Properties ‐ This tab allows you to configure the Processor’s specific properties. If the processor allows custom properties to be configured, the user can click the plus sign in the top‐right to add them. Some properties allow for the NiFi Expression Language.

NiFi Expression Language documentation: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#types

NiFi Main Components: Processor Configuration – NiFi Expression Language

• The NiFi Expression Language is the framework for defining and referring to attributes (metadata).
• An attribute is referenced by enclosing its name between a leading ${ and a trailing }, for example ${inputFilePath}.
• Additional terms (functions) can be added for attribute manipulation, transformation and logic expressions, for example:
  - Check for substring matching within attributes:
      ${fileName:contains('Nifi')}
  - Append a string:
      ${outputPath:append('/new_directory')}
  - Parse dates (dd is the day of the month; an uppercase DD would mean day of the year):
      ${string_date:toDate("yyyy-MM-dd")}
  - Mathematical operations:
      ${totalAmount:minus(5)}
  - Multi-variable comparison (gt means “greater than”):
      ${variable_one:gt(${variable_two})}
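For readers more familiar with general-purpose languages, the five example expressions correspond roughly to the plain Python operations below. This is only an analogy to show what each function computes: NiFi evaluates the expressions itself, and the attribute values used here are invented for illustration.

```python
from datetime import datetime

# Invented attribute values; in NiFi these would come from the FlowFile header.
attributes = {
    "fileName": "Nifi_report.csv",
    "outputPath": "/data/out",
    "string_date": "2019-03-15",
    "totalAmount": "42",
    "variable_one": "10",
    "variable_two": "7",
}

# contains('Nifi'): substring check
contains_nifi = "Nifi" in attributes["fileName"]

# append('/new_directory'): string concatenation
new_path = attributes["outputPath"] + "/new_directory"

# toDate with a year-month-day pattern: parse a string into a date
parsed_date = datetime.strptime(attributes["string_date"], "%Y-%m-%d")

# minus(5): attributes are strings, so convert before doing arithmetic
reduced = int(attributes["totalAmount"]) - 5

# gt(...): numeric comparison between two attributes
is_greater = int(attributes["variable_one"]) > int(attributes["variable_two"])
```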

NiFi Main Components: NiFi Expression Language built-in functions

• Boolean Logic: isNull, notNull, isEmpty, equals, equalsIgnoreCase, gt, ge, lt, le, and, or, not, ifElse
• String Manipulation: toUpper, toLower, trim, substring, substringBefore, substringBeforeLast, substringAfter, substringAfterLast, getDelimitedField, append, prepend, replace, replaceFirst, replaceAll, replaceNull, replaceEmpty, length
• Encode/Decode Functions: escapeJson, escapeXml, escapeCsv, escapeHtml3, escapeHtml4, unescapeJson, unescapeXml, unescapeCsv, unescapeHtml3, unescapeHtml4, urlEncode, urlDecode, base64Encode, base64Decode
• Searching: startsWith, endsWith, contains, in, find, matches, indexOf, lastIndexOf, jsonPath
• Mathematical Operations and Numeric Manipulation: plus, minus, multiply, divide, mod, toRadix, fromRadix, random, math
• Date Manipulation: format, toDate, now

Processor Configuration – JSONPath Expression Language

Properties - When a Processor needs to reference fields inside JSON content, JSONPath expressions are used. In JSONPath, $ represents the root of the JSON hierarchy, and the names of the nested fields are appended to it to reach a value.

Similarly, XPath expressions (as used by the EvaluateXPath processor) are provided for referencing XML.
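The root-plus-field-names idea can be sketched with a tiny resolver. It deliberately supports only the dotted-field subset of JSONPath (no array indices, wildcards or filters), so it is a teaching aid rather than a JSONPath implementation; the function name `resolve_simple_path` and the sample message are hypothetical.

```python
import json
from typing import Any


def resolve_simple_path(document: Any, path: str) -> Any:
    """Resolve a dotted path such as '$.device.id' against parsed JSON."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$' (the root)")
    current = document
    for segment in path[1:].split("."):
        if not segment:
            continue  # skips the empty piece right after '$'
        current = current[segment]  # descend one nested field
    return current


message = json.loads(
    '{"device": {"id": "sensor-42", "value": {"temperature": 21.5}}}'
)

device_id = resolve_simple_path(message, "$.device.id")
temperature = resolve_simple_path(message, "$.device.value.temperature")
root = resolve_simple_path(message, "$")
```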

NiFi Main Components: Process Groups

Process Groups are used to logically group a set of components so that the dataflow is easier to understand and maintain. A Process Group is a set of processors and their connections. It can receive data via input ports and send data via output ports.

Labels: draggable colored areas that can be used to visually highlight and differentiate, for instance, different logical flows of Processors and Process Groups. They can also add documentation text to Processors and Process Groups.

[Figure: navigating into a Process Group and back to the main flow]

NiFi Main Components: Process Groups and Remote Process Groups Creation
• Open the NiFi web-based user interface. Elements can be added to the dataflow through the Components Toolbar.
• To add a Process Group, drag and drop the Process Group icon onto the dataflow canvas.
• Remote Process Group (RPG): Remote Process Groups are particular Process Groups which reference remote instances of NiFi. When an RPG is dragged onto the canvas, the user is prompted for the URL of the remote NiFi instance. If the remote NiFi is a clustered instance, the URL to use is the URL of any NiFi instance in that cluster.


NiFi Main Components: Input and Output Ports

Input Ports provide a mechanism for transferring data into a Process Group.

Output Ports provide a mechanism for transferring data from a Process Group to destinations outside of the Process Group.

NiFi Main Components: Funnels, Templates and Labels

Funnels are used to combine the data from many Connections into a single Connection. Connections can be configured with FlowFile Prioritizers, so a Funnel makes it possible to prioritize all the data on that single Connection, rather than prioritizing the data on each Connection independently.
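The effect of funnelling several connections into one prioritized connection can be sketched with a heap. The sketch assumes a scheme in which a numeric "priority" attribute decides the order (lowest value first), items without it go last, and ties keep arrival order; the function name `funnel` and the sample data are hypothetical, and NiFi's own prioritizers may behave differently in detail.

```python
import heapq
from typing import Dict, List, Tuple

Attributes = Dict[str, str]


def funnel(connections: List[List[Attributes]]) -> List[Attributes]:
    """Merge several connections into one queue and drain it by priority."""
    heap: List[Tuple[float, int, Attributes]] = []
    arrival = 0
    for connection in connections:
        for flowfile in connection:
            # Missing priority sorts after every explicit priority (assumption).
            priority = float(flowfile["priority"]) if "priority" in flowfile else float("inf")
            # The arrival counter breaks ties, so dictionaries are never compared.
            heapq.heappush(heap, (priority, arrival, flowfile))
            arrival += 1
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


conn_a = [{"id": "a1", "priority": "3"}, {"id": "a2"}]
conn_b = [{"id": "b1", "priority": "1"}, {"id": "b2", "priority": "3"}]

ordered_ids = [ff["id"] for ff in funnel([conn_a, conn_b])]
```

Because the ordering is applied after the merge, an urgent item from the second connection overtakes everything from the first, which would not happen if each connection were prioritized on its own.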

Templates can be created from sections of the flow, or they can be imported from other dataflows. Several components may be combined to make a larger building block to be included in a dataflow. Templates can also be exported as XML and imported into another NiFi instance.

Labels are draggable colored areas that can be used to visually highlight and differentiate, for instance, different logical flows of Processors and Process Groups. They can also add documentation text to Processors and Process Groups.

NiFi Main Components: Controller Services

Controller Services are shared services that reporting tasks, processors, and other services can use for configuration or task execution. To add a Controller Service for a reporting task, select Controller Settings from the Global Menu.


Controller Services are, in a simplified view, packages of configuration parameters and code that usually perform some actions in the background. Some examples are:

 Connections to external services, for instance databases and APIs, where the controller service encapsulates the connection parameters.
 Reporting Tasks that send statistics about NiFi on a regular basis, for example to a monitoring service.
 Sharing state between processors and cluster nodes, for instance with cache services.

NiFi Main Components: Controller Services Configuration ‐ General

 The Controller Services window has four configuration tabs: General, Reporting Task Controller Services, Reporting Tasks and Registry Clients.

The General tab provides settings for the overall maximum thread counts of the instance.

NiFi Main Components: Controller Services Configuration – Reporting Task Controller Services
In the Reporting Task Controller Services tab, it is possible to create new Controller Services by clicking the "+" button in the upper‐right corner.

Once you have added a Controller Service, you can configure it by clicking the Configure button in the far‐right column. Other buttons in this column include Enable, Remove and Access Policies.

NiFi Main Components: Reporting Tasks

Reporting Tasks run in the background to provide statistical reports about what is happening in the NiFi instance. The DFM adds and configures Reporting Tasks in a way similar to Controller Services. To add a Reporting Task, select Controller Settings from the Global Menu.

NiFi Main Components: Connections
 Connections provide the linkage between processors, specifying how FlowFiles should travel between them.
 Common connections are for Success and Failure, which provide simple error handling for processors.
 FlowFiles that are processed without fault are sent to the success queue, while those with problems are sent to a failure queue.
 Additional connection types: Not Found or Retry.
 Enable back pressure via configurable upper bounds.
 Manage queued data with priority mechanisms.
 It is possible to draw a connection that loops back to the same processor (useful if the user wants the processor to try to re‐process FlowFiles that go down a failure Relationship).
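The success/failure routing and the self-looping failure connection described above can be sketched outside NiFi in a few lines of Python. This is only an illustration of the idea, not NiFi code: the `run_with_retry` and `parse_reading` names, and the cap of three attempts, are assumptions made for the example.

```python
from collections import deque


def run_with_retry(flowfiles, process, max_attempts=3):
    """Route each item to 'success' or 'failure', retrying failures.

    `process` is a hypothetical processor function that raises ValueError
    when it cannot handle an item.
    """
    queue = deque((ff, 1) for ff in flowfiles)  # (flowfile, attempt number)
    routed = {"success": [], "failure": []}
    while queue:
        flowfile, attempt = queue.popleft()
        try:
            routed["success"].append(process(flowfile))
        except ValueError:
            if attempt < max_attempts:
                # Self-loop: put the item back on the incoming queue.
                queue.append((flowfile, attempt + 1))
            else:
                routed["failure"].append(flowfile)
    return routed


def parse_reading(text):
    # Toy processor: succeeds only for numeric payloads.
    return float(text)


result = run_with_retry(["21.5", "n/a", "19"], parse_reading)
print(result)
```

In a real flow the retry limit would typically be enforced by the flow design (for example with an attribute counting attempts) rather than by the connection itself.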


NiFi Main Components: Connections ‐ Configuration
 Connections present a two‐tab configuration window. The Details tab provides information about the source and destination components (component name, component type, and the Process Group in which the component lives).


The Settings tab allows configuring the Connection’s Name, FlowFile Expiration, Back Pressure Thresholds, Load Balance Strategy and Prioritization.
 FlowFile Expiration: data that cannot be processed within the time set by this option is automatically removed from the flow.
 Load Balance Strategy: to distribute the data in a flow across the nodes in the cluster, NiFi offers the following load balance strategies:
• Do not load balance (default)
• Partition by attribute
• Round robin
• Single node (all FlowFiles will be sent to a single node in the cluster).
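To make the difference between the load balance strategies concrete, the following Python sketch assigns a few toy FlowFiles to nodes. It is a simplified model, not NiFi's implementation: the node names, the dictionary representation of a FlowFile and the CRC32-based hashing are all assumptions chosen for the example.

```python
from itertools import cycle
from zlib import crc32

NODES = ["node1", "node2", "node3"]  # hypothetical cluster node names


def round_robin(flowfiles, nodes=NODES):
    """Assign FlowFiles to the nodes in turn."""
    return [(ff["id"], node) for ff, node in zip(flowfiles, cycle(nodes))]


def partition_by_attribute(flowfiles, attribute, nodes=NODES):
    """Send FlowFiles that share an attribute value to the same node."""
    def pick(value):
        # Assumed hashing scheme, used here only to get a stable mapping.
        return nodes[crc32(str(value).encode("utf-8")) % len(nodes)]
    return [(ff["id"], pick(ff[attribute])) for ff in flowfiles]


def single_node(flowfiles, nodes=NODES):
    """Send every FlowFile to one node."""
    return [(ff["id"], nodes[0]) for ff in flowfiles]


flowfiles = [
    {"id": 1, "sensor": "A"},
    {"id": 2, "sensor": "B"},
    {"id": 3, "sensor": "A"},
    {"id": 4, "sensor": "C"},
]
rr = round_robin(flowfiles)
by_sensor = dict(partition_by_attribute(flowfiles, "sensor"))
print(rr)
print(by_sensor)
```

Partitioning by attribute is the strategy to prefer when all data for the same key (here, the same sensor) must be processed on the same node.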

 Back Pressure parameters (Object and Size thresholds) indicate how much data is allowed to exist in the queue before the component that is the source of the Connection is no longer scheduled to run. This prevents the system from being overrun with data.
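The back pressure mechanism can be illustrated with a toy queue that stops its source as soon as either threshold is reached. This is a minimal sketch under stated assumptions: the `Connection` class and `run_source` function are hypothetical, and the default thresholds (10,000 objects, 1 GB) are used only as plausible example values.

```python
class Connection:
    """Toy queue with object-count and data-size back pressure thresholds."""

    def __init__(self, max_objects=10000, max_bytes=1024 ** 3):
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.queue = []
        self.queued_bytes = 0

    def back_pressure_engaged(self):
        # Reaching either threshold is enough to pause the source.
        return (len(self.queue) >= self.max_objects
                or self.queued_bytes >= self.max_bytes)

    def offer(self, payload):
        self.queue.append(payload)
        self.queued_bytes += len(payload)

    def poll(self):
        payload = self.queue.pop(0)
        self.queued_bytes -= len(payload)
        return payload


def run_source(connection, payloads):
    """Emit payloads until back pressure stops the source; return the count."""
    emitted = 0
    for payload in payloads:
        if connection.back_pressure_engaged():
            break  # the source is no longer scheduled to run
        connection.offer(payload)
        emitted += 1
    return emitted


conn = Connection(max_objects=3, max_bytes=100)
sent = run_source(conn, [b"x" * 10] * 5)
print(sent, conn.back_pressure_engaged())
```

Once the downstream component drains the queue (`poll`), the check returns false again and the source can resume.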


Available Prioritizers: data can be prioritized in the queue so that higher priority data is processed first. Prioritizers can be dragged from the top ('Available prioritizers') to the bottom ('Selected prioritizers'). Multiple prioritizers can be selected. The prioritizer at the top of the 'Selected prioritizers' list has the highest priority. If a prioritizer is no longer desired, it can be dragged back from the 'Selected prioritizers' list to the 'Available prioritizers' list. The following prioritizers are available:
 FirstInFirstOutPrioritizer: the FlowFile that reached the connection first will be processed first.
 NewestFlowFileFirstPrioritizer: the FlowFile that is newest in the dataflow will be processed first.
 OldestFlowFileFirstPrioritizer: the FlowFile that is oldest in the dataflow will be processed first. This is the default scheme, used if no prioritizers are selected.
 PriorityAttributePrioritizer: the FlowFile with the highest priority value will be processed first. Note that an UpdateAttribute processor should be used to add the "priority" attribute to the FlowFiles before they reach a connection that has this prioritizer set. Values are alphanumeric: "a" is a higher priority than "z", and "1" is a higher priority than "9".
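The ordering rules of the four prioritizers can be sketched as sort keys. The snippet below is a simplified model, not NiFi's code: the `FlowFile` dataclass, its integer timestamps, and the plain string comparison used for the "priority" attribute are assumptions for illustration. Selecting several prioritizers is modeled as a tuple key, so the first one dominates and the next ones break ties.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    name: str
    queued_at: int       # when it reached this connection
    lineage_start: int   # when it entered the dataflow
    attributes: dict = field(default_factory=dict)


PRIORITIZERS = {
    "FirstInFirstOut": lambda ff: ff.queued_at,
    "NewestFlowFileFirst": lambda ff: -ff.lineage_start,
    "OldestFlowFileFirst": lambda ff: ff.lineage_start,
    # "~" sorts after digits and letters, so a missing priority comes last.
    "PriorityAttribute": lambda ff: ff.attributes.get("priority", "~"),
}


def order(queue, selected):
    """Return FlowFile names in processing order for the selected prioritizers."""
    def key(ff):
        return tuple(PRIORITIZERS[name](ff) for name in selected)
    return [ff.name for ff in sorted(queue, key=key)]


queue = [
    FlowFile("a", queued_at=2, lineage_start=10, attributes={"priority": "9"}),
    FlowFile("b", queued_at=3, lineage_start=30, attributes={"priority": "1"}),
    FlowFile("c", queued_at=1, lineage_start=20, attributes={"priority": "1"}),
]
print(order(queue, ["FirstInFirstOut"]))
print(order(queue, ["PriorityAttribute", "OldestFlowFileFirst"]))
```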


NiFi Variables Window
 Variables can be created and configured within the User Interface through a dedicated section. Variables can be used in any field that supports Expression Language.



NiFi Variables Scope
 A variable defined in a child group overrides the value of the same variable in a parent group.
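The scoping rule behaves like a chained lookup: the child group is searched first, then its parent. A minimal Python sketch of this behavior follows; the variable names, values and the `resolve` helper are hypothetical, and only the plain `${name}` reference form is handled.

```python
import re
from collections import ChainMap

# Hypothetical variables defined at two levels of the Process Group hierarchy.
root_vars = {"solr.host": "solr-root", "solr.port": "8983"}
child_vars = {"solr.host": "solr-child"}

# Lookups try the child group first, then fall back to the parent group.
child_scope = ChainMap(child_vars, root_vars)


def resolve(text, scope):
    """Replace each ${name} reference with the value visible in `scope`."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: scope[m.group(1)], text)


url = resolve("http://${solr.host}:${solr.port}/solr", child_scope)
print(url)
```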



NiFi Data Provenance

| Provenance Event | Description |
|---|---|
| ADDINFO | Indicates a provenance event when additional information such as a new linkage to a new URI or UUID is added |
| ATTRIBUTES_MODIFIED | Indicates that a FlowFile’s attributes were modified in some way |
| CLONE | Indicates that a FlowFile is an exact duplicate of its parent FlowFile |
| CONTENT_MODIFIED | Indicates that a FlowFile’s content was modified in some way |
| CREATE | Indicates that a FlowFile was generated from data that was not received from a remote system or external process |
| DOWNLOAD | Indicates that the contents of a FlowFile were downloaded by a user or external entity |
| DROP | Indicates a provenance event for the conclusion of an object’s life for some reason other than object expiration |
| EXPIRE | Indicates a provenance event for the conclusion of an object’s life due to the object not being processed in a timely manner |
| FETCH | Indicates that the contents of a FlowFile were overwritten using the contents of some external resource |
| FORK | Indicates that one or more FlowFiles were derived from a parent FlowFile |
| JOIN | Indicates that a single FlowFile is derived from joining together multiple parent FlowFiles |
| RECEIVE | Indicates a provenance event for receiving data from an external process |
| REPLAY | Indicates a provenance event for replaying a FlowFile |
| ROUTE | Indicates that a FlowFile was routed to a specified relationship and provides information about why the FlowFile was routed to this relationship |
| SEND | Indicates a provenance event for sending data to an external process |
| UNKNOWN | Indicates that the type of provenance event is unknown because the user who is attempting [ . . . ] |

NiFi Simple FileFlows Examples
 Listen for incoming HTTP POSTs
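To try such a flow, a client has to POST some data to the listening processor. The sketch below only builds the request, so it can be inspected without a running NiFi instance. The host, the port 8081, the `contentListener` base path and the JSON payload are assumptions about how the flow might be configured, not values taken from the slides.

```python
import json
import urllib.request


def build_post(host, port, payload, base_path="contentListener"):
    """Build (but do not send) an HTTP POST carrying a JSON payload."""
    url = f"http://{host}:{port}/{base_path}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_post("localhost", 8081, {"sensor": "A", "value": 21.5})
print(request.get_method(), request.full_url)

# To actually send it to a running flow (requires a listening processor):
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```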


 Query the Yahoo weather API and produce a JSON


NiFi Installation & Configuration
 Download the tarball of your preferred stable release (referred to as NIFI_VERSION>.bin.tar.gz) from the NiFi repository: http://nifi.apache.org/download.html
 Untar and extract it to the location from which you want to run the application (referred to as ):
$ tar zxf .bin.tar.gz

 To run NiFi in the background, run: $ bin/nifi.sh start

 To stop the application, run: $ bin/nifi.sh stop


NiFi Complete Flow Example @ DISIT Lab: Collect, Extract and Index Data Traffic Logs (back‐end for AMMA Dashboard)


Syslog: Collection For AMMA

Many NiFi flow examples available on the Web!

 Indexing Tweets with NiFi and Solr https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

 NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

SUMMARY

4. Distributed Storage and Indexing – Configure Zookeeper

Setting Up the Environment

Kibana / Banana

Cloud Distributed Configuration Files


Prerequisites (to be done for each Cluster node)
 Set up 3 or more VMs or physical hosts connected to the same LAN. These machines will constitute the nodes of the distributed cluster.

 Install Java on all the cluster nodes (if not already installed):
$ sudo apt update
$ sudo apt install openjdk-8-jdk
This will install OpenJDK 8. We will refer to the installation folder as

 Once the installation is complete, check the installed Java JDK version:
$ java -version


 On every node of the cluster, in order to set up the PATH and JAVA_HOME variables, add the following entries to the ~/.bashrc file:
export JAVA_HOME=
export PATH=$PATH:$JAVA_HOME/bin

 Now apply all the changes into the current running system.

$ source ~/.bashrc

 Add an entry for each cluster node to the /etc/hosts file. For example:
192.168.1.1 hbase-nifi-solr-shard1
192.168.1.2 hbase-nifi-solr-shard2
192.168.1.3 hbase-nifi-solr-shard3


Zookeeper Installation & Configuration
 Download the tarball of your preferred stable release of Zookeeper (referred to as .tar.gz in the following) from: https://zookeeper.apache.org/releases.html

 Untar the application to a folder of your choice (referred as ) : $ sudo tar -xf .tar.gz -C

 Create a configuration file, e.g. zoo.cfg, if not present, or edit the existing one by adding or modifying the following entries:
tickTime=2000
dataDir=/data
clientPort=2181
initLimit=5
syncLimit=2
server.1=:2888:3888
server.2=:2888:3888
server.3=:2888:3888
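Since the same zoo.cfg has to be present on every node, it can be convenient to generate it from the list of ensemble hosts. The following Python sketch renders the entries listed above. The `render_zoo_cfg` helper and the data directory path are assumptions; the host names are the example ones used for /etc/hosts.

```python
def render_zoo_cfg(hosts, data_dir, client_port=2181,
                   tick_time=2000, init_limit=5, sync_limit=2):
    """Render a zoo.cfg for an ensemble made of the given hosts."""
    lines = [
        f"tickTime={tick_time}",
        f"dataDir={data_dir}",
        f"clientPort={client_port}",
        f"initLimit={init_limit}",
        f"syncLimit={sync_limit}",
    ]
    # server.N entries: 2888 is the peer port, 3888 the leader-election port.
    for server_id, host in enumerate(hosts, start=1):
        lines.append(f"server.{server_id}={host}:2888:3888")
    return "\n".join(lines) + "\n"


hosts = [
    "hbase-nifi-solr-shard1",
    "hbase-nifi-solr-shard2",
    "hbase-nifi-solr-shard3",
]
# The data directory below is a hypothetical location.
cfg = render_zoo_cfg(hosts, data_dir="/var/lib/zookeeper/data")
print(cfg)
```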

 Create, if it does not exist, or edit the file myid in the folder specified by the dataDir parameter in the zoo.cfg configuration file:
$ sudo touch /data/myid

 Each Zookeeper server should have a unique number in the myid file. For example, server 1 will have value 1, server 2 will have value 2 and so on.

$ sudo sh -c "echo '1' > /data/myid"    # on server 1
$ sudo sh -c "echo '2' > /data/myid"    # on server 2
$ sudo sh -c "echo '3' > /data/myid"    # on server 3
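The number written to myid must match the N of the `server.N` line that names the same host in zoo.cfg. A small Python sketch of that lookup is shown below; the `myid_for` helper and the sample configuration text (with the example host names from /etc/hosts) are assumptions for illustration.

```python
import re


def myid_for(host, zoo_cfg_text):
    """Return the N of the `server.N=host:...` line matching `host`."""
    for line in zoo_cfg_text.splitlines():
        match = re.match(r"server\.(\d+)=([^:]+):", line.strip())
        if match and match.group(2) == host:
            return int(match.group(1))
    raise LookupError(f"{host} is not listed in the ensemble")


ZOO_CFG = """\
tickTime=2000
clientPort=2181
server.1=hbase-nifi-solr-shard1:2888:3888
server.2=hbase-nifi-solr-shard2:2888:3888
server.3=hbase-nifi-solr-shard3:2888:3888
"""

print(myid_for("hbase-nifi-solr-shard2", ZOO_CFG))
```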

 Start the Zookeeper server ensemble:
$ /bin/zkServer.sh start
which is equivalent to running:
$ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf \
    org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg


 Set the Java heap size (important to avoid Zookeeper swapping, which significantly degrades performance). As a conservative rule, use a maximum heap size of no more than 3 GB on a 4 GB machine. To do this, create the file java.env in /conf/ and add the following entry (here with a 2 GB maximum heap):

export JVMFLAGS="-Xmx2048m"

 Restart the Zookeeper service.

 To stop the Zookeeper servers’ ensemble run: $ /bin/zkServer.sh stop

4. Distributed Storage and Indexing – Hbase Storing & Solr Indexing


HBase Installation & Configuration
 HBase can be installed in three different modes:
1. Standalone mode
2. Pseudo‐Distributed mode (Single‐node Hadoop system + HBase installation)
3. Fully‐Distributed mode (Multi‐node Hadoop system + HBase installation)

 On every node of the cluster, download your preferred stable release (referred to as .tar.gz) from: https://hbase.apache.org/downloads.html

 Untar the package in your desired folder by executing:

$ tar -xvf hbase-1.1.2-bin.tar.gz -C


 Open /conf/hbase-site.xml and place the following properties inside:

With Hadoop cluster:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://</value>
</property>

With NO Hadoop cluster:
<property>
  <name>hbase.rootdir</name>
  <value>file://</value>
</property>

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>false: standalone and pseudo-distributed setups with managed Zookeeper.
  true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)</description>
</property>

[ . . . ]


[ . . . ]
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
  <description>Property from ZooKeeper's config zoo.cfg. The port at which the clients will connect.</description>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>,,</value>
</property>
Note: the quorum must list the nodes where the Zookeeper peers are running.

<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/data</value>
  <description>Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored.</description>
</property>
[ . . . ]
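As a cross-check of the property layout, the Zookeeper-related entries of hbase-site.xml can also be produced programmatically. The Python sketch below builds the XML with the standard library. The `hbase_site` helper is hypothetical, and the quorum uses the example host names from /etc/hosts as an assumption.

```python
import xml.etree.ElementTree as ET


def hbase_site(properties):
    """Render name/value pairs as an hbase-site.xml <configuration> block."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


hosts = [
    "hbase-nifi-solr-shard1",
    "hbase-nifi-solr-shard2",
    "hbase-nifi-solr-shard3",
]
xml_text = hbase_site({
    "hbase.cluster.distributed": "true",
    "hbase.zookeeper.property.clientPort": "2181",
    # The quorum is a comma-separated list of the Zookeeper hosts.
    "hbase.zookeeper.quorum": ",".join(hosts),
})
print(xml_text)
```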

 Open the file /conf/hbase-env.sh and set the JAVA_HOME variable to your Java installation folder:
export JAVA_HOME=
 HBase contains its own instance of Zookeeper. If a standalone version of Zookeeper is installed, configure HBase so that it does not manage its own instance, by setting the following parameter in the /conf/hbase-env.sh file:
export HBASE_MANAGES_ZK=false
 Finally, start the HBase daemons (if running on a Hadoop cluster, start the Hadoop daemons first by using the ./start-all.sh command):
$ /bin/start-hbase.sh


Solr Cloud Installation & Configuration
On every node of the cluster, follow these instructions:
 Choose your preferred stable release to download at: http://archive.apache.org/dist/lucene/solr/
 Extract the Solr distribution archive .tgz to a chosen directory :
$ tar zxf .tgz
 To start Solr, navigate to and run:
$ bin/solr start
 Install Solr as a service: for this purpose, Solr includes a service installation script. Just run:
$ bin/install_solr_service.sh


 Check the Solr configuration file (typically /etc/default/solr.in.sh) and add or edit the Zookeeper host parameter:
ZK_HOST=":2181,:2181,:2181/solr"

 Check also the Java home parameter:
SOLR_JAVA_HOME=""
 Make a copy of the schema template in the Solr configuration folder:
$ cp -a /srv/solr/server/solr/configsets/basic_configs/ /srv/solr/server/solr/configsets/zk_configs
 Start Solr Cloud:
$ /bin/solr start -c -z :2181,:2181,:2181/solr
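The Zookeeper connection string used for ZK_HOST and for the -z option is a comma-separated list of host:port pairs, without spaces, followed by an optional chroot path. A short Python sketch of how such a string can be assembled follows; the `zk_connect_string` helper is hypothetical and the host names are the example ones from /etc/hosts.

```python
def zk_connect_string(hosts, port=2181, chroot="/solr"):
    """Join host:port pairs with commas and append the chroot path once."""
    return ",".join(f"{host}:{port}" for host in hosts) + chroot


zk_host = zk_connect_string([
    "hbase-nifi-solr-shard1",
    "hbase-nifi-solr-shard2",
    "hbase-nifi-solr-shard3",
])
print(zk_host)
```

Note that the chroot ("/solr") appears only once, at the end of the whole list, not after each host.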


 Define a new data schema for the new Solr sharded collection we are going to create. For this purpose, create a new configuration folder for the new collection by copying the default one (/server/solr/configsets/basic_configs):
$ cp -avr /server/solr/configsets/basic_configs /server/solr/configsets/

 Edit the managed-schema.xml file in the /server/solr/configsets//conf folder according to your data model (add the fields, with their Solr types, that will form the schema for all documents indexed in the collection):

Install Solr Cloud with Zookeeper

EventLogger Data Classification Model

| Output | Input | Function | Description |
|---|---|---|---|
| Tstmp_o | Tstmp_i | φ_l1 | Timestamp in Unix‐Epoch milliseconds. |
| PidLoc_o | PidLoc_i | φ_l2 = I | Process ID Container of the logging Microservice (i.e.: IoT device / process / application / service). |
| ComMode_o | ComMode_i | φ_l3 = I | Communication Mode, indicating if the logging Microservice is transmitting or receiving. |
| Agent | PidLoc_i | φ_l4 | Agent type (i.e.: Node‐RED application, ETL, Data Analytics etc.). |
| Lat_o | SrvUri_i | φ_l5 | Latitude of the logging device or virtual machine, obtained from the input SrvUri_i by calling the Smart City API. |
| Lng_o | SrvUri_i | φ_l5 | Longitude of the logging device or virtual machine, obtained from the input SrvUri_i by calling the Smart City API. |
| GeoLoc | Lat_i, Lng_i | φ_l6 | Special geolocation format for representing the logging Microservice on a map and for geographical faceting functionalities. |
| SrvUri_o | SrvUri_i | φ_l7 = I | URI of the device / service involved in the process, as represented in the Km4City Ontology. |
| IpLoc_o | IpLoc_i | φ_l8 | IP of the logging Microservice, provided in IPv4 or DNS format. The φ_l8 function performs encoding check and adjustment. |
| IpExt_o | IpExt_i | φ_l8 | IP of the external host to / from which the currently logging Microservice (represented by PidLoc_i) is transmitting / receiving data, according to the ComMode_i parameter. The φ_l8 function performs encoding check and adjustment. |
| SrvScope | IpLoc_i, IpExt_i | φ_l9 | This parameter indicates whether the Microservice is Internal or External. |
| Motiv_o | Motiv_i | φ_l10 | Motivation, that is the aim of the logging Microservice (Filesystem or DB storage, Dashboard management, Smart City API call etc.). |
| PayLoad_o | PayLoad_i | φ_l11 | Measure of data flow traffic (transmitted or received, depending on the ComMode_i parameter) between IpLoc_i and IpExt_i. It is expressed in KB. |
| AppName_o | AppName_i | φ_l12 = I | Name of the Microservice application. |
| Msg_o | Msg_i | φ_l13 | Text message for additional logging notes. |


 Set the new managed-schema configuration for the whole sharded collection by uploading it through Zookeeper:
$ /server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -confdir /server/solr/configsets/zk_configs/conf/ -confname zk-conf -z :2181,:2181,:2181/solr

 Create a new Solr sharded collection with the configuration uploaded to Zookeeper by using the Solr Collections API:
$ curl 'http://:8983/solr/admin/collections?action=CREATE&name=&numShards=3&replicationFactor=3&collection.configName=zk-conf'
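Long Collections API URLs are easy to break when typed by hand (a stray space is enough), so it can help to assemble them from their parameters. The Python sketch below builds the CREATE request URL shown above without sending it. The `create_collection_url` helper, the host name and the collection name "syslog" are assumptions for illustration; the query parameters are the ones used in the curl command.

```python
from urllib.parse import urlencode


def create_collection_url(host, name, config_name,
                          num_shards=3, replication_factor=3, port=8983):
    """Build the Collections API URL that creates a sharded collection."""
    query = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    })
    return f"http://{host}:{port}/solr/admin/collections?{query}"


url = create_collection_url("hbase-nifi-solr-shard1", "syslog", "zk-conf")
print(url)
```

The resulting string can be passed to curl (in single quotes, so that the shell does not interpret the & characters) or to any HTTP client.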


 A new Solr sharded collection can equivalently be created from the command line on one of the cluster nodes (for example in ):
$ /bin/solr create -c
 Then, it is necessary to spawn replicas of the newly created collection on the other servers with the Solr Collections API:

$ curl -XGET 'http://:8983/solr/admin/collections?action=ADDREPLICA&collection=portal&shard=shard1&node=server2:8983_solr'
$ curl -XGET 'http://:8983/solr/admin/collections?action=ADDREPLICA&collection=portal&shard=shard1&node=server3:8983_solr'

Navigate the Solr web interface in the Cloud Tab:


zkui (ZooKeeper User Interface): a graphical UI for monitoring the Zookeeper shared configuration for Solr: https://github.com/echoma/zkui
zkui is a cross‐platform GUI frontend for managing operations on Zookeeper clusters. Zookeeper exposes a shared hierarchical namespace, organized similarly to a standard file system.

Syslog: Collection For AMMA
SensoriRT‐v2: Collection For DevDash

 Browse the ZooKeeper node tree and edit a node's data.
 Copy a node to a new path recursively.
 Delete a node and all its children.
 Monitor the coherence of configuration files in the cluster.

managed-schema

Monitor cluster resources and select different collections:

Make queries on the selected collection:

Syslog: Collection For AMMA


SUMMARY

5. Producing Visual Analytic Tools – Banana Dashboards


 The open source Banana project is a fork of Kibana.
 It works with time‐series data stored in Apache Solr (upon which it is actually installed).
 It includes powerful features, such as D3.js (Data‐Driven Documents), supporting dynamic and interactive views of structured data.
 It is based on AngularJS, which simplifies and enhances the MVC (Model‐View‐Controller) paradigm in the development of web‐based and mobile applications.


Simplify and enhance the MVC (Model-View-Controller) paradigm in the development of web-based and mobile applications!

Angular JS “Hello World” Example

Hello Angular

Enter your name:

Hello {{name}}!


https://plnkr.co/edit/?p=preview


Banana Web-app Installation & Configuration
• Download the Banana .zip archive (banana-release.zip) from the GitHub repository: https://github.com/lucidworks/banana
• Create the folder /server/solr-webapp/webapp/banana, if it does not exist:
  $ mkdir -p /server/solr-webapp/webapp/banana
• Install the zip/unzip packages, if not already installed:
  $ sudo apt-get install zip unzip
• Unzip the Banana archive and place its contents in the folder created earlier:
  $ unzip banana-release.zip -d /server/solr-webapp/webapp/banana
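The installation steps above can also be scripted. The following is a minimal Python sketch, not part of the Banana distribution: the function name install_banana and its arguments are hypothetical, and the web-app directory has to be adapted to the actual Solr installation.

```python
import zipfile
from pathlib import Path


def install_banana(archive: Path, webapp_dir: Path) -> Path:
    """Extract the Banana archive into <webapp_dir>/banana and return that folder.

    Hypothetical helper: `archive` is the downloaded banana-release.zip and
    `webapp_dir` is the Solr web-app folder (.../server/solr-webapp/webapp).
    """
    target = webapp_dir / "banana"
    # Same effect as `mkdir -p`: create the folder only if it is missing.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Same role as `unzip banana-release.zip -d <target>`.
        zf.extractall(target)
    return target
```

Calling install_banana(Path("banana-release.zip"), Path("/server/solr-webapp/webapp")) mirrors the mkdir and unzip commands listed above.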

• Open the Banana web app in a browser at the following URL, where <host> is the name or IP address of the machine running Solr:
  http://<host>:8983/solr/banana/#/dashboard
• Start creating a new dashboard.

• You can also easily create custom dashboards from scratch. See the tutorial on building a custom panel: https://github.com/lucidworks/banana/wiki/Tutorial:-How-to-Build-a-Custom-Panel

Banana Web-app
• In the Dashboard Settings panel, click the Solr tab to choose the data collection.

Example: the collection used for DevDash.

Banana - AMMA & DevDash Widgets and Functionalities
• Time window, with three options: Relative, Absolute, Since.
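As an illustration of what the three time window options mean on the Solr side, the sketch below builds a filter query (fq) for each mode. It is only an assumption about how such a window can be expressed, not the exact query Banana generates; the field name date_time is hypothetical.

```python
def time_window_filter(field, mode, span=None, start=None, end=None):
    """Return a Solr filter query (fq) restricting `field` to a time window."""
    if mode == "relative":
        # Last `span` of data, written in Solr date math, e.g. span="15MINUTES".
        return f"{field}:[NOW-{span} TO NOW]"
    if mode == "absolute":
        # Fixed interval between two ISO 8601 UTC instants.
        return f"{field}:[{start} TO {end}]"
    if mode == "since":
        # Everything from a fixed instant up to the current time.
        return f"{field}:[{start} TO NOW]"
    raise ValueError(f"unknown time window mode: {mode!r}")


print(time_window_filter("date_time", "relative", span="1HOUR"))
print(time_window_filter("date_time", "since", start="2019-01-01T00:00:00Z"))
```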

• Search Query
• Total Hits, with its own panel options
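A search query combined with a hit counter corresponds, roughly, to a Solr select request that returns no documents and reads the numFound value of the response. The sketch below assumes a Solr base URL, a collection name (devdash) and a sample response that are all hypothetical.

```python
import json
from urllib.parse import urlencode


def hits_request_url(solr_url, collection, query="*:*", filters=()):
    """Build a select URL that asks only for the number of matching documents."""
    # rows=0: no documents are returned, only the response header and numFound.
    params = [("q", query), ("rows", "0"), ("wt", "json")]
    params += [("fq", fq) for fq in filters]
    return f"{solr_url}/{collection}/select?{urlencode(params)}"


def total_hits(response_text):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


# Hypothetical host, collection and query; the response is a hand-written sample.
url = hits_request_url("http://localhost:8983/solr", "devdash", "status:error")
sample = '{"response": {"numFound": 42, "start": 0, "docs": []}}'
print(url)
print(total_hits(sample))
```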

• Facet: in the Facet panel, users select facet fields; the selection automatically filters all the other widgets and panels in the same dashboard.
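The filtering behaviour of the Facet panel can be sketched with standard Solr field faceting: facet.field asks for value counts, and each selected value becomes a filter query shared by all the other requests. The field names below (protocol, device_type) are hypothetical, and this is not necessarily the exact request issued by the dashboard.

```python
from urllib.parse import urlencode


def facet_params(facet_fields, selected=None):
    """Build Solr parameters for field faceting plus the active selections."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
    # One facet.field parameter per field shown in the facet panel.
    params += [("facet.field", name) for name in facet_fields]
    for name, value in (selected or {}).items():
        # Quote the value as a phrase, escaping backslashes and double quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{name}:"{escaped}"'))
    return params


params = facet_params(["protocol", "device_type"], {"protocol": "MQTT"})
print(urlencode(params))
```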

• Bar/Line Histogram: this widget is useful for monitoring trends over time (as counts or cumulative values). The graph is stacked and grouped by a field chosen by the user.
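The computation behind such a histogram can be illustrated in plain Python: events are assigned to fixed time buckets and grouped by a chosen field, producing one series per group that can then be stacked. This is a sketch of the idea only, not the widget's implementation; the event keys (timestamp, protocol, bytes) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def stacked_histogram(events, bucket, group_field, value_field=None):
    """Return {group: {bucket_start: total}} for timezone-aware events.

    With value_field=None each event counts as 1; otherwise the values of
    `value_field` are summed inside each bucket.
    """
    series = defaultdict(lambda: defaultdict(int))
    for event in events:
        # Floor the timestamp to the start of its bucket.
        index = (event["timestamp"] - EPOCH) // bucket
        start = EPOCH + index * bucket
        amount = 1 if value_field is None else event[value_field]
        series[event[group_field]][start] += amount
    return {group: dict(sorted(b.items())) for group, b in series.items()}
```

Passing bucket=timedelta(hours=1) and group_field="protocol" yields one hourly series per protocol; adding value_field="bytes" switches from counts to summed values.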

• Terms: bar and pie charts showing how data is distributed over the values of a facet field, either by count or by an aggregate of another field (sum, mean, max, min, etc.).
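One way to obtain per-term statistics from Solr is the JSON Facet API, where a terms facet carries nested aggregation functions. The sketch below only builds the request parameters; it is an assumption about how such a distribution can be requested, not the query the Terms panel actually sends, and the field names are hypothetical.

```python
import json


def terms_stats_facet(term_field, value_field, limit=10):
    """Build Solr parameters for a terms facet with per-bucket statistics."""
    facet = {
        "by_term": {
            "type": "terms",
            "field": term_field,
            "limit": limit,
            # Each bucket also reports its document count by default.
            "facet": {
                "sum": f"sum({value_field})",
                "mean": f"avg({value_field})",
                "max": f"max({value_field})",
                "min": f"min({value_field})",
            },
        }
    }
    return {"q": "*:*", "rows": "0", "json.facet": json.dumps(facet)}


print(terms_stats_facet("device_type", "bytes")["json.facet"])
```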

• Sunburst: a multi-level ring/pie chart that lets the user easily visualize multi-level faceting; the nesting of the rings follows the order in which the facet fields are given.
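Multi-level faceting of this kind maps naturally onto Solr pivot facets, where the order of the fields in facet.pivot defines the nesting. The sketch below builds the parameters and converts a pivot result into a nested tree of the sort a sunburst chart consumes. The field names and the sample buckets are made up for illustration.

```python
def pivot_params(fields):
    """Build Solr parameters for a pivot facet; field order defines nesting."""
    return [("q", "*:*"), ("rows", "0"), ("facet", "true"),
            ("facet.pivot", ",".join(fields))]


def pivot_to_tree(pivot_buckets, root_name="all"):
    """Convert Solr pivot buckets into a nested {name, count, children} tree."""
    def node(bucket):
        out = {"name": str(bucket["value"]), "count": bucket["count"]}
        # Inner levels, when present, sit under the "pivot" key of each bucket.
        children = [node(child) for child in bucket.get("pivot", [])]
        if children:
            out["children"] = children
        return out

    return {"name": root_name, "children": [node(b) for b in pivot_buckets]}


# Hand-written sample in the shape of a facet_pivot entry for "protocol,device_type".
sample = [
    {"field": "protocol", "value": "MQTT", "count": 3, "pivot": [
        {"field": "device_type", "value": "sensor", "count": 2},
        {"field": "device_type", "value": "actuator", "count": 1},
    ]},
]
print(pivot_to_tree(sample))
```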

• Bettermap: shows geolocated data on a map.
• SmartCityMap: a widget obtained by modifying Bettermap. It shows enriched information on geolocated data, retrieved through the Km4City Smart City API developed at DISIT Lab, and also provides geo-faceting capabilities.
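Restricting results to a map area can be expressed in Solr with a spatial filter. The sketch below uses the geofilt query parser to keep documents within a given distance of a point; it illustrates the general idea of geographic filtering and is not the SmartCityMap implementation. The field name location is hypothetical.

```python
def geo_filter(field, lat, lon, radius_km):
    """Return a Solr fq keeping documents within radius_km of (lat, lon)."""
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("latitude or longitude out of range")
    if radius_km <= 0:
        raise ValueError("radius must be positive")
    # geofilt: sfield = spatial field, pt = "lat,lon", d = distance in km.
    return f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}"


# Five kilometres around the centre of Florence.
print(geo_filter("location", 43.77, 11.25, 5))
```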

• Table: a table of data from the Solr index, with selectable columns, sorting by column values, clickable URL fields that redirect the user to the corresponding web link, configurable pagination, etc.
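Column selection, ordering and pagination correspond to the standard Solr parameters fl, sort, start and rows. The sketch below shows how a table page could be requested; the column names are hypothetical and the function is an illustration rather than the widget's code.

```python
def table_params(columns, sort_field, descending=False, page=0, page_size=10):
    """Build Solr parameters for one page of a sorted, column-limited table."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    direction = "desc" if descending else "asc"
    return {
        "q": "*:*",
        "fl": ",".join(columns),          # columns to return
        "sort": f"{sort_field} {direction}",
        "start": str(page * page_size),   # offset of the first row of the page
        "rows": str(page_size),           # rows per page
    }


print(table_params(["date_time", "device_id", "url"], "date_time",
                   descending=True, page=2, page_size=25))
```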

Some Use Cases (A)

Some Use Cases (B): panels (a) and (b)

Other Technologies / Approaches

Possible integration between different frameworks

• NiFi as a Producer: NiFi acts as a Kafka producer.

• NiFi as a Consumer: in some scenarios an organization may already have an existing pipeline bringing data to Kafka. In this case NiFi can take on the role of a consumer and handle all of the logic for taking data from Kafka to wherever it needs to go.

4. Distributed Storage and Indexing – HBase Storing & Solr Indexing

NiFi Indexing to Elasticsearch

Performance Comparison: Solr vs. Elasticsearch

[Figure: charts with Elasticsearch and Solr series]