WHITE PAPER IoT and Big Data

With the incredible growth of the Internet of Things (IoT), Big Data Introduction technologies play a vital role in processing the massive amounts of data IoT Big Data Characteristics that will be generated from Internet-connected things. IoT and Big Data are bringing significant business benefits for the early adopters. IoT Big Data Architecture

IoT Big Data Applications Data from IoT poses unique characteristics and challenges for data analytics. The typical architecture of an IoT Big Data ecosystem Conclusion starts from high-speed ingestion, data storage and covers analytics and reporting from live and historic data. Tools from Azure® and AWS® as well as open source alternatives are available for implementing an IoT Big Data ecosystem. WHITE PAPER | IOT AND BIG DATA

Introduction

The incredible growth of the Internet of Things (IoT) is changing our world. Two decades ago, the Internet was confined to computers. Since then, it has encompassed mobile devices including phones, smartphones, and tablets and today even the ubiquitous automobiles, TVs and refrigerators. As this connectivity expands, it provides an opportunity to better understand the relationships between disparate data, derive new insights and make informed decisions. Big Data technologies and techniques play a vital role in processing the massive amounts of data that are generated from Internet-connected things.

In this white paper, we examine the unique characteristics of IoT data and challenges, the typical architecture of a Big Data system and the tools available to implement this, and finally some real- world applications of IoT and Big Data.

www.thinxtream.com 2 WHITE PAPER | IOT AND BIG DATA

IoT Big Data Characteristics

It is important to understand the key characteristics of IoT and associated challenges for designing an architecture for a big data system.

Number of IoT Devices: With cost reduction, comes increased adoption and tremendous scale. We can expect tens of billions of connected devices in the future. IoT products and solutions are going to be tested on a massive scale like never before.

Multiple IoT Devices & Data Types: IoT is a system of systems and often the devices come from multiple manufacturers. Data from an individual device manufacturer or model may be quite dissimilar from that of a nearly identical device. Moreover, they may produce structured, semi- structured, and unstructured data with a wide variety of data types such as multiple text formats like XML, JSON, plain text, audio, video, sensory data, etc.

Veracity of Data: In an IoT system, there will be instances where IoT devices either send spurious data or completely fail, resulting in data aberrations, or failure to perform a required control function. Veracity refers to the quality, consistency and trustworthiness of the data, which impacts the accuracy of analytics.

www.thinxtream.com 3 WHITE PAPER | IOT AND BIG DATA

Update Frequency: Devices produce data at varied frequencies. While remote sensors produce data at a low frequency, more sophisticated things like a car produce it at a high frequency. Given this high rate of data reporting, you need advanced tools and technologies to efficiently analyze reported data.

Historical Data: Often Big Data insights are derived from current data along with historical data recorded from IoT devices to enable smart monitoring and control. However, many times you may not have the history yet and building this will require time.

Context Data: Context adds significant value to the collected IoT data. It can be either static/low update frequency information such as location or could be more dynamic like weather data. Establishing a relationship between these may pose challenges.

Privacy Issues: Storing IoT data collected from connected devices brings substantial risk relating to data privacy. Sometimes this data is shared with external systems for creating new applications. When this is done, IoT systems should ensure that end-users always remain in control of their personal information through security policies.

www.thinxtream.com 4 WHITE PAPER | IOT AND BIG DATA

IoT Big Data Architecture

Command Devices & Control Provisioning API Application Backend

Device Registry

Stream Processing

Field Cloud Devices Gateway Gateway

Cold Hot Path Machine Analytics & Reporting Storage Analytics Notifications Learning

Batch Analytic Analytics Store

Data Orchestration

Data Security & Privacy

Copyright © 2018, Microsoft Corp. Figure 1. Big Data Architecture for IoT

www.thinxtream.com 5 WHITE PAPER | IOT AND BIG DATA

Solving the challenges concerning Big Data from IoT systems demands well-thought-out design. Many big players have provided reference architectures, which is an outcome of their research and learnings from actual implementation.

Based on a logical Big Data architecture for IoT from Microsoft®, let us review the following logical blocks and the options available from Azure®, AWS® and open source.

Cloud Gateway: The Cloud Gateway captures and stores messages coming from devices. It acts as a buffer for messages, implements message queuing, reliable delivery and scale-out processing. Cloud gateway options include Azure® Event Hubs, Azure IoT Hub, AWS® Device Gateway, Amazon Kinesis®, and open source alternatives such as ®.

Stream Processing: After capturing real-time stream, the solution must filter, aggregate, and prepare data for the next step and pass it to subscribers. Azure Stream Analytics and Amazon Kinesis provide a managed stream processing service to run SQL queries on a continuous stream of data. Streaming technologies like ™ and ™ in an HDInsight cluster are some of the open source alternatives.

www.thinxtream.com 6 WHITE PAPER | IOT AND BIG DATA

For near real time analysis of the data stream, a speed layer or hot path examines and detects anomalies, recognizes patterns over time windows, or triggers alerts based on specific stream conditions.

Cold Storage: Data for non-real time processing such as batch mode is typically stored in an economical distributed file store (referred to as Data Lake) that can hold high volumes of large files in various formats. Options for implementing this storage include Azure Data Lake Store, blob containers in Azure Storage, Amazon S3®, and Amazon Glacier®.

Batch Analytics: For large sets of data, and non-real time use cases, batch jobs are the best fit for filtering, aggregating and preparing data for further analysis. These batch jobs generally involve steps such as reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using AWS Glue® with Amazon Athena® combinations, or open source ™, ™, or custom Map/Reduce jobs in an HDInsight ® cluster or Amazon Elastic MapReduce.

www.thinxtream.com 7 WHITE PAPER | IOT AND BIG DATA

Analytical Data Store: Big Data solutions generally prepare data for analysis and then provide an interface to query this structured data using analytical tools. Azure SQL Data Warehouse and Amazon Redshift® provide a managed service for large-scale, cloud-based data warehousing.

Analysis & Reporting: The key objective of Big Data solutions is to deliver insights into the data through analysis and reporting. For doing this, a data modeling layer is essential. Azure Analysis Service, for instance, provides a multidimensional Online Analytical Processing (OLAP) cube or tabular data models. Tools such as Microsoft Power BI® and Amazon Quicksight® aid in visualizing the results. For data scientists or data analysts, interactive data exploration may be supported with analytical notebooks such as Jupyter™.

www.thinxtream.com 8 WHITE PAPER | IOT AND BIG DATA

Orchestration: Most big data solutions consist of a fixed workflow of transforming data, routing data between multiple sources and destinations, loading the data into an analytical data store and generate a report/update dashboard. To automate these repeat steps, you can use orchestration technology.

The table overleaf summarizes the various Big Data offering from open source and commercial platforms for each of the blocks (and more) in the architecture diagram.

www.thinxtream.com 9 Open Source AWS ® Microsoft Azure ® Batch Ingest Apache ™ AWS Data Transfer Services Import/Export Service File Transfer (various options) Data Factory ™ StreamSets ™ Streaming Ingest Apache Flume Amazon Kinesis Firehose Event Hubs StreamSets IOT Hub

Persistent HDFS ™ Amazon S3, Amazon Glacier Storage Blob Storage RDBMS Amazon RDS HDFS SQL Database Transient Storage Apache Kafka Amazon Kinesis Event Hubs IOT Hub HDInsight (Kafka) Batch Processing Apache Hive Amazon Elastic MapReduce Azure Batch ®, Apache (EMR) Spark HDIns ight (Spark/Map

Open Source AWS® Microsoft Azure ® Spark Amazon EMR Hadoop Reduce) Batch Ingest Apache Sqoop™ AWS Data Transfer Services Import/Export Service File Transfer (various options) Data Factory Apache Flume™ StreamSets ™ Streaming Ingest Apache Flume Amazon Kinesis Firehose Event Hubs StreamSets IOT Hub Hadoop MapReduce ™ Amazon EMR Presto SQL Data Warehouse Persistent HDFS™ Amazon S3, Amazon Glacier Storage Blob Storage RDBMS Amazon RDS HDFS SQL Database Transient Storage Apache Kafka Amazon Kinesis Event Hubs ® IOT Hub HDInsight (Kafka) PostgreSQL AWS Batch Data Lake Analytics Batch Processing Apache Hive Amazon Elastic MapReduce Azure Batch Apache Flink®, Apache (EMR) Spark HDInsight (Spark/Map Spark Amazon EMR Hadoop Reduce) WHITE PAPER | IOT AND BIG DATA Hadoop MapReduce ™ Amazon EMR Presto SQL Data Warehouse PostgreSQL ® AWS Batch Data Lake Analytics Amazon Redshift Azure Functions Amazon Redshift Azure Functions Stream Apache Flink Amazon Kinesis Streams Stream Analytics Processing Apache Spark Amazon Kinesis Analytics HDInsight (Storm, Spark) Open Source AWS ® Microsoft Azure ® OpenApache Source Beam Amazon AWS EMR® Spark Microsoft Azure ® Batch Ingest Apache Sqoop ™ AWS Data Transfer Services Import/Export Service BatchMachine Ingest Learning ApacheScikit-learn™ Sqoop ™ AWS DataAmazon Transfer LexServices Import/ExportAzure ServiceML File Transfer (various options) Data Factory TensorFlow™File Transfer (variousAmazon options) Polly CognitiveData Factory Services Apache Flume ™ AWS Lake Formation ApacheApache Spark Flume MLLib ™ Amazon Rekognition StreamSets ™ TensorFlowStreamSets , etc. ™ Amazon Machine Learning Huge number of libraries Amazon SageMaker® Streaming Ingest Apache Flume Amazon Kinesis Firehose Event Hubs Streaming Ingest Apache Flume Amazon Kinesis Firehose Event Hubs StreamSets IOT Hub Serving Storage JanusGraphStreamSets® Amazon Neptune CosmosDBIOT Hub Graph Persistent HDFS ™ Amazon S3, Amazon Glacier Storage Blob ServingPersistent Storage Apache ImpalaHDFS® ™ + Apache AmazonAmazon S3, Amazon Redshift Glacier SQL StorageData Warehouse Blob StreamStorage Apache RDBMS Flink Amazon Amazon Kinesis Streams RDS Stream Analytics HDFS Storage RDBMS Amazon RDS HDFS BI/EDW Kudu™ Amazon Athena Analysis Services (OLAP SQL Database SQLCubes) Database Transient Storage Apache Kafka Amazon Kinesis Event Hubs ServingTransient Storage Storage ™ Kafka AmazonAmazon CloudSearch Kinesis AzureEvent SearchHubs IOT Hub Search (keywords Amazon Elasticsearch IOT Hub HDInsight (Kafka) + facets) HDInsight (Kafka) Serving Storage PostgreSQL Amazon RDS SQL DB Batch Processing Apache Hive Amazon Elastic MapReduce Azure Batch Batch Processing Apache Hive Amazon Elastic MapReduce Azure Batch RDBMS Processing Apache Spark® Amazon Kinesis Analytics HDInsight (Storm, Spark) ® Apache Flink , Apache (EMR) Spark HDIns ight (Spark/Map Apache Flink , Apache (EMR) Spark HDIns ight (Spark/Map ® Spark Amazon EMR Hadoop Reduce) Serving Storage ApacheSpark HBase AmazonAmazon EMRDynamoDB Hadoop HDInsightReduce) (HBase) Hadoop MapReduce ™ Amazon EMR Presto SQL Data Warehouse NoSQL Hadoop MapReduce ™ Amazon EMR Presto SQL DataCosmosDB Warehouse PostgreSQL ® AWS Batch Data Lake Analytics Sandboxes ApachePostgreSQL Zeppelin™® AmazonAWS EMR Batch Zeppelin DataAzure Lake Notebooks Analytics Amazon Redshift Azure Functions Notebook Amazon Redshift Azure Functions Clients/Data Apps Superset (BI) Amazon Quicksight PowerBI Stream ApacheApache Beam Flink AmazonAmazon Kinesis EMR Spark Streams Stream Analytics Stream Apache Flink Amazon Kinesis Streams Stream Analytics Processing Apache Spark Amazon Kinesis Analytics HDInsight (Storm, Spark) Processing Apache Spark Amazon Kinesis Analytics HDInsight (Storm, Spark) Apache Beam Amazon EMR Spark Orchestration ApacheApache Airflow™ Beam AmazonAWS Data EMR Pipeline Spark Data Factory ETL Tool N/A AWS Glue Data Factory Machine Learning Scikit -learn™ Amazon Lex Azure ML Machine Learning Scikit -learn™ Amazon Lex Azure ML Tensor Flow™ Amazon Polly Cognitive Services Tensor Flow™ Amazon Polly Cognitive Services

Apache Spark MLLib Amazon Rekognition Apache Spark MLLib Amazon Copyright Rekognition © 2018, Sonra Intelligence Limited. TensorFlow , etc. Amazon Machine Learning TensorFlow , etc. Amazon Machine Learning Huge number of libraries Huge number of libraries

Serving Storage JanusGraph ® FigureAmazon 2. NeptuneBig Data OfferingsCosmosDB from Open ServingSource, Storage AWS andJanusGraph Microsoft® AzureAmazon Neptune CosmosDB Graph Graph MachineServing Learning Storage Apache Scikit Impala -learn™® + Apache AmazonAmazon Redshift Lex SQL DataAzure ML Warehouse Serving Storage ® + Apache Amazon Redshift SQL Data Warehouse BI/EDW Kudu ™ Amazon Athena Analysis Services (OLAP BI/EDW Kudu ™ Amazon Athena Analysis Services (OLAP Cubes) Cubes) www.thinxtream.com 1 0 Serving Storage Apache Solr™ Amazon CloudSearch Azure Search Serving Storage Apache Solr™ Amazon CloudSearch Azure Search Search (keywords Amazon Elasticsearch Search (keywords Amazon Elasticsearch + facets) + facets) Serving Storage TensorPostgreSQL Flow ™ Amazon PollyRDS Cognitive Services SQL DB Serving Storage PostgreSQL Amazon RDS SQL DB RDBMS RDBMS Serving Storage Apache HBase ® Amazon DynamoDB HDInsight (HBase) Serving Storage Apache HBase ® Amazon DynamoDB HDInsight (HBase) NoSQL CosmosDB NoSQL CosmosDB Sandboxes Apache Zeppelin ™ Amazon EMR Zeppelin Azure Notebooks Sandboxes Apache Zeppelin ™ Amazon EMR Zeppelin Azure Notebooks Notebook Notebook Clients/Data Apps Spark MLLib (BI) AmazonAmazon Rekognition Quicksight PowerBI Clients/Data Apps Superset (BI) Amazon Quicksight PowerBI

Orchestration ™ AWS Data Pipeline Data Factory Orchestration Apache Airflow ™ AWS Data Pipeline Data Factory

ETL Tool N/A AWS Glue Data Factory ETL Tool N/A AWS Glue Data Factory TensorFlow , etc. Amazon Machine Learning Huge number of libraries

Serving Storage JanusGraph ® Amazon Neptune CosmosDB Graph Serving Storage Apache Impala ® + Apache Amazon Redshift SQL Data Warehouse BI/EDW Kudu ™ Amazon Athena Analysis Services (OLAP Cubes) Serving Storage Apache Solr ™ Amazon CloudSearch Azure Search Search (keywords Amazon Elasticsearch + facets) Serving Storage PostgreSQL Amazon RDS SQL DB RDBMS Serving Storage Apache HBase ® Amazon DynamoDB HDInsight (HBase) NoSQL CosmosDB Sandboxes Apache Zeppelin ™ Amazon EMR Zeppelin Azure Notebooks Notebook Clients/Data Apps Superset (BI) Amazon Quicksight PowerBI

Orchestration Apache Airflow ™ AWS Data Pipeline Data Factory

ETL Tool N/A AWS Glue Data Factory

WHITE PAPER | IOT AND BIG DATA

IoT Big Data Applications

Here are some real-life examples of IoT and Big Data working together in businesses.

One of the largest shipping companies in the world, UPS, captures 200 data points from each vehicle in an 80,000-strong fleet to monitor speed, miles per gallon, mileage, number of stops, and engine health. These help the company optimize usage of their fleet, decrease fuel consumption and control emissions. Barcelona has implemented smart parking meters that give residents real-time updates on their phone regarding availability of parking slots and enables them to pay through the mobile app. The John Deere Field Connect system monitors air and soil temperature, wind speed, humidity, solar radiation, rainfall and leaf wetness through environmental sensors. This helps farmers in making timely irrigation decisions. Disney uses RFID-based wearable bands to collect data on visitor movement throughout the park. Disney uses this information to streamline guest numbers at attractions, adequately staff rides and attractions, and regulate stocks at busy shops and restaurants.

www.thinxtream.com 1 1 WHITE PAPER | IOT AND BIG DATA

Jewelry store chain Alex and Ani use Bluetooth® sensors in their stores. This setup can track customer traffic in their stores and push specialized offers and messages to users' phones as they enter the store. Food Company King's Hawaiian uses data collected from connected machines in their bread production factories to monitor factory performance, reduce potential downtime of machines and lower maintenance costs. BC Hydro allows its users to track their energy use by the hour and see trends in their own usage data. With this, electricity theft has been greatly reduced and outages automatically alert the company when the power is out in a certain area.

www.thinxtream.com 1 2 WHITE PAPER | IOT AND BIG DATA

Conclusion

Data originating from an IoT solution poses unique challenges to be considered when choosing or building an IoT Analytics solution. An IoT Big Data ecosystem, starts from high-speed ingestion and covers analytics and reporting. Popular commercial platforms from Azure and AWS as well as open source alternatives are available for each of the building blocks of an IoT Big Data ecosystem. IoT and Big Data have brought significant business benefits for the early adopters.

As an IoT services provider, Thinxtream has expertise in Big Data for IoT solutions across Azure IoT Hub, AWS IoT and Open Source platforms. By leveraging the IoT expertise built over a decade, Thinxtream ensures cost-effective, quality and timely delivery of IoT solutions.

www.thinxtream.com 1 3 WHITE PAPER | IOT AND BIG DATA

References

IoT Big Data Framework Architecture Version 1.0 https://www.gsma.com/iot/wp-content/uploads/2016/11/CLP.25-v1.0.pdf

Big Data Architecture https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

Big Data Architecture Style https://github.com/mspnp/architecture-center/blob/master/docs/guide/architecture- styles/big-data.md

10 examples of IoT and Big Data Working together https://www.zdnet.com/article/ten-examples-of-iot-and-big-data-working-well-together/

www.thinxtream.com 1 4 Thinxtream Technologies is a global software company with a portfolio of innovative software platforms, components, solutions, patents, competences and services for Internet of Things (IoT) across several industry verticals and applications, successfully enabling leading customers, including Fortune 500 companies, meet their application, product and business goals.

Interested in learning more? For more information contact: Thinxtream Technologies Pte. Ltd. Thinxtream Technologies, Inc. 220 Orchard Road #05-01 10260 SW Greenburg Road Midpoint Orchard Suite 400 Portland, OR 97223 SINGAPORE 238852 U.S.A. Phone: +65 66358625 Phone: +1 971 230-0729 Email: [email protected] Email: [email protected]

www.thinxtream.com LinkedIn/thinxtream

Copyright© 2018-2020, Thinxtream Technologies Pte. Ltd. All Rights Reserved. The information in this publication supersedes that in all previously published material. For the most up-to-date information, please visit our website at www.thinxtream.com.

Thinxtream is a registered trademark of Thinxtream Technologies Pte. Ltd. Amazon, Amazon Athena, Amazon Kinesis, Amazon Redshift, Amazon SageMaker, AWS, AWS Glue are registered trademarks of Amazon.com, Inc. Microsoft, Azure are registered trademarks of Microsoft Corp. Cloud, TensorFlow are trademarks of Google, Inc. Apache Flink, Apache Flume, Apache HBase, Apache Impala, Apache Kafka are registered trademarks and Apache Airflow, Apache Hive, , Apache Solr, Apache Spark, Apache Sqoop, Apache Storm, Apache Zeppelin, Hadoop MapReduce, HDFS are trademarks of Apache Software Foundation. JanusGraph is a registered trademark of the The Linux Foundation. StreamSets is a trademark of StreamSets, Inc. All other trademarks are the property of their respective owners.

All prices, specifications and characteristics set forth in this publication are subject to change without notice. TT-WP-006-3-1220