Big Data Platforms

Introduction into Big Data analytics
Lecture 2 – Big data platforms
Janusz Szwabiński

Outlook of today’s talk
● Available Big Data Sets
● Project suggestions
● Big data platforms

Available Big Data Sets
● Pointers to data sets
  – How To Get Experience Working With Large Datasets
  – Quora
  – KDNuggets – Datasets for Data Mining and Data Science
  – Research Pipeline
  – Google Public Data Directory
  – StackExchange Data Explorer
  – Kaggle

Available Big Data Sets
● Generic repositories
  – AWS Public Datasets
  – Comprehensive Knowledge Archive Network
  – Stanford Large Network Dataset Collection
  – Open Flights
  – ASA Flight data
  – Wikipedia

Available Big Data Sets
● Geo data
  – OpenStreetMap
  – Natural Earth Data
  – GeoNames
  – Libre Map Project
  – Landsat

Available Big Data Sets
● Web data
  – Google Books n-gram
  – Public Terabyte Dataset Project
  – Common Crawl
  – Freebase Data
  – StackOverflow
  – UCI KDD Data

Available Big Data Sets
● Government data
  – European Parliament proceedings
  – US government data
  – UK government data
  – US Patent and Trademark Office
  – World Bank data
  – Public health data sets
  – Aid information
  – UN data
  – Polish Statistical Office

Suggestions for projects
● Trend prediction in fashion
● Quote search engine
● Real-time analysis of Twitter’s public stream with Storm
● Correlating price/volume of low volume stocks with social media
  – search information related to future price and volume movements
  – find indicators to predict abnormal price or volume changes

Suggestions for projects
● Stock signal generation using real time Twitter analysis
  – develop a scoring mechanism that summarizes Twitter news
  – generate a real-time signal that could be used to make trading decisions
● Music recommendation system with geospatial information
  – MMTD - Million Musical Tweets Dataset
● Answer classifier based on StackOverflow data

Suggestions for projects
● How to name your new-born baby?
  – prediction of trends in baby names around the world
● Impact of popular culture on baby names
● Error correction in OCR datasets
● Movie exploration/recommendation system
● Best transport choice
● Fake reviews detection
● Food identification in photos
  – see e.g. https://www.yelp.com/dataset/challenge
● Oscar/Golden Globe award analysis

Suggestions for projects
● Interesting ideas for trendy writers
● Image-based geolocalization
● Animal identification in photos
● Plant identification in photos
● Currency trend analyzer
  – data source: http://www.histdata.com/

Big data platforms
● one-stop solution for Big Data needs
  – integrated IT solution for developing, deploying and managing Big Data
  – combines several software systems, tools and hardware to provide an easy-to-use system for enterprises
● important features:
  – able to accommodate new tools based on business requirements
  – supports linear scale-out
  – has capability for rapid deployment
  – supports a variety of data formats
  – provides data analysis and reporting tools
  – provides real-time data analysis software
  – has tools for searching through large data sets

Hadoop
● http://hadoop.apache.org/
● an open-source software framework for storing data and running applications on clusters of commodity hardware
● why is it so important?
  – ability to store and process huge amounts of any kind of data, quickly
  – computing power - Hadoop's distributed computing model processes big data fast
    ● the more computing nodes you use, the more processing power you have
  – fault tolerance - data and application processing are protected against hardware failure
    ● if a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail
    ● multiple copies of all data are stored automatically
  – flexibility - unlike traditional relational databases, you don’t have to preprocess data before storing it
  – low cost - the open-source framework is free and uses commodity hardware to store large quantities of data
  – scalability - you can easily grow your system to handle more data simply by adding nodes, with little administration effort

Hadoop
Source: https://www.sas.com/en_us/insights/big-data/hadoop.html

Hadoop
● challenges:
  – MapReduce programming is not a good match for all problems
    ● good for simple information requests and problems that can be divided into independent units
    ● not efficient for iterative and interactive analytic tasks
  – a widely acknowledged talent gap - it can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce
    ● distribution providers are racing to put relational (SQL) technology on top of Hadoop
    ● Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings
  – data security issues
    ● the Kerberos authentication protocol is a great step toward making Hadoop environments secure
  – lacking tools for data quality and standardization

Hadoop
● important application domains:
  – Digital Marketing Optimization
  – Data exploration and discovery (product and sales data for online shopping portals and stores)
  – Fraud detection and prevention
  – Social network and relationship analysis
  – Fraud detection in banking
  – Fraud detection for the telecom industry
  – Data retention (for retaining long-term data and for archiving purposes)
  – Insurance
  – Healthcare

Hadoop based commercial platforms
● Cloudera
● Amazon EMR
● Hortonworks
● MapR
● IBM Open Platform
● Microsoft HDInsight
● Intel Distribution for Apache Hadoop
● DataStax Enterprise Analytics
● Teradata’s Hadoop for Enterprise
● Pivotal HD

Cloudera
● https://www.cloudera.com/
● one of the first commercial Hadoop based platforms
● interesting (and free) download:
  – QuickStarts for CDH 5.12
    ● https://www.cloudera.com/downloads/quickstart_vms/5-12.html
    ● virtualized clusters for easy installation on your desktop
    ● single-node cluster that makes it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes
    ● includes Cloudera Manager for managing the cluster
    ● tutorial, sample data, and scripts for getting started included
    ● deployed via Docker containers or VMs

Amazon EMR
● https://aws.amazon.com/emr/
● a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances
● can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink
● interacts with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB
● secure and reliable handling of a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics
● simple and predictable pricing:
  – per-second rate for every second used, with a one-minute minimum charge
  – a 10-node Hadoop cluster: $0.15 per hour

Amazon EMR
● good to know: AWS Free Tier (12 Month Introductory Period)
  – https://aws.amazon.com/free/

Hortonworks
● https://hortonworks.com/
● a leading innovator in the data industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications
● 100% open-source software without any proprietary software
● the Hortonworks Hadoop distribution is enterprise-ready, with the following features:
  – centralized management and configuration of clusters
  – built-in security and data governance
  – centralized security administration
● Hortonworks Sandbox
  – a virtual machine with Hadoop preconfigured
  – a set of hands-on tutorials to get you started with Hadoop
  – an environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase

MapR
● https://mapr.com/
● MapR provides access to a variety of data sources from a single computer cluster, including:
  – big data workloads such as Apache Hadoop and Apache Spark
  – a distributed file system
  – a multi-model database management system
  – event stream processing, combining analytics in real-time with operational applications
● its technology runs on both commodity hardware and public cloud computing services

IBM Open Platform
● https://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-Hadoop
● native support for rolling upgrades for Hadoop services
● support for long-running applications within YARN for enhanced reliability & security
● heterogeneous storage in HDFS: in-memory and SSD in addition to HDD
● Spark in-memory distributed compute engine
● Java, Python & Scala languages
● Apache Hadoop projects included: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, OpenJDK, Knox, Slider
● free IOP Quick Start Edition for non-production use: https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/qse_main.html

Microsoft HDInsight
● https://azure.microsoft.com/en-in/services/hdinsight/
● a fully-managed cloud service for easy, fast, and cost-effective processing of massive amounts of data
● uses popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R & more
● enables a broad range of scenarios such as ETL, Data Warehousing, Machine Learning, IoT and more

Intel Distribution for Apache Hadoop
● https://www.intel.com/content/www/us/en/software/intel-distribution-for-apache-hadoop-software-solutions.html
● a distribution of Hadoop with Intel’s GraphBuilder and Analytics toolkit
● 90-day trial version

DataStax Enterprise Analytics
● https://www.datastax.com/
● Big Data analytics platform
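
Sketch: the MapReduce model
The Hadoop challenges slide describes MapReduce as a good fit for problems that can be divided into independent units. The following is a minimal single-process sketch of that idea in Python, using word counting as the example. It is not Hadoop's API: the names map_phase, shuffle_phase, reduce_phase and word_count are hypothetical, and on a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    # Each document is mapped independently, which is what makes the
    # map step easy to distribute across nodes.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data platforms", "big data sets", "data analysis"]
    print(word_count(docs))
```

Because every record is mapped without reference to any other record, the work splits cleanly across machines. An iterative algorithm would have to repeat the whole map, shuffle and reduce cycle on each pass, which is one reason the slide calls MapReduce inefficient for iterative tasks.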

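Sketch: per-second billing with a one-minute minimum
The pricing rule on the Amazon EMR slide (a per-second rate with a one-minute minimum charge) can be written out as simple arithmetic. This is only an illustration: the function name emr_cluster_cost and its parameters are hypothetical, and the $0.15 hourly figure is the slide's example for a 10-node cluster, treated here as a flat rate for the whole cluster, which is an assumption. Actual charges depend on instance types and region.

```python
def emr_cluster_cost(runtime_seconds, hourly_rate=0.15, minimum_seconds=60):
    """Return the cost in dollars under per-second billing with a minimum charge.

    hourly_rate is assumed to be a flat rate for the whole cluster
    (the slide's example figure), not an official price list entry.
    """
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must not be negative")
    # Runs shorter than the minimum are billed as if they lasted the minimum.
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return hourly_rate * billed_seconds / 3600


if __name__ == "__main__":
    for seconds in (10, 60, 90, 3600):
        print(f"{seconds:>5} s -> ${emr_cluster_cost(seconds):.4f}")
```

Under these assumptions a 10-second job costs the same as a 60-second job, while anything longer is charged in proportion to its actual running time.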