
Introduction into Big Data analytics
Lecture 2 – Big data platforms
Janusz Szwabiński

Outlook of today’s talk
● Available Big Data Sets
● Project suggestions
● Big data platforms

Available Big Data Sets
● Pointers to data sets
  – How To Get Experience Working With Large Datasets
  – Quora
  – KDNuggets
  – Datasets for Data Mining and Data Science
  – Research Pipeline
  – Google Public Data Directory
  – StackExchange Data Explorer
  – Kaggle

Available Big Data Sets
● Generic repositories
  – AWS Public Datasets
  – Comprehensive Knowledge Archive Network
  – Stanford Large Network Dataset Collection
  – Open Flights
  – ASA Flight data
  – Wikipedia

Available Big Data Sets
● Geo data
  – OpenStreetMap
  – Natural Earth Data
  – GeoNames
  – Libre Map Project
  – Landsat

Available Big Data Sets
● Web data
  – Google Books n-gram
  – Public Terabyte Dataset Project
  – Common Crawl
  – Freebase Data
  – StackOverflow
  – UCI KDD Data

Available Big Data Sets
● Government data
  – European Parliament proceedings
  – US government data
  – UK government data
  – US Patent and Trademark Office
  – World Bank data
  – Public health data sets
  – Aid information
  – UN data
  – Polish Statistical Office

Suggestions for projects
● Trend prediction in fashion
● Quote search engine
● Real-time analysis of Twitter’s public stream with Storm
● Correlating price/volume of low volume stocks with social media
  – search information related to future price and volume movements
  – find indicators to predict abnormal price or volume changes

Suggestions for projects
● Stock signal generation using real time Twitter analysis
  – develop a scoring mechanism that summarizes Twitter news
  – generate a real-time signal that could be used to make trading decisions
● Music recommendation system with geospatial information
  – MMTD - Million Musical Tweets Dataset
● Answer classifier based on StackOverflow data

Suggestions for projects
● How to name your new-born baby?
  – prediction of trends in baby names around the world
● Impact of popular culture on baby names
● Error correction in OCR datasets
● Movie exploration/recommendation system
● Best transport choice
● Fake review detection
● Food identification in photos
  – see e.g. https://www.yelp.com/dataset/challenge
● Oscar/Golden Globe award analysis

Suggestions for projects
● Interesting ideas for trendy writers
● Image-based geolocalization
● Animal identification in photos
● Plant identification in photos
● Currency trend analyzer
  – data source: http://www.histdata.com/

Big data platforms
● one-stop solution for Big Data needs
  – an integrated IT solution for developing, deploying and managing Big Data
  – combines several software systems, tools and hardware to provide an easy-to-use system for enterprises
● important features:
  – able to accommodate new tools based on business requirements
  – supports linear scale-out
  – has the capability for rapid deployment
  – supports a variety of data formats
  – provides data analysis and reporting tools
  – provides real-time data analysis software
  – has tools for searching through large data sets

Hadoop
● http://hadoop.apache.org/
● an open-source software framework for storing data and running applications on clusters of commodity hardware
● why is it so important?
  – ability to store and process huge amounts of any kind of data, quickly
  – computing power - Hadoop's distributed computing model processes big data fast
    ● the more computing nodes you use, the more processing power you have
  – fault tolerance - data and application processing are protected against hardware failure
    ● if a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail
    ● multiple copies of all data are stored automatically
  – flexibility - unlike traditional relational databases, you don’t have to preprocess data before storing it
  – low cost - the open-source framework is free and uses commodity hardware to store large quantities of data
  – scalability - you can easily grow your system to handle more data simply by adding nodes, with little administration effort

Hadoop
Source: https://www.sas.com/en_us/insights/big-data/hadoop.html

Hadoop
● challenges:
  – MapReduce programming is not a good match for all problems
    ● good for simple information requests and problems that can be divided into independent units
    ● not efficient for iterative and interactive analytic tasks
  – a widely acknowledged talent gap - it can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce
    ● distribution providers are racing to put relational (SQL) technology on top of Hadoop
    ● Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings
  – data security issues
    ● the Kerberos authentication protocol is a great step toward making Hadoop environments secure
  – lack of tools for data quality and standardization

Hadoop
● important application domains:
  – Digital marketing optimization
  – Data exploration and discovery (product and sales data for online shopping portals and stores)
  – Fraud detection and prevention
  – Social networks and the relationships within them
  – Fraud detection in banking
  – Fraud detection for the telecom industry
  – Data retention (for retaining long-term data and for archiving purposes)
  – Insurance
  – Healthcare

Hadoop-based commercial platforms
● Cloudera
● Amazon EMR
● Hortonworks
● MapR
● IBM Open Platform
● Microsoft HDInsight
● Intel Distribution for Apache Hadoop
● DataStax Enterprise Analytics
● Teradata’s Hadoop for Enterprise
● Pivotal HD

Cloudera
● https://www.cloudera.com/
● one of the first commercial Hadoop-based platforms
● interesting (and free) download:
  – QuickStarts for CDH 5.12
    ● https://www.cloudera.com/downloads/quickstart_vms/5-12.html
    ● virtualized clusters for easy installation on your desktop
    ● single-node cluster that makes it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes
    ● includes Cloudera Manager for managing the cluster
    ● tutorial, sample data, and scripts for getting started included
    ● deployed via Docker containers or VMs

Amazon EMR
● https://aws.amazon.com/emr/
● a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances
● other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink can also be run
● interaction with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB
● secure and reliable handling of a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics
● simple and predictable pricing:
  – per-second rate for every second used, with a one-minute minimum charge
  – a 10-node Hadoop cluster: $0.15 per hour

Amazon EMR
● good to know: AWS Free Tier (12 Month Introductory Period)
  – https://aws.amazon.com/free/

Hortonworks
● https://hortonworks.com/
● a leading innovator in the data industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications
● 100% open-source software without any proprietary software
● the Hortonworks Hadoop distribution is enterprise-ready, with the following features:
  – centralized management and configuration of clusters
  – built-in security and data governance
  – centralized security administration
● Hortonworks Sandbox
  – a virtual machine with Hadoop preconfigured
  – a set of hands-on tutorials to get you started with Hadoop
  – an environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase

MapR
● https://mapr.com/
● MapR provides access to a variety of data sources from a single computer cluster, including:
  – big data workloads such as Apache Hadoop and Apache Spark
  – a distributed file system
  – a multi-model database management system
  – event stream processing, combining analytics in real-time with operational applications
● its technology runs on both commodity hardware and public cloud computing services

IBM Open Platform
● https://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-Hadoop
● native support for rolling upgrades for Hadoop services
● support for long-running applications within YARN for enhanced reliability & security
● heterogeneous storage in HDFS: in-memory and SSD in addition to HDD
● Spark in-memory distributed compute engine
● Java, Python & Scala languages
● Apache Hadoop projects included: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider
● free IOP Quick Start Edition for non-production software: https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/qse_main.html

Microsoft HDInsight
● https://azure.microsoft.com/en-in/services/hdinsight/
● a fully-managed cloud service for easy, fast, and cost-effective processing of massive amounts of data
● uses popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R & more
● enables a broad range of scenarios such as ETL, Data Warehousing, Machine Learning, IoT and more

Intel Distribution for Apache Hadoop
● https://www.intel.com/content/www/us/en/software/intel-distribution-for-apache-hadoop-software-solutions.html
● a distribution of Hadoop with Intel’s GraphBuilder and Analytics toolkit
● 90-day trial version

DataStax Enterprise Analytics
● https://www.datastax.com/
● Big Data analytics platform
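
MapReduce: a word-count sketch
The Hadoop slides note that MapReduce works well for problems that can be divided into independent units. The sketch below illustrates that idea with the classic word count, written in plain single-process Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical and chosen only for illustration. On a real cluster the map and reduce calls would run in parallel on different nodes, and the framework would perform the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line.

    Each line is processed independently of all others, which is what
    lets a framework distribute the map step across many nodes.
    """
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, two "records".
    sample = ["big data platforms", "big data sets"]
    print(word_count(sample))
```

Because every call to map_phase and reduce_phase depends only on its own arguments, the work splits cleanly into independent units. Iterative algorithms, which must repeat the whole map-shuffle-reduce cycle many times, fit this pattern less well, as the challenges slide points out.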