Projects – Other Than Hadoop!
Created By: Samarjit Mahapatra ([email protected])
Mostly compatible with Hadoop/HDFS

Apache Drill - provides low-latency ad-hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
Apache Hama - a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.
Akka - a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
ML-Hadoop - Hadoop implementations of machine learning algorithms.
Shark - a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
Apache Crunch - a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines composed of many user-defined functions simple to write, easy to test, and efficient to run.
Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
Apache Mesos - a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
Druid - open-source infrastructure for real-time exploratory analytics on large datasets. The system uses an always-on, distributed, shared-nothing architecture designed for real-time querying and data ingestion. It leverages column orientation and advanced indexing structures to allow cost-effective, arbitrary exploration of multi-billion-row tables with sub-second latencies.
Apache MRUnit - a Java library that helps developers unit test Apache Hadoop MapReduce jobs (a minimal test sketch appears at the end of this section).
hiho - Hadoop data integration with various databases, FTP servers, and Salesforce; incrementally update, dedup, append, and merge your data on Hadoop.
white-elephant - a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users.
Tachyon - a fault-tolerant distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.
HIPI - a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.
Cassovary - a simple big graph processing library for the JVM.
Apache Helix - a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.
Summingbird - streaming MapReduce with Scalding and Storm.
MongoDB - an open-source document database and the leading NoSQL database, written in C++.
Katta - a scalable, failure-tolerant, distributed data store for real-time access.
Kiji - for building real-time Big Data applications on Apache HBase.
MLBase - a platform addressing the needs of ML developers and end users, consisting of three components: MLlib, MLI, and ML Optimizer.
cloud9 - a collection of Hadoop tools that tries to make working with big data a bit easier.
elasticsearch - a flexible and powerful open-source, distributed, real-time search and analytics engine for the cloud.
Apache Curator - a set of Java libraries that make using Apache ZooKeeper much easier.
Parquet - a columnar storage format for Hadoop.
OpenTSDB - a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
Giraph - an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper.
CouchDB - a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for its API.
DataFu - a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
Norbert - a cluster manager and networking layer built on top of ZooKeeper.
Apache Samza - a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Apache Kafka - publish-subscribe messaging rethought as a distributed commit log (a minimal producer sketch appears at the end of this section).
Apache Whirr - a set of libraries for running cloud services.
HUE - a File Browser for HDFS, a Job Browser for MapReduce/YARN, an HBase Browser, and query editors for Hive, Pig, Cloudera Impala and Sqoop2.
Nagios - offers complete monitoring and alerting for servers, switches, applications, and services.
Ganglia - a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
Apache Thrift - a software framework for scalable cross-language services development; it combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages.
Prediction.io - an open-source machine learning server for software developers to create predictive features such as personalization, recommendation and content discovery.
CloudMapReduce - a MapReduce implementation on the Amazon cloud OS.
Titan - a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
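For the Apache MRUnit entry above, here is a minimal test sketch, assuming MRUnit 1.x (its new-API MapDriver) and JUnit 4 on the classpath. It exercises TokenCounterMapper, a word-count mapper that ships with Hadoop; the test class and method names are only examples.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenCounterMapperTest {
        @Test
        public void emitsOneCountPerToken() throws Exception {
            // TokenCounterMapper emits (token, 1) for every token in an input line.
            MapDriver<Object, Text, Text, IntWritable> driver =
                    MapDriver.newMapDriver(new TokenCounterMapper());
            driver.withInput(new LongWritable(0L), new Text("big data"))
                  .withOutput(new Text("big"), new IntWritable(1))
                  .withOutput(new Text("data"), new IntWritable(1))
                  .runTest();   // runs the mapper in-memory and checks the expected output
        }
    }

The point of MRUnit is that no cluster or HDFS instance is needed: the driver feeds the input directly to the mapper and compares its output against the declared expectations.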
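Likewise, for the Apache Kafka entry above, a minimal producer sketch, assuming the newer Kafka Java client (0.9 or later); the broker address, the "metrics" topic and the record contents are placeholders chosen for this example, not anything Kafka prescribes.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MetricsProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // try-with-resources closes the producer and flushes pending records on exit
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // each send() appends a keyed record to the topic's partitioned commit log
                producer.send(new ProducerRecord<>("metrics", "host1", "cpu=0.42"));
            }
        }
    }

The "distributed commit log" description above is visible here: a producer only ever appends records to a topic, and consumers read that log at their own pace.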
Hadoop Alternatives

Apache Spark - an open-source cluster computing system that aims to make data analytics fast: both fast to run and fast to write (a minimal word-count sketch appears at the end of this section).
GraphLab - a distributed machine-learning framework featuring a redesigned, fully distributed API, HDFS integration, and a wide range of new machine learning toolkits.
HPCC Systems (High Performance Computing Cluster) - a massively parallel-processing computing platform that solves Big Data problems.
Dryad - investigates programming models for writing parallel and distributed programs that scale from a small cluster to a large data center.
Stratosphere - above the cloud.
Storm - a free and open-source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use! (A minimal topology sketch appears at the end of this section.)
R3 - a MapReduce engine written in Python using a Redis backend.
Disco - a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.
Phoenix - a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.
Plasma - PlasmaFS is a distributed filesystem for large files, implemented in user space. Plasma Map/Reduce runs the famous algorithm scheme for mapping and rearranging large files. Plasma KV is a key/value database on top of PlasmaFS.
Peregrine - a MapReduce framework designed for running iterative jobs across partitions of data.
httpmr - a scalable data processing framework for people with web clusters.
Sector/Sphere - Sector is a high-performance, scalable, and secure distributed file system; Sphere is a high-performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces.
Filemap - a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files.
misco - a distributed computing framework designed for mobile devices.
MR-MPI - a library providing an open-source implementation of MapReduce, written for distributed-memory parallel machines on top of standard MPI message passing.
GridGain - in-memory computing.
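To make the Apache Spark entry above concrete, here is a minimal word-count sketch against Spark's Java API, assuming Spark 2.x and Java 8; the input and output paths are taken from the command line, and local[*] is set only so the sketch runs without a cluster.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);                 // input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);                               // output path

            sc.stop();
        }
    }

The "fast to write" claim is the visible part here: the whole pipeline is a handful of transformations on an RDD rather than hand-written mapper and reducer classes.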
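For the Storm entry above, a minimal topology sketch, assuming a pre-1.0 Storm release (the backtype.storm packages; later releases moved to org.apache.storm). TestWordSpout ships with Storm and emits random words on a field named "word"; WordPrintTopology and PrinterBolt are illustrative names.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.utils.Utils;

    public class WordPrintTopology {

        // A trivial bolt that prints every word it receives and emits nothing downstream.
        public static class PrinterBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple.getStringByField("word"));
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // no output fields declared
            }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 1);
            builder.setBolt("printer", new PrinterBolt(), 2).shuffleGrouping("words");

            LocalCluster cluster = new LocalCluster();       // in-process cluster for demos
            cluster.submitTopology("word-print", new Config(), builder.createTopology());
            Utils.sleep(10000);                              // let the unbounded stream run briefly
            cluster.shutdown();
        }
    }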
MapReduce Alternatives

Octopy - a fast-n-easy MapReduce implementation for Python.
Cascalog - a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
Cascading - an application framework for Java developers to simply