Implementation of Hadoop -2

Kiranmayi Ganti

Performance and Scalability

Data Ingest Using Streaming Writes

• MapR uses Spark Streaming to ingest real-time data.
• Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams.
• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or plain TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, and join.
• Finally, processed data can be pushed out to file systems, databases, and live dashboards. You can also apply Spark’s built-in machine learning and graph processing algorithms to data streams.

SPARK API

HBase Performance

• HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java.
• It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
• That is, it provides a fault-tolerant way of storing large quantities of sparse data.
• Latency
• Consistent low latency

Dependability

Manageability

Ambari

• The Apache Ambari project aims to make Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.
• Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Volume Support-I

• The MapR File System uses volumes as a unique management entity.
• A volume is a logical unit that you create to apply policies to a set of files, directories, tables, and sub-volumes.
• You can create volumes for each user, department, or project.
• Mirror volumes and volume snapshots provide data recovery and data protection functionality.
• Volumes can enforce disk usage limits, set replication levels, establish ownership, control permissible actions, and measure the cost generated by different projects or departments.

Volume Support-II

• When you set policies on a volume, all files contained within the volume inherit those policies.
• Other Hadoop distributions require administrators to manage policies at the file level.
• You can manage volume permissions through Access Control Lists (ACLs) in the MapR Control System or from the command line.
• You can also set read, write, and execute permissions on a file or directory for users and groups, either with standard UNIX commands when the volume has been mounted through NFS, or with standard hadoop fs commands.

Data & Job Placement Control

• MapR data placement control lets you specify where you want your data to be placed.
• It can be used to direct application data to specific locations.
• It can also be used to run specific workloads on specialized hardware in the cluster.

Data Access

Random Read-Write NFS Access

• Random Read-Write NFS capability on Hadoop (provided by MapR) allows any machine with NFS client software to “mount” the Hadoop cluster into the local file system tree.
• NFS client machines can be running Linux, Unix, Macintosh, Windows, or any other operating system that supports NFS version 3.
• Mounting the cluster via NFS lets all existing applications read and write data directly from/to the Hadoop distributed file system. This is completely transparent to the application: it’s just POSIX I/O!
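To make the "just POSIX I/O" point concrete, here is a minimal sketch of an in-place update using nothing but ordinary file calls. A temporary directory stands in for a directory under the NFS mount point (a path such as /mapr/<cluster-name>/... is an assumption, not something the slides specify), and the file name and record layout are hypothetical.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, payload: bytes) -> None:
    """Overwrite bytes at an arbitrary offset using ordinary file I/O."""
    # "r+b" opens the existing file for update without truncating it.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)


# Assumption: in a real deployment mount_dir would be a directory under the
# NFS mount of the cluster; a temporary directory is used here as a stand-in.
with tempfile.TemporaryDirectory() as mount_dir:
    path = os.path.join(mount_dir, "events.log")  # hypothetical file name

    # Ordinary sequential write, as any existing application would do.
    with open(path, "wb") as f:
        f.write(b"status=PENDING;id=42\n")

    # Random write: replace the 7-byte status field without rewriting the file.
    overwrite_in_place(path, offset=7, payload=b"DONE   ")

    with open(path, "rb") as f:
        result = f.read()

print(result)
```

The application code contains no Hadoop-specific calls; only the location of the file would change when it runs against a mounted cluster.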

• With the cluster mounted via NFS, all applications have complete random and concurrent read/write access, meaning Hadoop is no longer write-once or append-only.
• Inside the Hadoop cluster, the data is immediately available for processing while it’s streaming in.

Security ACL

• An Access Control List (ACL) is a list of users or groups.
• Each user or group in the list is paired with a defined set of permissions that limit the actions that the user or group can perform on the object secured by the ACL.
• In MapR, the objects secured by ACLs are the job queue, volumes, and the cluster itself.
• A job queue ACL controls who can submit jobs to a queue, kill jobs, or modify their priority.
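The idea of pairing each user or group with a set of permitted actions can be sketched as a small lookup. This is an illustrative model only, not MapR's implementation or API; the class name, the principal strings, and the action names (submit, kill, set_priority) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Acl:
    """Maps a principal ("user:<name>" or "group:<name>") to its allowed actions."""
    entries: Dict[str, Set[str]] = field(default_factory=dict)

    def allows(self, user: str, groups: List[str], action: str) -> bool:
        # A request is allowed if the user, or any group the user belongs to,
        # is paired with the requested action.
        principals = ["user:" + user] + ["group:" + g for g in groups]
        return any(action in self.entries.get(p, set()) for p in principals)


# Hypothetical job queue ACL: alice administers the queue, analysts may only submit.
queue_acl = Acl({
    "user:alice": {"submit", "kill", "set_priority"},
    "group:analysts": {"submit"},
})

alice_can_kill = queue_acl.allows("alice", [], "kill")
bob_can_submit = queue_acl.allows("bob", ["analysts"], "submit")
bob_can_kill = queue_acl.allows("bob", ["analysts"], "kill")

print(alice_can_kill, bob_can_submit, bob_can_kill)
```

Anything not listed is denied by default, which matches the description of an ACL as limiting the actions a user or group can perform.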

• A volume-level ACL controls which users and groups have access to that volume, and what actions they may perform, such as mirroring the volume, altering the volume properties, dumping or backing up the volume, or deleting the volume.
• An Access Control Expression (ACE) is a combination of user, group, and role definitions.
• A role is a property of a user or group that defines a set of behaviors that the user or group performs regularly.
• You can use roles to implement your own custom authorization rules.
• ACEs are used to secure MapR-DB tables that use native storage.

Wire Level Authentication

• Kerberos is a primary way companies authenticate Hadoop users, but that can raise issues: “Working with or integrating Kerberos is difficult for many companies.”

MapR Wire Level Authentication

• In client-server communication:
  - The user runs the maprlogin command and enters a username/password (or another authenticator).
  - maprlogin obtains a user key from the cluster over HTTPS.
  - The Hadoop client automatically uses the user key to secure all RPCs.
  - All cluster operations and access are secured with the user key.
• In server-server communication:

  - All nodes in the cluster have a cluster key.
  - All RPCs are secured with the cluster key.
• For fine-grained access control:
  - Full POSIX permissions on files and directories
  - ACLs on tables, column families, and columns
  - ACLs on MapReduce jobs and queues
  - Administration ACLs on cluster and volumes
  - ACLs for , and Impala

Hortonworks Data Platform

Cloudera Data Platform

MapR Data Platform

Key Features of Hadoop Distributions

Conclusion

Based on the following factors, we can decide on the distribution:
• Proprietary file system
• Distributions are not actually free
• Mutable Keys

References

• https://www.mapr.com/products/product-overview/apache-spark-streaming
• https://spark.apache.org/docs/0.9.1/streaming-programming-guide.html
• http://en.wikipedia.org/wiki/Apache_HBase
• https://www.mapr.com/blog/data-processing-vocabulary-101-key-terms-you-need-know#.VS6Q3pMgjMs
• https://ambari.apache.org/
• http://doc.mapr.com/display/MapR/MapR+Overview

• https://www.mapr.com/products/m5-features
• https://www.mapr.com/developercentral/code/immediate-mapreduce-continuously-ingested-data#.VSWGB7HX7Tc
• http://doc.mapr.com/display/MapR/Security+Architecture#SecurityArchitecture-AuthorizationArchitecture:ACLsandACEs
• http://www.idevnews.com/stories/6008
• http://www.fiercebigdata.com/story/mapr-integrates-security-hadoop/2013-10-24

• http://hortonworks.com
• http://www.cloudera.com/content/cloudera/en/home.html/
• https://www.mapr.com/
• http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html
• http://data-magnum.com/cloudera-vs-hortonworks-vs-mapr-has-mapr-already-won-this-contest/