International Journal of Computer Trends and Technology (IJCTT) – volume 16 number 3 – Oct 2014
ISSN: 2231-2803, http://www.ijcttjournal.org

Data Science: Bigtable, MapReduce and Google File System

Karan B. Maniar1, Chintan B. Khatri2
1,2Atharva College of Engineering, University of Mumbai, Mumbai, India

Abstract: Data science is the extraction of research findings and conclusions from data[1]. Bigtable is built on a few Google technologies[2]. MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster[3]. The Google File System is designed to provide efficient, reliable access to data using large clusters of commodity hardware[4]. This paper discusses Bigtable, MapReduce and the Google File System, along with a brief look at the top 10 algorithms in data mining.

Keywords: Data Science, Bigtable, MapReduce, Google File System, Top 10 algorithms in data mining.

1. INTRODUCTION

Bigtable is a distributed storage system for managing structured data that is designed to scale to an immense size. The Internet's development has resulted in the establishment and wide usage of many Internet-based applications[5]. Google started the development of Bigtable in 2003 to address the demands of those applications; applications that require a large storage volume use Bigtable[6].

MapReduce is the programming model used for processing and generating large data sets. It is inspired by the map and reduce functions that are ubiquitous in functional programming. It is easy to use, although MapReduce programs may not be fast all the time[7].

The Google File System (GFS) is a scalable distributed file system for large, data-intensive applications. Its two primary advantages are that the hardware on which it runs is low-priced and that it delivers high performance. GFS is a vital tool that empowers Google to sustain innovation and attack problems at the scale of the Internet[8].

2. BIGTABLE

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. For most commercial databases this scale is too large, and at such a scale the cost would be very high. Bigtable has a simple data model and supports dynamic control over data layout and format, and it is scalable and self-managing at the same time. Wide applicability, scalability, high performance, and high availability are some of the goals that Bigtable has achieved. Numerous Google products use Bigtable, e.g. Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. Bigtable maps two arbitrary string values, a row key and a column key, together with a timestamp, into an associated arbitrary byte array, which makes it three-dimensional[6]. The Bigtable API provides many functions: single-row transactions are supported, cells can be used as integer counters, and tables and column families can be created and deleted. A Bigtable is not a relational database. One of the dimensions of each table is a field for time, permitting versioning and garbage collection, so bottlenecks and inefficiencies can be eliminated as they emerge[5].
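The data model above can be illustrated with a small sketch. This is not Bigtable's real API; it is a hypothetical in-memory stand-in (the class and method names are invented) showing how a (row key, column key, timestamp) triple maps to an uninterpreted byte array, and how the time dimension enables versioning and garbage collection:

```python
class TinyBigtable:
    """Toy model of Bigtable's (row key, column key, timestamp) -> bytes map."""

    def __init__(self):
        # Sparse map: only cells that were actually written take up space.
        self.cells = {}  # {(row_key, column_key): {timestamp: value_bytes}}

    def put(self, row_key, column_key, timestamp, value):
        # Each cell keeps multiple timestamped versions (the third dimension).
        self.cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        versions = self.cells.get((row_key, column_key), {})
        if not versions:
            return None
        # Without an explicit timestamp, return the most recent version.
        ts = max(versions) if timestamp is None else timestamp
        return versions.get(ts)

    def garbage_collect(self, keep_latest=1):
        # Versioning permits garbage collection: drop all but the newest versions.
        for key, versions in self.cells.items():
            newest = sorted(versions, reverse=True)[:keep_latest]
            self.cells[key] = {ts: versions[ts] for ts in newest}


table = TinyBigtable()
table.put("com.example/index", "contents:", 1, b"<html>v1</html>")
table.put("com.example/index", "contents:", 2, b"<html>v2</html>")
print(table.get("com.example/index", "contents:"))  # b'<html>v2</html>'
```

The real Bigtable additionally groups column keys into column families and keeps rows sorted by row key, which this sketch omits.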
3. MAPREDUCE

The term MapReduce refers to the two distinguishable tasks that Hadoop programs perform: the map job and the reduce job. The map job takes a set of data, converts it into another set of data, and breaks individual elements down into tuples. The output from the map job is then taken as input, and those data tuples are combined into a smaller set of tuples. As the name suggests, the map job always precedes the reduce job[9]. There have been many improvements over earlier TeraSort records, such as a 30%+ faster shuffle and merge improvements, and MapReduce also enables task slots to be reused. MapReduce is also used for medical image analysis, because modern imaging techniques such as 3D and 4D produce large files that require a large amount of storage and network bandwidth. The word count application is a common example of MapReduce.
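The word count application mentioned above can be sketched in plain Python. This is a single-process illustration of the two phases, not an actual Hadoop job, and the function names are ours:

```python
from collections import defaultdict

def map_phase(document):
    # Map: break each input document down into (word, 1) tuples.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: take the map output as input and combine the tuples
    # into a smaller set, one (word, count) entry per distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
# On a real cluster, map tasks run in parallel over splits of the input.
intermediate = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(intermediate))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```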
4. GOOGLE FILE SYSTEM

The Google File System is closed-source software developed by Google. The codename of its new version is Colossus. Its interface is a familiar file system interface. Since there are hundreds of servers in a GFS cluster, a few of them could be unavailable at any given time. To keep the system functioning, two strategies are used: fast recovery and replication. In fast recovery, the master and the chunkservers restore their state and start immediately, irrespective of the way in which they were terminated. Replication is of two types: chunk replication and master replication. In chunk replication, every chunk is replicated on many chunkservers on different racks. In master replication, the master's state is replicated for dependability[10].

5. TOP 10 ALGORITHMS IN DATA MINING[11]

A. C4.5
C4.5 generates decision trees, which can be used for classification. It is also referred to as a statistical classifier.

B. The k-means algorithm
The k-means algorithm is used to partition a given set of data into k clusters. It is an iterative procedure, and the value of the number of clusters k is decided by the user.

C. Support vector machines
Support vector machines cope well with errors during execution. They are algorithms that perform two operations: analysing data and recognizing patterns. The two primary uses of support vector machines are classification and estimating relationships among variables.

D. The Apriori algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases.

E. The EM algorithm
EM stands for expectation–maximization, and it is an iterative algorithm. It is used to find maximum likelihood estimates in statistical models that depend on variables which are not directly observed.
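The iterative k-means procedure of item B above can be sketched as follows. This is a minimal one-dimensional illustration with a user-chosen k and a naive initialization, not a production implementation:

```python
def kmeans(points, k, iters=10):
    # Naive initialization: take the first k points as centroids.
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(sorted(centroids))  # roughly [1.0, 9.0]
```

The assignment and update steps repeat until the centroids stop moving; here a fixed iteration count stands in for a convergence test.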
F. PageRank
PageRank is a search ranking algorithm that uses the hyperlinks on the Web.

G. AdaBoost
The AdaBoost algorithm is used to combine classifiers so that items are judged as a whole instead of as single entities. It is very simple and has a great ability to make accurate predictions.

H. kNN: k-nearest neighbor classification
kNN is used for classification and regression. An unlabelled vector is assigned the label that is most frequent among the k training samples nearest to that query point, where k is a user-defined number.

I. Naive Bayes
If the probability of one event is given, i.e. the event has already occurred, then we can find the probability of another event using Bayes' theorem.

J. CART: Classification and Regression Trees
CART is a method used to build classification and regression trees. These are used for predicting dependent variables, i.e. regression, and for predicting categorical variables, i.e. classification.
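The kNN rule of item H follows directly from its definition. A minimal sketch, assuming one-dimensional points and a majority vote among the k nearest training samples:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    # train is a list of (point, label) pairs; pick the k samples
    # nearest to the query point.
    nearest = sorted(train, key=lambda s: abs(s[0] - query))[:k]
    # Assign the most frequent label among those k neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
print(knn_classify(train, 1.8))  # prints: a
```

For regression, the same k neighbors would be averaged instead of voted on.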
LITERATURE SURVEY

1. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Discusses PageRank, a method for rating web pages objectively and mechanically.
2. (2001) Random forests. Shows that random forests are an effective tool in prediction.
3. (2003) The Google File System. Presents the fact that the Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.
4. (2004) MapReduce: Simplified Data Processing on Large Clusters. Shows that the MapReduce programming model has been successfully used at Google for many different purposes.
5. (2004) A storage architecture for early discard in interactive search. Examines the concept of early discard for interactive search of unindexed data.
6. (2005) Interpreting the data: Parallel analysis with Sawzall. Presents a system for automating such analyses.
7. (2006) Integrating compression and execution in column-oriented database systems. Discusses how to extend C-Store, a column-oriented DBMS, with a compression sub-system.
8. (2006) Bigtable: A Distributed Storage System for Structured Data. Describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, along with the design and implementation of Bigtable.
9. (2007) Dynamo: Amazon's Highly Available Key-value Store. Presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience.
10. (2007) An engineering perspective. Describes selected algorithmic and engineering problems encountered, and the solutions found for them.
11. (2008) Top 10 algorithms in data mining. Presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006.
12. (2009) ReadWrite, http://www.readwriteweb.com/enterprise/2009/02/is-therelational-database-doomed.php. Discusses the problem with relational databases.
13. (2010) Google Big Table. Describes the differences between Bigtable and a relational database, focusing on the different data models used by them.
14. (2012) A Few Useful Things to Know about Machine Learning. Summarizes twelve key lessons that machine learning researchers and practitioners have learned.

CONCLUSION

Bigtable provides good performance and high availability. The MapReduce model is easy to use, problems are easily expressible in it, and its implementation scales to large clusters of machines. The Google File System is the technology that enables a very powerful general-purpose cluster.
Thus, the basics of Bigtable, MapReduce and the Google File System were discussed in brief.

REFERENCES
1. en.wikipedia.org/wiki/Data_science
2. en.wikipedia.org/wiki/BigTable
3. en.wikipedia.org/wiki/MapReduce
4.