
Sharding

Agenda:
Federation vs. Partitioning vs. Sharding
What Is Sharding?
Introduction to Sharding
Architectural Overview: Shard, Config Server, Routing Process (mongos), Shard Key, Chunk
Sharding Setup Step-by-Step
Auto Sharding Demo
Hash Sharding
Tag-aware Sharding
Summary
Lab#5 Sharding

Federation vs. Partitioning vs. Sharding A federation is a set of things (usually states or regions) that together compose a centralized unit, but each individually maintains some aspect of autonomy. In database terms, it means that several databases hold information, but certain instances are completely responsible for different portions of the data, commonly based on characteristics of the data itself. Federation is typically spread across machines, and it more often applies to schemes that divide on logical boundaries, e.g. different geographic locations. Another common example is federating based on quality of service (paying users vs. free users).

Horizontal partitioning The concept of "Divide and Conquer": dividing tables and indexes into manageable pieces. Selected subsets of rows are placed in different tables. When a view is created over all the tables and queries are directed to the view, the result is a partitioned view.

[Diagram: horizontal partitioning by a unique regional ID into Chicago, NY, and LA tables]

Vertical partitioning The columns of a very wide table are spread across multiple tables containing distinct subsets of the columns. The result is multiple tables containing the same number of rows but different columns, usually with the same primary key in each table.

[Diagram: vertical partitioning of a customer table keyed by Customer SSN into Checking, Savings, and Visa tables]

Database partitioning Multiple physical tables are no longer involved. When a table is created as a partitioned table, the database automatically places the table's rows in the correct partition, and the database maintains the partitions behind the scenes. You can then perform maintenance operations on individual partitions, and properly filtered queries will access only the correct partitions. But it is still one table as far as the database is concerned.

[Diagram: a table range-partitioned by Year: < 1/1/2014, 1/1/2014 – 12/31/2014, > 1/1/2015]

Different Methods of Partitioning in Oracle

Range Partitioning: Used when there are logical ranges of data. Possible usage: dates, part numbers, and serial numbers.

Hash Partitioning: Used to spread data evenly over partitions. Possible usage: data has no logical groupings.

List Partitioning: Used to list together unrelated data into partitions. Possible usage: a number of states list-partitioned into a region.

Composite Range-Hash Partitioning: Used to range partition first, then spread data into hash partitions. Possible usage: range partition by date of birth, then hash partition by name; store the results in the hash partitions.

Composite Range-List Partitioning: Used to range partition first, then spread data into list partitions. Possible usage: range partition by date of birth, then list partition by state; store the results in the list partitions.

Single-level Partitioning

MongoDB computes a hash of a field’s value, and then uses these hashes to create chunks ensuring a more random distribution of a collection in the cluster.

Composite Partitioning

[Diagram: composite partitioning by month ranges: Jan-Feb, Mar-Apr, May-Jun, Jun-Jul]

Why Partition? Benefits
Each partition is stored in its own physical location and can be managed individually. It can function independently of the other partitions, providing a structure that can be better tuned for availability and performance.
Operations on partitioned tables and indexes are performed in parallel by assigning different parallel execution servers.
Because it is entirely transparent to query statements, partitioning can be applied to almost any application.
Storing partitions in separate locations enables you to: reduce the possibility of data corruption in multiple partitions; control the mapping of partitions to disk drives (important for balancing I/O load); improve manageability, availability, and performance.
Partitioning is useful for many different types of applications, particularly applications that manage large volumes of data. OLTP systems often benefit from improvements in manageability and availability, while data warehousing systems benefit from performance and manageability.

What Is Sharding? A shard is a piece of broken ceramic, glass, rock (or some other hard material) and is often sharp and dangerous. Sharding is physically breaking large data into smaller pieces (shards) of data. The trick is putting them back together again…

Sharding: A database architecture that enables horizontal scaling by splitting data into key ranges among two or more replica sets. This architecture is also known as "range-based partitioning."

Introduction to Sharding

Sharding refers to the process of splitting data up and storing different portions of the data on different machines; the term partitioning is also sometimes used to describe this concept. By splitting data up across machines, it becomes possible to store more data and handle more load without requiring large or powerful machines. Manual sharding can be done with almost any database software: the application maintains connections to several different database servers, each of which is completely independent. The application code manages storing different data on different servers and querying against the appropriate server to get data back. This approach can work well but becomes difficult to maintain when adding or removing nodes from the cluster, or in the face of changing data distributions or load patterns. MongoDB supports autosharding, which eliminates some of the administrative headaches of manual sharding. The cluster handles splitting up data and rebalancing automatically.

Introduction to Sharding Vertical scaling adds more CPU and storage resources to increase capacity. Scaling by adding capacity has limitations: high-performance systems with large numbers of CPUs and large amounts of RAM are disproportionately more expensive than smaller systems. Additionally, cloud-based providers may only allow users to provision smaller instances. As a result, there is a practical maximum capability for vertical scaling. Sharding, or horizontal scaling, by contrast, divides the data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database.

Sharded vs. Non-sharded In a non-sharded MongoDB setup, you have a client (mongo shell) connecting to a mongod process. In a sharded setup, the client connects to a mongos process, which abstracts the sharding away from the application. From the application's point of view, a sharded setup looks just like a non-sharded setup. There is no need to change application code when you need to scale.

[Diagrams: non-sharded client connection vs. sharded client connection]

Sharding

• Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. • As a result, a cluster can increase capacity and throughput horizontally. • To insert data, the application only needs to access the shard responsible for that record. • Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows. • For example, if a database has a 1 terabyte data set, and there are 4 shards, then each shard might hold only 256GB of data. If there are 40 shards, then each shard might hold only 25GB of data.

Sharding and Sharing Sharding uses the "shared nothing" model: each server has its own data (chunks) and processor (mongod). The functionality of "shared nothing" is: a controlling node, mongos, receives the query; it then sends the query to the other nodes; each gets its data and sends it to the controlling node; and the controlling node unifies or aggregates the result sets.
When to Shard You've run out of disk space on your current machine. You want to write data faster than a single mongod can handle. You want to keep a larger proportion of data in memory to improve performance. In general, you should start with a non-sharded setup and convert it to a sharded one if and when you need to.

MongoDB Sharding Model and Process MongoDB sharding provides an effectively unlimited size for a data collection, which is important for any big data scenario.
Sharding process: a collection can be partitioned (by a partition key) into chunks (each covering a key range), and the chunks are distributed across multiple shards.
Sharding model: load-balance write requests across MongoDB shards.
Auto sharding: use an index on one or more fields as the shard key to partition data across your sharded cluster.
Hash sharding: use a hashed index of a single field as the shard key to partition data across your sharded cluster.
Tag-aware sharding: associate specific ranges of the shard key with a specific shard or subset of shards. This association dictates the policy of the cluster balancer process as it balances the chunks around the cluster.
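As a rough sketch of what these three models look like in the mongos shell (the database, collection, and field names are made up for illustration, and the two shardCollection calls are alternatives, not a sequence):

// Illustrative names only: database "shop", collection "orders"
mongos> sh.enableSharding("shop")
// Auto (range-based) sharding on an indexed field:
mongos> sh.shardCollection("shop.orders", { customerId: 1 })
// Hash sharding on a single hashed field (alternative to the above):
mongos> sh.shardCollection("shop.orders", { _id: "hashed" })
// Tag-aware sharding: tie a shard and a key range to a tag:
mongos> sh.addShardTag("shard0000", "US")
mongos> sh.addTagRange("shop.orders", { customerId: 0 }, { customerId: 1000 }, "US")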

Auto Sharding in MongoDB The basic concept behind MongoDB's sharding is to break up collections into smaller chunks. These chunks can be distributed across shards so that each shard is responsible for a subset of the total data set. The application doesn't need to know which shard has which data, or even that the data is broken up across multiple shards, so we run a routing process called mongos in front of the shards. Mongos knows where all of the data is located, so applications can connect to it and issue requests normally. It's transparent to applications: to the application, mongos looks like a normal mongod. The router, knowing which data is on which shard, forwards the requests to the appropriate shard(s). If there are responses to the request, the router, mongos, collects them and sends them back to the application.

Sharded Cluster Components
Sharded cluster: the set of nodes comprising a sharded MongoDB deployment. Sharding occurs within a sharded cluster.
Shards: each shard is a separate mongod instance or replica set that holds a portion of the database collections.
Config server: each config server is a mongod instance that holds metadata about the cluster.
mongos: routes the reads and writes to the shards.

Replica set · mongod · Collection · Shard · Shard key · Config server · Chunk · Router (mongos) · Client (app)

Shards Each shard consists of one or more servers and stores data using mongod processes. In a production situation, each shard will consist of multiple servers to ensure availability and automated failover. The set of servers/mongod processes within the shard comprises a replica set. In MongoDB, sharding is the tool for scaling a system, and replication is the tool for data safety, high availability, and disaster recovery. For testing, you can use sharding with a single mongod instance per shard. Production databases typically need redundancy, so they use replica sets.

Config Server The config server(s) store the cluster's metadata, which includes basic information on each shard server and the chunks contained therein. Chunk information is the main data stored by the config servers. Each config server has a complete copy of all chunk information. A two-phase commit is used to ensure the consistency of the configuration data among the config servers. Config servers use their own replication model; they are not run as a replica set. If any of the config servers is down, the cluster's metadata goes read-only. However, even in such a failure state, the MongoDB cluster can still be read from and written to.

Routing Processes (mongos) The mongos process can be thought of as a routing and coordination process that makes the various components of the cluster look like a single system. When receiving client requests, the mongos process routes the request to the appropriate server(s) and merges any results to be sent back to the client. mongos processes have no persistent state; rather, they pull their state from the config servers on startup. Any changes that occur on the config servers are propagated to each mongos process. mongos processes can run on any server desired. They may be run on the shard servers themselves, but they are lightweight enough to exist on each application server. There are no limits on the number of mongos processes that can be run simultaneously, since these processes do not coordinate with one another.

Shard Keys To partition a collection, we specify a shard key pattern. This pattern is similar to the key pattern used to define an index; it names one or more fields that define the key upon which we distribute data. Examples:
{ state: 1 }
{ name: 1 }
{ _id: 1 }
{ lastname: 1, firstname: 1 }
{ tag: 1, timestamp: -1 }
MongoDB's sharding is order-preserving: documents adjacent by shard key tend to be on the same server. The config server stores all the metadata indicating the location of data by range.

11 Shard Keys and Index Keys

All sharded collections must have an index that starts with the shard key; i.e. the index can be an index on the shard key or a compound index where the shard key is a prefix of the index.

If you shard a collection without any documents and without such an index, the shardCollection command will create the index on the shard key.

If the collection already has documents, you must create the index before using shardCollection.
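A minimal sketch of this rule, using made-up database and collection names:

// Empty collection: shardCollection creates the shard key index itself.
mongos> sh.shardCollection("mydb.newColl", { userId: 1 })
// Collection that already has documents: build the index first,
// otherwise shardCollection returns an error.
mongos> use mydb
mongos> db.people.ensureIndex({ lastname: 1, firstname: 1 })
mongos> sh.shardCollection("mydb.people", { lastname: 1, firstname: 1 })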

The Key to Sharding: Shard Keys When you set up sharding, you choose a key from a collection and use that key's values, the shard key, to split up the data. Example: a collection of documents representing people, with "name" chosen as the shard key and three shards covering names from A to Z: the first shard could hold names from A–F, the next shard names from G–P, and the final shard names from Q–Z. Different queries will be executed in different ways:
> db.people.find({name: "Susan"})
mongos will send this query directly to the Q–Z shard, receive a response from that shard, and send it to the client.
> db.people.find({name: {"$lt" : "L"}})
mongos will send the query to the A–F and G–P shards in serial. It will forward their responses to the client.

Picking a Shard Key Cardinality refers to the uniqueness of data values contained in a column. High cardinality means that the column contains a large percentage of totally unique values; low cardinality means that the column contains a lot of "repeats" in its data range. A good shard key should optimize routing and balancing, minimize (unnecessary) traffic, and allow the best scaling. Also consider presplitting, and incremental keys (time-based, ObjectId, auto-increment) vs. segmented access.

What will you pick for a sharding key? Customer Name · Salutations · Gender · Education · Job · City · Zip Code · E-mail · Phone · Hobbies

Shard Key Characteristics
The shard key is used to partition your collection. A single partition key is specified for each collection stored in the sharded cluster.
The shard key must exist in every document.
The shard key is immutable, and shard key values are immutable.
The shard key must be indexed.
The shard key is used to route requests to shards.
The key space of the partition key is divided into contiguous key ranges called chunks, which are hosted by corresponding shards.

Chunks A chunk is a contiguous range of data from a particular collection. Chunks are described as a triple of collection, minKey, and maxKey. The shard key K of a given document assigns that document to the chunk where minKey <= K < maxKey. Chunks grow to a maximum size, 64MB by default. Once a chunk has reached that approximate size, it splits into two new chunks. When a particular shard has excess data, chunks will migrate to other shards in the system. When choosing a shard key, the values must be of high enough cardinality (granular enough) that data can be broken into many chunks and distributed. If we are sharding on name and a huge number of users have the same name, that could be problematic, since all documents with that name would be in a single undivided chunk. With names that typically does not happen, so name is a reasonable choice as a shard key. If it is possible that a single value within the shard key range might grow exceptionally large, it is best to use a compound shard key instead so that further discrimination of the values is possible.
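For example (a sketch with assumed names), a compound shard key gives the cluster another field to split on when many documents share the same name; the two calls below are alternatives for the same collection:

// Sharding on name alone: every "Smith" document falls into one chunk.
mongos> sh.shardCollection("mydb.people", { name: 1 })
// Compound shard key: documents with the same name can still be split
// into several chunks by the second field.
mongos> sh.shardCollection("mydb.people", { name: 1, _id: 1 })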

The key space
[Diagram: shard key field x, with documents {x: 10}, {x: -5}, {x: -9}, {x: 7}, {x: 6}, {x: 0}]

Inserting data
[Diagram: documents are ordered by shard key: {x: -9}, {x: -5}, {x: 0}, {x: 6}, {x: 7}, {x: 10}]

Chunk range and size

Inserting more data
[Diagram: new documents {x: -7}, {x: 3}, {x: 9} are placed into the chunks covering their key ranges, alongside {x: -9}, {x: -5}, {x: 0}, {x: 6}, {x: 7}, {x: 10}]

Chunk splitting A chunk is split once documents exceed the maximum size (i.e. 64 MB). There is no split point if all documents have the same shard key. A chunk split is a logical operation (no data is moved). If a split creates too large a discrepancy in chunk count across the cluster, a balancing round starts to rebalance. Can chunks be different sizes? Yes, you can control the chunk size (a sketch of changing it follows the diagram below), but in general one size fits all.

[Diagram: the chunk covering {x: -9}, {x: -5}, {x: 0}, {x: 6}, {x: 7}, {x: 10} is split into two chunks: {x: -9}, {x: -5}, {x: 0} and {x: 6}, {x: 7}, {x: 10}]
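The cluster-wide chunk size can be changed through the config database; a minimal sketch (the value is in megabytes, 64 being the default):

mongos> use config
mongos> db.settings.save({ _id: "chunksize", value: 32 })   // lower the max chunk size to 32 MB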

Operation Types

Operations on a sharded system fall into one of two categories:
Targeted operations: mongos communicates with a very small number of shards, often a single shard. Targeted operations are quite efficient.
Global operations: the mongos process reaches out to all (or most) shards in the system.
> db.people.find().sort({email: 1})
mongos will query all of the shards and do a merge sort when it gets the results, to make sure it returns them in the correct order.
> db.people.find({email: "[email protected]"})
mongos does not keep track of the "email" key, so it doesn't know which shard to send this to. Thus, it sends the query to all of the shards in serial.

Example: Assume a shard key of { x : 1 }.
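Assuming the shard key { x : 1 }, the following illustrative queries (collection name invented) fall into the two categories:

// Targeted: the query includes the shard key, so mongos routes it to
// the shard(s) owning that key range.
> db.coll.find({ x: 10 })
> db.coll.find({ x: { $gte: 5, $lt: 20 } })
// Global (scatter-gather): no shard key in the query, so mongos must ask every shard.
> db.coll.find({ y: 5 })
> db.coll.find().sort({ y: 1 })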

Cluster Request Routing: Targeted Queries · Scatter Gather Queries · Scatter Gather Queries with Sort

Cluster Request Routing: Targeted Query
The query contains the shard key field (the field you are sharding on, e.g. x = 1).
1. Routable request received by mongos
2. The config metadata tells mongos which shard has x = 1
3. Request routed to the appropriate shard
4. Shard returns results
5. Mongos returns results to the client

Cluster Request Routing: Non-Targeted (Scatter Gather) Query
The query has no filters and no criteria on the shard key.
1. Non-targeted request received
2. Request sent to all shards
3. Shards return results to mongos
4. Mongos returns results to the client

Cluster Request Routing: Non-Targeted Query with Sort
1. Non-targeted request with sort received
2. Request sent to all shards
3. Query and sort performed locally on each shard
4. Shards return results to mongos
5. Mongos merges the sorted results
6. Mongos returns results to the client

The mongos doesn’t have to load the whole set into memory since each shard sorts locally. The mongos can just get more from the shards as needed and incrementally return results to the client.

Example: Sharding an Existing Collection Suppose we have an existing collection of logs and we want to shard it. If we enable sharding and tell MongoDB to use "timestamp" as the shard key, we'll have a single shard with all of our data. We can insert any data we'd like, and it will all go to that one shard. Now, suppose we add a new shard. Once this shard is up and running, MongoDB will break up the collection into two pieces, called chunks. A chunk contains all of the documents for a range of values for the shard key, so one chunk would have documents with a timestamp value between -∞ and 12/31/2013, and the other chunk would have timestamps between 1/1/2014 and ∞. One of these chunks would then be moved to the new shard. If we get a new document with a timestamp value of 12/31/2013 or earlier, we'll add it to the first chunk; otherwise, we'll add it to the second chunk.
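A sketch of the commands behind this example (the database and collection names are assumed):

mongos> sh.enableSharding("logdb")
// Range-shard the logs collection on its timestamp field.
mongos> sh.shardCollection("logdb.logs", { timestamp: 1 })
// All data starts in a single chunk covering (-∞, ∞); chunks are split and
// migrated as data grows and as new shards are added.
mongos> sh.status()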

Data distribution
MinKey → 0 lives on Shard 1; 0 → MaxKey lives on Shard 2.
Mongos routes data: db.test.insert({ x: -1000 }) goes to Shard 1.
Mongos routes queries appropriately.

db.test.insert({ … more … })

Unbalanced shards
[Diagram: after many more inserts, the chunks are unevenly distributed between Shard 1 (minKey → 0) and Shard 2 (0 → maxKey)]

Balancing - Moving the chunk One chunk of data is copied from Shard 1 to Shard 2 In order to minimize the impact of balancing on the cluster, the balancer will not begin balancing until the distribution of chunks has reached certain thresholds. These thresholds apply to the difference in number of chunks between the shard with the greatest number of chunks and the shard with the least number of chunks.

Committing Migration
The source shard deletes the moved data:
It must wait for open cursors to either close or time out.
NoTimeout cursors may prevent the release of the lock.
The mongos releases the balancer lock after the old chunks are deleted.
Other mongos processes have to find out about the new configuration.

• Moving data is expensive (I/O, network bandwidth)
• Moving many chunks takes a long time (can only move one chunk at a time)
• Balancing and migrations compete for resources with your application

Sharding Setup Step-by-Step
1. Start up a config server, because mongos uses it to get its configuration. The config server can be started like any other mongod process:
Create a new subdirectory C:\data\configdb
mongod --configsvr --rest
By default, the config server will use port 27019. A config server does not need much space or resources. (A generous estimate is 1KB of config server space per 200MB of actual data.)
2. Start a mongos process for your application to connect to. Routing servers don't even need a data directory, but they need to know where the config server is:
mongos --configdb localhost:27019
Shard administration is always done through a mongos.
3. Add a shard: a shard is just a normal mongod.
Create a new subdirectory C:\mongodb\shard1
mongod --shardsvr --dbpath shard1 --port 10000

Sharding Setup Step-by-Step Now we'll connect to the mongos process we started and add the shard to the cluster.
4. Start up a shell connected to your mongos:
mongo
Make sure you're connected to mongos, not a mongod.
5. Add shard(s):
> use admin
> sh.addShard("localhost:10000")
6. Shard data: MongoDB won't just distribute every piece of data you've stored. You have to explicitly turn sharding on for both the database and the collection. Example: shard the bar collection in the foo database on the "_id" key.
> use config
> sh.enableSharding("foo")
> sh.shardCollection("foo.bar", {_id: 1})

Sharding Data Sharding a database results in its collections being stored on different shards and is a prerequisite to sharding one of its collections. Now the collection will be sharded by the "_id" key. When we start adding data, it will automatically distribute itself across our shards based on the values of "_id". The example is fine for trying sharding or for development. However, when you move an application into production, you'll want a more robust setup. To set up sharding with no points of failure, you'll need the following:
Multiple config servers
Multiple mongos servers
Replica sets for each shard

A Production Configuration

Setting up multiple config servers is the same as setting up one; you just do it multiple times:
Create multiple subdirectories: C:\data\configdb1, C:\data\configdb2, etc.
mongod --configsvr --dbpath C:\data\configdb1 --port 20001
mongod --configsvr --dbpath C:\data\configdb2 --port 20002
When starting a mongos, connect it to all of the config servers:
mongos --configdb localhost:20001,localhost:20002
Config servers use two-phase commit, not the normal MongoDB asynchronous replication, to maintain separate copies of the cluster's configuration. This ensures that they always have a consistent view of the cluster's state. If a single config server is down, the cluster's configuration information goes read-only. Clients are still able to do both reads and writes, but no rebalancing will happen until all of the config servers are back up.

Balancing and Failover Balancing is necessary when the load on any one shard node grows out of proportion with the remaining nodes. In this situation, the data must be redistributed to equalize load across shards. As you add (or remove) shards, MongoDB rebalances the data so that each shard gets a balanced amount of traffic and a sensible amount of data. Failover is also quite important, since proper system functioning requires that each logical shard always be online. Each shard consists of more than one machine in a configuration known as a replica set. A replica set is a set of servers, each of which contains a replica of the entire data set for the given shard. One of the servers in a replica set will always be primary. If the primary fails, the remaining replicas are capable of electing a new primary.

Shard Demo – Step 1 Four processes: 2 shards + config server + mongos
mongod --shardsvr --port 10001 --dbpath shard1 --rest
mongod --shardsvr --port 10002 --dbpath shard2 --rest
mongod --configsvr --rest
mongos --configdb localhost:27019

Shard Demo – Step 2 Run the mongo shell connected to the mongos to set up 2 shards and check the status:
mongos> use admin
mongos> sh.addShard("127.0.0.1:10001")
mongos> sh.addShard("127.0.0.1:10002")
mongos> sh.status()

sh.addShard(host) Use this method to add a database instance or replica set to a sharded cluster. This method must be run on a mongos instance. The host parameter can be in any of the following forms:
[hostname]
[hostname]:[port]
[set]/[hostname]
[set]/[hostname],[hostname]:port
Do not use localhost for the hostname unless your configuration server is also running on localhost. A sharded cluster can deploy shards across a replica set. To add a shard on a replica set, you must specify the name of the replica set and the hostname of at least one member of the replica set.
> sh.addShard("localhost:10001")
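For example (the set name and hostnames are placeholders), adding a shard that is a replica set named rs0:

// Replica set name plus at least one member:
mongos> sh.addShard("rs0/mongo-a.example.net:27018")
// Several members may be listed:
mongos> sh.addShard("rs0/mongo-a.example.net:27018,mongo-b.example.net:27018")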

Shard Demo – Step 3 Enable sharding for a database (testdb) and a collection (testco):
mongos> db.chunks.find()
mongos> sh.enableSharding("testdb")
mongos> sh.shardCollection("testdb.testco", {name: 1})
mongos> db.testco.getIndexes()

Shard Demo – Step 4 Populate the testco collection (indexed by name) with a few sample records (a much larger set is inserted in Step 5):
mongos> use testdb
mongos> db.testco.ensureIndex({name: 1})
db.testco.save({name: "Frank", Age: 56, Job: "Accountant", State: "NY"})
db.testco.save({name: "Bill", Age: 23, State: "CA"})
db.testco.save({name: "Janet", Age: 34, Job: "Dancer"})
db.testco.save({name: "Andy", Age: 44})
db.testco.save({name: "Zach", Age: 23, Job: "Fireman", State: "CA"})

Shard Demo – Step 5 i=1; while (i<=1000) { j=1; while (j<=1000) {db.testco.save({name:"Person("+i+","+j+")", Age:i+j}); j = j+1 }; i=i+1; };

Shard Demo – Step 6
> db.testco.getIndexes()
> sh.status()

Shard Demo – Step 7
> use config
> db.chunks.find()
Chunk information is stored in the chunks collection. Here you can actually see how the data has been divided up across the cluster. The overall chunk range goes from -∞ (MinKey) to ∞ (MaxKey).
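A representative chunk document looks roughly like the following (the values shown are invented for illustration; real documents also carry version fields such as lastmod):

mongos> use config
mongos> db.chunks.findOne()
{
  "_id"   : "testdb.testco-name_MinKey",
  "ns"    : "testdb.testco",
  "min"   : { "name" : { "$minKey" : 1 } },
  "max"   : { "name" : "Janet" },
  "shard" : "shard0000"
}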

Shard Demo – Step 8 The databases collection contains a list of databases that exist on the shards and information about them:
> use config
> db.databases.find()
Every database available is listed here with some basic information.

You can find a list of shards in the shards collection: > db.shards.find()

Shard Demo – Step 9 The printShardingStatus function will give you a quick summary of the database/collection:

sh.help()

sh.addShard(host)                          host is server:port or setname/server:port
sh.enableSharding(dbname)                  enables sharding on the database dbname
sh.shardCollection(fullName, key, unique)  shards the collection
sh.splitFind(fullName, find)               splits the chunk that find is in at the median
sh.splitAt(fullName, middle)               splits the chunk that middle is in at middle
sh.moveChunk(fullName, find, to)           moves the chunk where 'find' is to 'to'
sh.setBalancerState(onOrNot)               turns the balancer on or off
sh.getBalancerState()                      returns true if the balancer is on, false if not
sh.isBalancerRunning()                     returns true if the balancer is running on any mongos
sh.addShardTag(shard, tag)                 adds the tag to the shard
sh.removeShardTag(shard, tag)              removes the tag from the shard
sh.status()                                prints a general overview of the cluster

ObjectId

12 bytes: ObjectId("51597ca8 e28587 b865 28edfd"), composed of a timestamp (4 bytes), machine/MAC identifier (3 bytes), PID (2 bytes), and counter (3 bytes)
• Used for _id
• Generated by the driver if not specified
• Theoretically globally unique

Sharding on ObjectId
// enabling sharding on demo database
mongos> use demo
mongos> sh.enableSharding("demo")

// sharding the demo collection mongos> sh.shardCollection("demo.test",{_id:1})

// create a loop inserting data mongos> for (x=0; x<10000; x++) {db.test.insert({value:x})}

ObjectId chunk distribution
> sh.status()

shards:
  { "_id" : "shard0000", "host" : "localhost:30000" }
  { "_id" : "shard0001", "host" : "localhost:30001" }
databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.test
      shard key: { "_id" : 1 }
      chunks:
        shard0001  3
      { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("...") } on : shard0001 { "t" : 1000, "i" : 1 }
      { "_id" : ObjectId("...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }

ObjectId gives a hot shard

[Diagram: because ObjectId values are ever-increasing, new writes all land in the top chunk, so Shard 1 (0 → maxKey) receives the inserts while Shard 0 (minKey → 0) sits idle]

• Sharding on incremental values like a timestamp is not optimal for even distribution
• You can have both hashed and non-hashed indexes; a non-hashed index will not be able to route queries
• What's the solution to sharding on incremental values as a shard key?

Hash-based Sharding MongoDB introduced hash-based sharding, allowing the user to shard based on a randomized shard key to spread documents evenly across a cluster. Hash-based sharding is an alternative to range-based sharding, making it easier to manage your growing cluster. Hashed Shard Keys The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. MD5 has been used in a wide variety of security applications and is also commonly used to check data integrity. An MD5 hash is typically expressed as a hexadecimal number, 32 digits long.

{x:1} md5 c4ca4238a0b923820dcc509a6f75849b

{x:2} md5 c81e728d9d4c2f636f067f89cc14862c

{x:3} md5 eccbc87e4b5ce2fe28308fd9f2a7baf3

Hashed shard keys eliminate hot shards

[Diagram: hashed writes are spread evenly between Shard 0 (minKey → 0) and Shard 1 (0 → maxKey)]

Under the hood:
• Creates a hashed index used for sharding
• Uses the first 64 bits of the MD5 hash of the field
• Uses an existing hashed index, or creates a new one on the collection
• Hashes both simple and embedded BSON values
• Represented as a NumberLong in the JS shell

Hash on simple or embedded BSON values
> mongod --setParameter=enableTestCommands=1
// hash on 1 as an integer
> db.runCommand({_hashBSONElement: 1})
{ "key" : 1, "seed" : 0, "out" : NumberLong("5902408780260971510"), "ok" : 1 }

// hash on "1" as a string
> db.runCommand({_hashBSONElement: "1"})
{ "key" : "1", "seed" : 0, "out" : NumberLong("-2448670538483119681"), "ok" : 1 }

Enabling hashed indexes
Create the index:
> db.collection.ensureIndex( {field: "hashed"} )
Using hashed shard keys
Enable sharding on the collection:
> sh.shardCollection("test.collection", {field: "hashed"})
Options: numInitialChunks specifies the number of initial chunks to create. The default is two chunks per shard (use sh._adminCommand to specify options).
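A sketch of passing numInitialChunks when sharding on a hashed key (the namespace is illustrative); the option goes through the underlying shardCollection admin command:

mongos> use admin
mongos> db.runCommand({
          shardCollection: "test.collection",
          key: { field: "hashed" },
          numInitialChunks: 8        // number of initial chunks to create
        })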

Sharding on hashed ObjectId When sharding a new collection using hash-based shard keys, MongoDB will take care of the presplitting for you. Similarly sized ranges of the hash-based key are distributed to each existing shard, which means that no initial balancing is needed (unless, of course, new shards are added).

// enabling sharding on demo database
mongos> sh.enableSharding("demo")
{ "ok" : 1 }

// shard by hashed _id field
mongos> sh.shardCollection("demo.hash", {_id: "hashed"})
{ "collectionsharded" : "demo.hash", "ok" : 1 }

Pre-splitting the data

Only happens on new collections.
databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.hash
      shard key: { "_id" : "hashed" }
      chunks:                        (default: 2 chunks in each of the 2 shards)
        shard0000  2
        shard0001  2
      { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 2 }
      { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
      { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 { "t" : 2000, "i" : 4 }
      { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }

Inserting into a hashed shard key collection

// create a loop inserting data mongos> for (x=0; x<10000; x++) {db.hash.insert({value:x})}

Even distribution of chunks
test.hash
  shard key: { "_id" : "hashed" }
  chunks:
    shard0000  4
    shard0001  4
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374407069602479355") } on : shard0000 { "t" : 2000, "i" : 8 }
  { "_id" : NumberLong("-7374407069602479355") } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 9 }
  { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong("-2456929743513174890") } on : shard0000 { "t" : 2000, "i" : 6 }
  { "_id" : NumberLong("-2456929743513174890") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
  { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483539935376971743") } on : shard0001 { "t" : 2000, "i" : 12 }

Explain plan of an equality query

Uses the hashed index:
mongos> db.hash.find({x:1}).explain()
{
  "cursor" : "BtreeCursor x_hashed",
  "n" : 1,
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "millisShardTotal" : 0,
  "numQueries" : 1,
  "numShards" : 1,
  "indexBounds" : {
    "x" : [
      [ NumberLong("5902408780260971510"), NumberLong("5902408780260971510") ]
    ]
  },
  "millis" : 0
}

Tag Aware Sharding Tags in a sharded cluster are pieces of metadata that dictate the policy and behavior of the cluster balancer. Using tags, you may associate individual shards in a cluster with one or more tags. Then, you can assign a tag string to a range of shard key values for a sharded collection. When migrating a chunk, the balancer selects a destination shard based on the configured tag ranges. The balancer migrates chunks in tagged ranges to shards with those tags, if tagged shards are not balanced. To migrate chunks in a tagged environment, the balancer selects a target shard with a tag range whose upper bound is greater than the migrating chunk's lower bound. If a shard with a matching tagged range exists, the balancer will migrate the chunk to that shard.

After configuring tags on the shards and ranges of the shard key, the cluster may take some time to reach the proper distribution of data, depending on the division of chunks (i.e. splits) and the current distribution of data in the cluster. Once configured, the balancer respects tag ranges during future balancing rounds. Shard key ranges are always inclusive of the lower value and exclusive of the upper boundary. Example: given a sharded collection with two configured tag ranges, such that shard key values between 100 and 200 are tagged to direct corresponding chunks to shards tagged NYC, and shard key values between 200 and 300 are tagged to direct corresponding chunks to shards tagged SFO: in this cluster, the balancer will migrate a chunk with shard key value 168 to a shard tagged NYC, and one with value 234 to a shard tagged SFO.
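The configuration behind this example could look like the following sketch (shard names and the collection namespace are placeholders):

// Tag the shards.
mongos> sh.addShardTag("shard0000", "NYC")
mongos> sh.addShardTag("shard0001", "SFO")
// Tag the shard key ranges (lower bound inclusive, upper bound exclusive).
mongos> sh.addTagRange("mydb.coll", { key: 100 }, { key: 200 }, "NYC")
mongos> sh.addTagRange("mydb.coll", { key: 200 }, { key: 300 }, "SFO")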

36 Tag a Shard Associate tags with a particular shard using the sh.addShardTag() method when connected to a mongos instance. A single shard may have multiple tags, and multiple shards may also have the same tag. Example: Adds the tag NYC to two shards, and the tags SFO and ORD to a third shard: sh.addShardTag("shard0000", "NYC") sh.addShardTag("shard0001", "NYC") sh.addShardTag("shard0002", "SFO") sh.addShardTag("shard0002", "ORD") You may remove tags from a particular shard using the sh.removeShardTag() method when connected to a mongos instance, as in the following example, which removes the ORD tag from a shard: sh.removeShardTag("shard0002", "ORD")

Tag a Shard Key Range To assign a tag to a range of shard keys, use the sh.addTagRange() method when connected to a mongos instance. Any given shard key range may only have one assigned tag. You cannot overlap defined ranges, or tag the same range more than once. Example: given a collection named users in the records database, sharded by the zip code field, assign two ranges of zip codes in Manhattan and Brooklyn the NYC tag, and one range of zip codes in San Francisco the SFO tag:
sh.addTagRange("records.users", { zip: "10001" }, { zip: "10281" }, "NYC")
sh.addTagRange("records.users", { zip: "11201" }, { zip: "11240" }, "NYC")
sh.addTagRange("records.users", { zip: "94102" }, { zip: "94135" }, "SFO")

Remove a Tag From a Shard Key Range The mongod does not provide a helper for removing a tag range. You may delete a tag assignment from a shard key range by removing the corresponding document from the tags collection of the config database. Each document in the tags collection holds the namespace of the sharded collection and a minimum shard key value. Example: remove the NYC tag assignment for the range of zip codes within Manhattan:
> use config
> db.tags.remove({ _id: { ns: "records.users", min: { zip: "10001" }}, tag: "NYC" })

View Existing Shard Tags The output from sh.status() lists tags associated with a shard. A shard's tags exist in the shard's document in the shards collection of the config database. To return all shards with a specific tag, for example all shards tagged NYC:
> use config
> db.shards.find({ tags: "NYC" })
You can find tag ranges for all namespaces in the tags collection of the config database. The output of sh.status() displays all tag ranges. To return all shard key ranges tagged with NYC:
> use config
> db.tags.find({ tags: "NYC" })

Sharding and Replication

Summary
Sharding is for scaling your application: it segregates shared application data by reducing the data set held in any single database.
Range-based sharding: most efficient for applications that operate on ranges; requires careful shard key selection.
Hash-based sharding: uniform writes (random and evenly distributed); equality queries (i.e. field: "value") are directed to a specific shard.
Tag-aware sharding: customization for special needs.

Lab#5 Sharding

[Lab topology diagram]
Shard 1: mongod on port 27018, dbpath C:\mongodb\shard1
Shard 2: mongod on port 27027, dbpath C:\mongodb\shard2
Config server: mongod --configsvr on port 27019, dbpath C:\data\configdb
mongos router
mongo shell (client)
