Big Data Challenges in the Era of Data Deluge
Ilya Volvovski, Cleversafe Inc.

You are aware: data is coming
This is the new age: everything is digitized. The average storage growth rate is around 60-80% annually. Every minute of every day we create:
● More than 200 million email messages
● 48 hours of new YouTube videos
● 684,000 pieces of content shared on Facebook
● More than 100,000 tweets
Source: http://www.domo.com/

… and coming
● Government: intelligence gathering, maps, aerial pictures
● Experimental data from scientific institutions in areas such as physics, biology, climatology, chemistry, genomics, astrophysics
● Medical content
● Media content: movies, TV shows, sport
● Industries: banking, oil exploration
What do we need?
We need to build a different type of storage system.

Modern Storage System requirements
● Data reliability
● Storage scalability
● Access scalability
● Multiple interface support
● High system availability
● Consistency
● High performance
● Security
● Self-healing
● Integrity
  o Protection against silent data corruption
  o Protection against intrusion and malicious tampering

Modern storage system requirements (cont.)
● Support for disaster recovery
● Geo dispersion
  o Particularly for disaster avoidance
● Tolerance to hardware failures
● Tolerance to limping hardware
● Tiering support
● Integration with existing systems
● Efficient backend storage
  o Should be able to adapt to new technologies such as new disk types
  o A system user should not be aware of technology changes; the storage system migrates data and keeps operating
  o Flexible optimization to achieve performance requirements

Modern storage system requirements (cont.)
● Powerful system management
  o Authentication and access control
  o Scalable provisioning
  o Automated disk management
  o Hierarchical monitoring
  o Fault/event correlation
  o Fault prediction / fault tolerance
  o Automated recovery
  o Performance visualization
  o Seamless (zero-downtime) upgrade
    § Data migration path during upgrades
    § Background long conversions

Storage Reliability
A storage system's ability to preserve data despite the facts that:
• Physical devices fail
• Networks are not 100% reliable; messages could be lost, garbled, or not delivered at all
• Software systems are not 100% reliable and may crash
• Storage systems could be compromised
Bits on drives will eventually be lost
● Physical destruction
● Magnetic deterioration: sectors become unreadable or corrupted
  o Latent sector errors (LSE), aka unrecoverable read errors (URE); the bigger the drive, the more LSEs
  o Some could be detected by hardware
  o Lead to latent errors and eventual corruption
● Most data is lost during outright disk failures, but LSEs can also cause silent data corruption when a repair uses bad bits
● Losses due to system crashes / programmatic corruptions
Source: John Elerath, 2007

Hardware reliability measures
● Mean Time To Failure (MTTF): average time to failure
● Annualized Failure Rate (AFR): probable percentage of failures per year
● Mean Time To Repair (MTTR): average time to repair a loss
● AFR = 1 / (MTTF in years)
● Mean Time To Data Loss (MTTDL) is a derived characteristic
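As a rough illustration of the MTTF/AFR relationship (a minimal sketch assuming a constant failure rate; the figures are made up, not vendor data):

```python
# Approximate relation between MTTF and AFR, assuming a constant failure rate.
def afr_from_mttf(mttf_years: float) -> float:
    """AFR ~= 1 / MTTF (in years) when MTTF is much longer than a year."""
    return 1.0 / mttf_years

# A drive with a 25-year MTTF has an AFR of about 1/25 = 0.04, i.e. ~4% per year.
print(afr_from_mttf(25))  # 0.04
```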
Source: Backblaze, September 2014

How do disks die?
Mean time is a relatively crude measure.

How data is protected today
Typical ways to protect against data loss:
• Replication
• RAID (Redundant Array of Independent Disks)
• A combination of the two

RAID
Redundant Array of Independent (Inexpensive) Disks
● RAID 0 (striping)
● RAID 1 (mirroring)
● We are only interested in RAID 5 and 6 here
  o Protects against a single or double disk failure
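To make the redundancy idea concrete, here is a minimal sketch (illustrative only, not any particular RAID implementation) of RAID-5-style single parity: the parity block is the XOR of the data blocks, so any one lost block can be recomputed from the survivors.

```python
# RAID-5 style single parity (illustrative sketch).
def xor_blocks(blocks):
    """XOR equal-length byte blocks together; the result is the parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data)

# If data[1] is lost, it can be rebuilt from the remaining blocks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```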
RAID Reliability
● Every additional tolerated failure increases reliability by roughly a factor of MTTF/(N·MTTR).
● The formulas are an accurate approximation for MTTF/MTTR >> N.
● MTTF is measured in years; MTTR in hours, at most days.
● Each additional tolerated failure increases reliability by a factor of about a thousand.
● For a single RAID-6 (6+2), the chance of a failure within 10 years is 0.13% and the expected data loss is 7.2 MB.
Data Loss Probability in Large Storage Systems

RAID critique
● MTTF improves, but significantly more slowly than storage capacity and disk read/write speeds grow. An LSE present during a rebuild may cause silent rebuild errors to creep into the system and remain hidden, leading to data corruption.
● Can tolerate only one or two concurrent failures; the bigger the system, the greater the chance of concurrent failures.
● The chance of an LSE increases with size.
● High chance of correlated errors, as parity disks are often in the same location, rack, or enclosure.
● High system capacity makes the probability of failure much higher.
● Doesn't improve availability.

Need copies
● Typically three copies - the magic standard for enterprise
  o HDFS - Hadoop
  o Ceph - distributed file storage
  o Riak - distributed data storage with a DBMS
● Too expensive
● Less secure
● More complex to make consistent

An alternative to RAID + copies → IDA
An information dispersal algorithm (IDA) is a method of storing data along with redundant (coded) data calculated from the original set.
● K (K >= 1) is the number of original disks
● M (M >= 0) is the number of coded disks created from the original K
● Any combination of K and M is theoretically possible
  o K=1, M=0 - a single unprotected disk
  o K=1, M>0 - M-level replication
  o any K, M=1 - RAID-5
  o any K, M=2 - RAID-6
● Storage overhead: M/K
● Computational complexity in the general case: O(M*K)
● Only restores data that is lost (erased); does not detect failures. Other methods of protection should be used against data corruption.

Another look at IDA for storage
Another way to describe the same process: split the original data into K+M pieces in such a way that any K of them (the threshold) can be used to restore the original data. There are several ways to split the original data into K parts and calculate the coded pieces.
W(idth) = K + M
T(hreshold) = K

A little history
1. Shamir (1979), "How to Share a Secret"
  o Space inefficient: each coded piece is the same size as the original data, total size N*size
  o Based on polynomial properties
  o Information-theoretically (unconditionally) secure
2. Rabin (1987), Information Dispersal Algorithm (IDA)
  o Most space efficient: total size is (N/K)*size
  o Only computationally secure

Why IDA now? IDA enablers
Computer industry megatrends served as IDA enablers. Dynamics of change in:
● Disk
● CPU
● Networking
● RAM

Historical trends in the computer industry
Mechanics: performance is limited by rotation and head movements; increasing slower than Moore's Law (hard drive speed).
Electronics: performance is limited by chip density; increasing as fast as Moore's Law (CPU, memory capacity, hard drive capacity).
Photonics: performance is limited by the wavelength of light through fiber; increasing faster than Moore's Law (networking speed).

Drive price per MB and capacity
1956: IBM, 5 megabytes, US$50,000

Disk trends
Hard drive capacity in GB doubles every 24 months; access speed, not so much.

CPU trends
Overall CPU performance keeps doubling every 18 months. This trend doesn't stop.

Network trends
The speed of transmitting bits over the network per unit cost doubles every nine months. This is the fastest-changing component.
http://www.ethernetalliance.org/wp-content/uploads/2012/04/Ethernetnet-Alliance-ECOC-2012-Panel-2.pdf

IDA is a form of erasure coding
Reed-Solomon (RS)
● The lowest overhead.
● Computationally expensive
  o With modern architectures this becomes irrelevant.

Tornado codes
● Computationally more efficient, O(n+m); use only XOR.
● May require more than the threshold number of blocks to decode.

Local Reconstruction Codes (LRC), MS Azure
● Derived from MDS (Maximum Distance Separable) codes, such as RS.
● Trade space efficiency for faster rebuild.

IDA basics (Reed/Solomon)
In the general case, the encoding matrix is N×K, where K is the number of original disks (the threshold) and N = K + M is the total number of original and coded disks (the width).
If every K×K submatrix is invertible, the original data D0…D(K-1) can be recalculated from any K codewords by solving a system of linear algebraic equations.

IDA Reed-Solomon (optimized)
An encoding matrix whose top K×K block is the identity matrix is called systematic. The input data appears unchanged among the output codewords, so only the M coded values need to be computed. Galois field arithmetic is used.
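Below is a minimal sketch of systematic encoding in this style over GF(2^8). The coefficient choice (powers of small field elements) is purely illustrative; a production IDA uses an encoding matrix constructed so that every K×K submatrix is guaranteed to be invertible, which this toy matrix does not check.

```python
# Toy systematic IDA / Reed-Solomon-style encoder over GF(2^8) (illustrative sketch).
from functools import reduce

def _build_gf_tables(prim=0x11D):
    """Exp/log tables for GF(2^8) with a common primitive polynomial."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= prim
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log

EXP, LOG = _build_gf_tables()

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_pow(a, n):
    return 1 if n == 0 else EXP[(LOG[a] * n) % 255]

def encode(data_slices, m):
    """data_slices: K equal-length byte strings. Returns K + M slices.
    The first K outputs are the inputs themselves (systematic part);
    only the M coded slices need any arithmetic."""
    k, length = len(data_slices), len(data_slices[0])
    coded = []
    for i in range(m):
        # One coding row: coefficients (i+2)^0, (i+2)^1, ..., (i+2)^(k-1)
        coeffs = [gf_pow(i + 2, j) for j in range(k)]
        coded.append(bytes(
            reduce(lambda acc, j: acc ^ gf_mul(coeffs[j], data_slices[j][pos]),
                   range(k), 0)
            for pos in range(length)))
    return list(data_slices) + coded

# slices = encode([b"abcd", b"efgh", b"ijkl"], m=2)  # 3 data slices + 2 coded slices
```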
Reliability: IDA vs RAID advantages

Controlled reliability
The formula is an approximation for MTTF/MTTR >> N:
MTTDL(K-of-N IDA system) ≈ MTTF^(M+1) / (N·(N−1)·…·(N−M)·MTTR^M), where M = N − K
● No control over MTTF
● Little control over MTTR
● Full control over N and K
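A small sketch of how N and K feed into the reliability estimate, using the approximation above (the MTTF and MTTR figures are assumptions for illustration, not measured values):

```python
# Approximate MTTDL for a K-of-N dispersed system, valid when MTTF/MTTR >> N.
def mttdl_years(n, k, mttf_years, mttr_hours):
    """Data is lost only if more than M = N - K stores fail within one repair window."""
    m = n - k
    mttr_years = mttr_hours / (24.0 * 365.0)
    numerator = mttf_years ** (m + 1)
    denominator = mttr_years ** m
    for i in range(m + 1):          # N * (N-1) * ... * (N-M)
        denominator *= (n - i)
    return numerator / denominator

# Assumed example: width 16, threshold 10, 25-year drive MTTF, 24-hour repair.
print(f"{mttdl_years(16, 10, 25, 24):.3e} years")
```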
Other IDA advantages over RAID
● Data is spread across nodes:
  o Fewer correlated failures.
  o Allows for multi-site recovery without replication.
  o Preserves availability during node and network outages.
  o Could be geographically distributed.
● Advanced rebuilding and protection techniques:
  o Scales well with size; multiple agents can participate in rebuilding after a disk failure.
  o Protects against malicious corruption and tampering.
● Security: data is not stored in a single location.
● Computationally very efficient with the latest hardware (instructions, architectures, etc.).
● Natural concurrency support.
IDA critique
● Expensive rebuilding
  o In order to restore 1 MB, the system needs to read a threshold's worth (K MB) of slices. Try to avoid rebuilding by all means: predictive drive-failure algorithms. Improved network speed and CPU enhancements make this less critical. Research is ongoing to develop minimum-storage-regenerating (MSR) codes.
● IOPS for small objects
  o For objects under 100 KB the slices become small, and the system scales poorly under this condition. Mitigations: combine replication with IDA; use SSDs for efficiency; tiering; caching. SSD drives mostly eliminate the IOPS issue for small objects.
● Computationally expensive
  o With advances in CPU technology the computational expense quickly goes away.

Classic file-based storage systems
Classic file-based storage systems (NAS/SAN) don't meet modern requirements for multi-petabyte storage.
● File-system path-based identification is not adequate for applications.
● Inherent directory contention.
● The local file system is typically based on block storage.

Does it look somewhat familiar?
Need an alternative

Object Storage
● Non-structured
● Location independent
● Globally unique identification
● Self-describing
● Associated metadata
● Typically WORM (write once, read many)
● Tag searchable
A highly scalable alternative to file systems, typically distributed.
De facto standard protocols: S3, OpenStack, CDMI.
It is disruptive.
System Properties We (You) Want

General Scalability
● Scale-out.
● No single point of failure.
● Share nothing but network resources.
● Additional resources should be accepted with little overhead (may require some rebalancing, but should be transparent otherwise).
● Loss of a resource should be compensated for (discovery and self-healing).
● Virtualized (the client should not be aware of physical location).

Storage Size Scalability
No limitations on physical size growth:
● May start small, at terabytes or petabytes, and grow to exabytes.
● Thin provisioning.
● Performance should scale with size.
● Storage devices of different types and sizes should be combinable.
Namespace Scalability
• In order to build a large system, one needs a huge namespace to reference all objects. A single system should be able to maintain many zettabytes. The namespace truly has to exceed any meaningful limits:
  – The address space should be large enough that the probability of a collision on random name generation is smaller than that of any physical random event, or exceeds the typical data survival span by several orders of magnitude.

Namespace Scalability (cont)
Distributed Hash Table (DHT): a key-value structure distributed among multiple nodes.
● Typically uses consistent hashing.
● Avoids significant data movement in case of remapping.
● Practically unlimited scalability.
● Decentralized; no coordinator needed.
● Fault tolerant.
● Dynamic (nodes can appear and disappear).
● Expensive for high-performance systems.
● Multiple implementations (Chord, Pastry, CAN). Used in many products (Cassandra, Scality).
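A minimal consistent-hashing sketch (illustrative only; real DHTs such as Chord or Pastry add routing, replication, and membership protocols on top of this idea): object names are hashed onto a ring, each object is owned by the first node clockwise from its hash, and adding or removing a node remaps only a small fraction of the keys.

```python
# Minimal consistent-hashing ring (illustrative sketch, not Chord/Pastry/CAN).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes smooth out the key distribution
        self._ring = []        # sorted list of (hash_point, node_name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns several points on the ring; only keys between a new
        # point and its predecessor move when a node is added.
        for v in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{v}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # The object is owned by the first ring point clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("object-42"))   # adding "node-d" later moves only ~1/4 of keys
```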
Access scalability
● The number of concurrently supported clients should grow along with storage growth (access size scaling).
● Should be independent of the number of storage nodes (storage size scaling).
  o Some limits could be based on storage size (access scales along with storage size).
  o Provide graceful degradation on overload.

Indexing
Data should be easily discoverable, otherwise it is useless.
● Objects should be self-describing, containing both data and associated metadata.
● Objects should be location independent: they could physically reside anywhere and could be moved; the client should not be aware of location.
● Performance should be predictable and not depend on any central authority.

Consistency
Consistency: the ability of an arbitrary reader to retrieve the latest version of the data.
● Immediate consistency
● Eventual consistency
Why is it important?
• Software logic may rely on the fact that once data is written it is available for reading.

Availability
Availability: a system's ability to respond within a reasonable amount of time, without error or timeout. Threats to availability include:
● Network partitions
● Hardware failures
  o disks
  o controllers
  o memory
● Software crashes
  o including kernels
● System upgrades
Availability is measured in "nines": the percentage of time the system can store and retrieve data (it may also be partially available: read or write, but not both). For example, five nines (99.999%) allows roughly five minutes of downtime per year.
Obstacle: Network Partition
Network partition: a condition in which some network devices cannot reach each other.
Partition tolerance: a system's ability to continue to function when network partitions occur.
Partition tolerance is not a choice; partitions are a fact of life in any distributed system and are outside the designer's control.
CAP theorem (Eric Brewer)
Consistency / Availability / Partition tolerance (CAP).
● Only two out of three can be satisfied simultaneously.
● Partition tolerance is a fact of life and can't be ignored.
● This means that when designing a distributed storage system one should favor either availability or consistency.
● Latency implications: consistency typically requires more round trips.

Conclusion
• A storage system design should choose between availability and consistency, or offer the choice as a deployment option.
High performance
● A very difficult task for a distributed storage system.
● The system needs to be adaptive to ever-changing conditions.
● Performance must scale with storage system size.
● Bottlenecks shift over time and the system must be sensitive to them.

Performance measurements
● Throughput
  o Large objects
  o Archiving systems
● IOPS
  o Small objects
● Latency
  o Interactive applications
● Time-To-First-Byte (TTFB)
  o Streaming

Self-healing / Rebuilding
● Self-healing restores the original redundancy.
● Creates additional traffic and system load.
● Should be scalable with storage size.
● Should be adaptable to storage state.

Forms of data preservation
Rebuilding is the last and most expensive resort for data preservation. Other measures can be deployed first:
● Disk failure prediction
● Disk retirement
● Tolerance to partial failures
System health
A measure of data reliability:
● Visual representation
● System events to alert on
● Policy-based automatic or manual switching to degraded mode when reliability is low
● Policy-based automatic or manual system shutdown when reliability is dangerously low
● Feedback to the rebuilding component

Security aspects (CIA)
● Confidentiality
  o Only authorized entities can access data. Attackers should not be able to read, write, or modify it.
● Integrity
  o Data is what you expect it to be.
● Availability
  o Information has value only if its owner can actually obtain and use it.

Forms of data protection
● Data at rest
  o Unauthorized entities should not have access to recognizable data.
● Data in motion
  o Primarily on the wire.

Tiering
● Human-defined rules
● Automatic detection
● Ability to move data from faster (more expensive) to slower storage and vice versa in order to improve latency for 'hot' objects
Universal access
Applications must be able to store and retrieve data using their existing protocols, without requiring changes to the applications themselves (see the sketch after this list).
● Support the existing object de facto standards
  o S3
  o OpenStack
  o CDMI
  o Open to inclusion and interoperability
● Storage systems must be able to interoperate with traditional file protocols
  o NFS
  o CIFS
  o Hadoop HDFS
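For example, a client written against the S3 de facto standard can talk to any object store that exposes an S3-compatible endpoint. The endpoint URL, credentials, bucket, and key below are placeholders, not a real deployment:

```python
# Storing and retrieving an object through the S3 de facto API (boto3).
# The endpoint, credentials, bucket, and key are illustrative placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.example.com",  # any S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello, object storage")
response = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(response["Body"].read())  # b'hello, object storage'
```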
GEO dispersal
● Reduce fault correlation
● Bring storage elements closer to the client
● Temporary loss of a single site should preserve both consistency and availability
  o R + W > IDA width (read and write quorums must overlap)
  o The two-site problem for an efficient IDA

Computational compatibility
The stored data should be analyzable:
● Logs
● Intelligence data, pattern search
● User data (indexing, face recognition)
Hadoop is the most popular analytics system.
● Hadoop support is essential
● Native HDFS support facilitates Hadoop deployment on storage nodes
● Challenge: avoid data movement between storage nodes
  o Apply IDA differently

Conclusions
In this lecture we discussed:
• Storage industry trends
• Existing storage system limitations
• Modern storage requirements
• The Information Dispersal Algorithm and its fit for solving storage problems
• Major modern storage system requirements

In the next hour we will discuss:
• What it takes to build a practical modern storage system
• Major practical system characteristics
• Examples of how these principles can be applied