Big Data Challenges in the Era of Data

Ilya Volvovski, Cleversafe Inc.

You are aware: data is coming

This is the new age: everything is digitized. The average storage growth rate is around 60-80% annually. Every minute of every day we create:
● More than 200,000 email messages
● 48 hours of new YouTube videos
● 684,000 bits of content shared on Facebook
● More than 100,000 tweets

Source: http://www.domo.com/

… and coming

● Government: intelligence gathering, maps, aerial pictures
● Experimental data from scientific institutions in areas such as physics, biology, climatology, chemistry, genomics, astrophysics
● Medical content
● Media content: movies, TV shows, sports
● Industries: banking, oil exploration

What do we need?

Build a different type of storage system.

Modern storage system requirements
● Data reliability
● Storage scalability
● Access scalability
● Multiple interfaces support
● High system availability
● Consistency
● High performance
● Security
● Self healing
● Integrity
  o Protection against silent data corruption
  o Protection against intrusion and malicious tampering

Modern storage system requirements

● Support for disaster recovery
● Geo dispersion
  o Particularly for disaster avoidance
● Tolerance to hardware failures
● Tolerance to limping hardware
● Tiering support
● Integration with existing systems
● Efficient backend storage
  o Should be able to adapt to new technologies such as new disk types
  o A system user should not be aware of technology changes; the storage system migrates data and keeps operating
  o Flexible optimization to achieve performance requirements

Modern storage system requirements

● Powerful system management
  o Authentication and access control
  o Scalable provisioning
  o Automated disk management
  o Hierarchical monitoring
  o Fault/event correlation
  o Fault prediction
  o Automated recovery
  o Performance visualization
  o Seamless (zero downtime) upgrade
    § Data migration path during upgrades
    § Background long-running conversions

Storage Reliability

A storage system's ability to preserve data despite:
• Physical devices fail
• Networks are not 100% reliable; messages could be lost, garbled, or not delivered at all
• Software systems are not 100% reliable and may crash
• Storage systems could be compromised

Bits on drives will eventually be lost:
● Physical destruction
● Magnetic deterioration: sectors become unreadable or corrupted
  o Latent sector errors (LSE), aka unrecoverable read errors (URE); the bigger the drive, the more LSEs
  o Some could be detected by hardware
  o Lead to latent errors and eventual corruption
● More data is lost during disk failures, and silent corruption can propagate when a repair uses bad bits
● Losses due to system crashes / programmatic corruption

Source: John Elerath, 2007

Hardware reliability measures

● Mean Time To Failure (MTTF): average time to failure
● Annualized Failure Rate (AFR): probable percent of failures per year
● Mean Time To Repair (MTTR): average time to repair a loss
● 1 / (MTTF in years) = AFR
● Mean Time To Data Loss (MTTDL) is a derived characteristic
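As a rough sanity check of the MTTF/AFR relationship above, here is a minimal sketch (the function names are illustrative, not from any particular library):

```python
import math

def afr_simple(mttf_hours: float) -> float:
    """Rule of thumb from the slide: AFR ≈ 1 / (MTTF in years)."""
    return 1.0 / (mttf_hours / 8760.0)

def afr_exponential(mttf_hours: float) -> float:
    """Exponential-failure model: AFR = 1 - exp(-8760 / MTTF_hours).
    Close to the rule of thumb whenever MTTF >> 1 year."""
    return 1.0 - math.exp(-8760.0 / mttf_hours)

# A drive rated at 1,000,000 hours MTTF:
print(afr_simple(1_000_000))       # ~0.0088 -> roughly 0.9% per year
print(afr_exponential(1_000_000))  # ~0.0087
```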

Source: Backblaze, September 2014

How do disks die

Mean time is a relatively crude measure.

How data is protected today

Typical ways to protect against data loss:
• Replication
• RAID (Redundant Array of Independent Disks)
• Combination of the two

RAID

Redundant Array of Independent (originally Inexpensive) Disks
● RAID 0 (striping)
● RAID 1 (mirroring)
● Only interested in RAID 5 and 6
  o Protect against a single or a double disk failure, respectively

RAID Reliability

● The formulas (shown below) are an accurate approximation when MTTF/MTTR >> N
● Every additional tolerated failure increases reliability by a factor of MTTF/MTTR
● MTTF is measured in years; MTTR in hours, at most days, so each additional tolerated failure increases reliability by roughly a factor of a thousand
● For a single RAID-6 (6+2) array the chance of a failure in 10 years is 0.13% and the expected data loss is 7.2 MB
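A common closed form of the approximation referenced above, assuming independent, exponentially distributed drive failures with mean MTTF, repair time MTTR, and N drives in the group (shown here for the RAID-5 and RAID-6 cases, consistent with the general K-of-N formula later in the lecture):

```latex
\mathrm{MTTDL}_{\mathrm{RAID5}} \approx \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
\qquad
\mathrm{MTTDL}_{\mathrm{RAID6}} \approx \frac{\mathrm{MTTF}^{3}}{N\,(N-1)\,(N-2)\,\mathrm{MTTR}^{2}}
```

Each extra tolerated failure multiplies MTTDL by roughly MTTF/MTTR, which is where the "factor of a thousand" per additional parity drive comes from.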

Data Loss Probability in Large Storage Systems

RAID critique

● MTTF improves, but significantly more slowly than storage capacity and disk read/write speed increase.
● LSEs present during a rebuild may cause silent rebuild errors to creep into the system and remain hidden, leading to data corruption.
● Can tolerate only one or two concurrent failures; the bigger the system, the higher the chance of concurrent failures.
● LSE chances increase with size.
● High chance of correlated errors, as parity disks are often in the same location, rack, or enclosure.
● High system capacity makes the probability of failure much higher.
● Doesn't improve availability.

Need copies

● Typically three copies - the magic standard for enterprise
  o HDFS - Hadoop
  o Ceph - distributed file storage
  o Riak - distributed data storage with a DBMS
● Too expensive
● Less secure
● More complex to make consistent

An alternative to RAID + copies → IDA

An information dispersal algorithm (IDA) is a method of storing data along with redundant (coded) data calculated from the original set.
● K (K>=1) is the number of original disks
● M (M>=0) is the number of coded disks created from the original K
● Any combination of K and M is theoretically possible:
  o K=1, M=0 - a single unprotected disk
  o K=1, M>0 - M-level replication
  o any K, M=1 - RAID-5
  o any K, M=2 - RAID-6
● Storage overhead: M/K (see the worked comparison below)
● Computational complexity in the general case: O(M*K)
● Only restores data if it is lost (erased); does not detect failures. Other methods of protection should be used against data corruption.
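As a worked comparison of the M/K overhead (the configuration K=10, M=6 is illustrative, not taken from the lecture):

```latex
\text{expansion factor} = \frac{K+M}{K} = \frac{10+6}{10} = 1.6
\quad\text{vs.}\quad 3.0 \text{ for three-way replication}
```

and the 10-of-16 dispersal survives any 6 simultaneous disk losses, while three-way replication survives only 2.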

Another look at IDA for storage

Another way to arrange the same process is to split the original data into K+M pieces in such a way that any K of them (the threshold) can be used to restore the original data. There are several ways to split the original data into K parts and calculate the coded pieces.

W(idth) = K + M
T(hreshold) = K
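To make the threshold idea concrete, here is a minimal sketch of the simplest case, M = 1 with XOR parity (RAID-5-like): any K of the K+1 pieces recover the data. The function names are illustrative; real IDAs use Reed-Solomon-style codes for M > 1.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_k_plus_1(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal pieces and append one XOR parity piece (M = 1)."""
    assert len(data) % k == 0, "pad the data to a multiple of k first"
    size = len(data) // k
    pieces = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, pieces)
    return pieces + [parity]          # width = k + 1, threshold = k

def decode_k_plus_1(pieces: list, k: int) -> bytes:
    """Recover the original data when at most one piece is None (lost)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    assert len(missing) <= 1, "XOR parity tolerates only a single loss"
    if missing:
        present = [p for p in pieces if p is not None]
        pieces[missing[0]] = reduce(xor_bytes, present)  # rebuild the lost piece
    return b"".join(pieces[:k])       # the parity piece is not part of the data

slices = encode_k_plus_1(b"ABCDEFGHIJKL", k=4)
slices[2] = None                      # lose any one slice
assert decode_k_plus_1(slices, k=4) == b"ABCDEFGHIJKL"
```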

A little history

1. Shamir (1979), "How to Share a Secret"
   o Space inefficient: each coded share is the same size as the original data, total size N*size
   o Based on polynomial properties
   o Information-theoretically (unconditionally) secure
2. Rabin (1987), Information Dispersal Algorithm (IDA)
   o Most space efficient: total size is (N/K)*size
   o Only computationally secure

Why IDA now? IDA enablers

Computer industry megatrends served as IDA enablers. Dynamics of change in:
● Disk
● CPU
● Networking
● RAM

Historical trends in the computer industry

● Mechanics (hard drive speed): performance is limited by rotation and head movements; increasing slower than Moore's Law.
● Electronics (CPU, memory capacity, hard drive capacity): performance is limited by chip density; increasing as fast as Moore's Law.
● Photonics (networking speed): performance is limited by the wavelength of light through fiber; increasing faster than Moore's Law.

Drive price per MB and capacity

1956: IBM, 5 megabytes, US$50,000

Disk trends

Hard drive capacity in GB doubles every 24 months; access speed, not so much.

CPU Trends

Overall CPU performance keeps doubling every 18 months, and this trend shows no sign of stopping.

Network trends

The speed (per unit cost) of transmitting bits over the network doubles every nine months. This is the fastest-changing component.

Source: http://www.ethernetalliance.org/wp-content/uploads/2012/04/Ethernetnet-Alliance-ECOC-2012-Panel-2.pdf

IDA is a form of erasure coding

Reed-Solomon (RS) codes:
● The lowest overhead.
● Computationally expensive, although with modern architectures this concern becomes irrelevant.

Tornado codes:
● Computationally more efficient, O(n+m); use only XOR.
● May require more than M blocks to decode.

Local Reconstruction Codes (LRC), used in MS Azure:
● Derived from MDS (Maximum Distance Separable) codes, such as RS.
● Trade space efficiency for faster rebuild.

IDA basics (Reed/Solomon)

In the general case the encoding matrix is N x K, where K is the number of original disks (the threshold) and N = K + M is the total of original and coded disks (the width).

If any K x K submatrix is invertible, the original data D0..D(K-1) can be recomputed from any K codewords by solving a system of linear algebraic equations.

IDA Reed-Solomon (optimized)

An encoding matrix whose top K x K block is the identity is called systematic: the input data appears unchanged among the output codewords, so only the M coded values need to be computed. Galois field arithmetic is used.
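Below is a minimal, illustrative sketch of the Reed-Solomon idea: encode K data values as evaluations of a degree-(K-1) polynomial and recover them from any K codewords by interpolation. For simplicity it works over the prime field GF(257) rather than the GF(2^8) arithmetic and systematic matrices used in real systems, and it is non-systematic; all names are hypothetical.

```python
# Illustrative K-of-N dispersal in the Reed-Solomon style, over GF(257).
# It only demonstrates that any K of the N codewords reconstruct the data.

P = 257  # prime modulus: every byte value 0..255 is a valid field element

def encode(data, n):
    """Treat the K data values as coefficients of a degree-(K-1) polynomial
    and emit its evaluations at x = 1..n as the N codewords."""
    return [sum(d * pow(x, j, P) for j, d in enumerate(data)) % P
            for x in range(1, n + 1)]

def decode(points, k):
    """Lagrange-interpolate the polynomial from any k (x, codeword) pairs
    and return its coefficients, i.e. the original data."""
    assert len(points) == k
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis = [1]   # coefficients of prod_{j != i} (x - xj), low degree first
        denom = 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            basis = ([(-xj * basis[0]) % P]
                     + [(basis[t - 1] - xj * basis[t]) % P
                        for t in range(1, len(basis))]
                     + [basis[-1]])
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P    # divide by denom (Fermat inverse)
        for t in range(k):
            coeffs[t] = (coeffs[t] + scale * basis[t]) % P
    return coeffs

data = [11, 22, 33, 44]                    # K = 4 original values (one byte each)
codes = encode(data, n=6)                  # width N = 6, tolerates any 2 losses
survivors = [(x, y) for x, y in enumerate(codes, start=1) if x not in (2, 5)]
assert decode(survivors, k=4) == data      # any 4 of the 6 codewords suffice
```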

Reliability: IDA vs RAID advantages

Controlled reliability

The formula below is an approximation for MTTF/MTTR >> N, with M = N - K tolerated failures:

MTTDL_(K/N IDA system) ≈ MTTF^(M+1) / ( N·(N-1)·…·K · MTTR^M )

● No control over MTTF ● Lile control over MTTR ● Full control over N and K Other IDA advantages over RAID

● Data is spread across nodes:
  o Fewer correlated failures
  o Allows for multi-site recovery without replication
  o Preserves availability during node and network outages
  o Could be geographically distributed
● Advanced rebuilding and protection techniques:
  o Scales well with size; multiple agents can participate in rebuilding after a disk failure
  o Protects against malicious corruption and tampering
● Security: data is not stored in a single location
● Computationally very efficient with the latest hardware (instructions, architectures, etc.)
● Natural concurrency support

IDA critique

● Expensive rebuilding
  o In order to restore 1 MB, the system needs to read K (threshold) MB of slices. Mitigations: avoid rebuilding by all means (predictive drive-failure algorithms); improved network speed and CPU enhancements make it less critical; research into minimum-storage-regenerating (MSR) codes.
● IOPS for small objects
  o For objects under 100KB the slices become small, and the system scales poorly under this condition. Mitigations: combine replication with IDA; use SSDs for efficiency; tiering; caching. SSD drives mostly eliminate the IOPS issue for small objects.
● Computationally expensive
  o With advances in CPU technology the computational expense quickly goes away.

Classic File-based Storage Systems

The classic file-based storage systems (NAS/SAN) don't meet modern-day requirements for multi-petabyte storage.
● File-system path-based identification is not adequate for applications.
● Inherent directory contention.
● Typically the local file system is based on block storage.

Does it look somewhat familiar?

Need an alternative

Object Storage

● Non-structured
● Location independent
● Globally unique identification
● Self-describing
● Associated metadata
● Typically WORM (write once, read many)
● Tag searchable

A highly scalable alternative to file systems, typically distributed. De facto standard protocols: S3, OpenStack, CDMI.

It is disruptive

System Properties We (You) Want

General Scalability
● Scale-out
● No single point of failure
● Share nothing but network resources
● Additional resources should be accepted with little overhead (may require some rebalancing, but should otherwise be transparent)
● Loss of a resource should be compensated for (discovery and self-healing)
● Virtualized (clients should not be aware of physical location)

Storage Size Scalability
No limitations on physical size growth:
● May start small, at terabytes or petabytes, and grow to exabytes
● Thin provisioning
● Performance should scale with size
● Storage devices of different types and sizes should be combined

Namespace Scalability
• In order to build a large system one needs a huge namespace to reference all objects. A single system should be able to maintain many zettabytes. The namespace truly has to exceed any meaningful limits:
  – The address space should be such that the probability of a collision on random name generation is smaller than that of any physical random event, or the expected time to a collision exceeds the typical data-survival span by several orders of magnitude.
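A quick birthday-bound check shows why random names suffice (the object count and name width below are illustrative assumptions, not figures from the lecture). For n randomly generated b-bit names, the collision probability is approximately

```latex
P_{\text{collision}} \approx \frac{n^{2}}{2^{b+1}},
\qquad\text{e.g. } n = 2^{40} \text{ objects } (\sim 10^{12}),\; b = 192
\;\Rightarrow\; P \approx \frac{2^{80}}{2^{193}} = 2^{-113}
```

which is far below the probability of any physical event over the lifetime of the data.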

Namespace Scalability (cont)

Distributed hash table (DHT): a key-value structure distributed among multiple nodes.
● Typically uses consistent hashing (see the sketch after this list).
● Avoids significant data movement in case of remapping.
● Practically unlimited scalability.
● Decentralized; no coordinator needed.
● Fault tolerant.
● Dynamic (nodes can appear and disappear).
● Expensive for high-performance systems.
● Multiple implementations (Chord, Pastry, CAN). Used in many products (Cassandra, Scality).
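A minimal sketch of consistent hashing with virtual nodes, to illustrate why remapping moves little data when nodes join or leave; the class and method names are illustrative, not from any particular product.

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Hash a string onto a 64-bit ring position."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 64):
        # Each physical node owns `vnodes` points on the ring to smooth the load.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [pos for pos, _ in self._ring]

    def node_for(self, object_name: str) -> str:
        """An object maps to the first ring point at or after its hash (wrapping)."""
        idx = bisect.bisect(self._keys, _h(object_name)) % len(self._ring)
        return self._ring[idx][1]

ring_a = ConsistentHashRing(["node1", "node2", "node3"])
ring_b = ConsistentHashRing(["node1", "node2", "node3", "node4"])  # one node added
objects = [f"object-{i}" for i in range(10_000)]
moved = sum(ring_a.node_for(o) != ring_b.node_for(o) for o in objects)
print(f"{moved / len(objects):.1%} of objects moved")  # roughly 1/4, not the ~3/4 of a naive mod-N scheme
```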

Access scalability

● The number of concurrently supported clients should grow along with storage growth (access-size scaling).
● Should be independent of the number of storage nodes (storage-size scaling).
  o Some limits could be based on storage size (access scales along with storage size).
  o Provide graceful degradation on overload.

Indexing

Data should be easy to discover; otherwise it is useless.
● Objects should be self-described, containing both data and associated metadata.
● Objects should be location independent: they could physically reside anywhere and could be moved; the client should not be aware of the location.
● Performance should be predictable and should not depend on any central authority.

Consistency

Consistency: the ability of an arbitrary reader to retrieve the latest version of the data.
● Immediate consistency
● Eventual consistency

Why is it important?
• Software logic could rely on the fact that if data is written, it is available for read.

Availability

Availability: a system's ability to respond within a reasonable amount of time, without error or timeout, despite:
● Network partitions
● Hardware failures
  o disks
  o controllers
  o memory
● Software crashes
  o including kernels
● System upgrades
Measured in nines of availability: the percent of the time the system can store and retrieve data (or may be partially available: read or write, but not both).
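As a quick reference for what "nines of availability" means in practice, a tiny calculation (illustrative, not from the lecture):

```python
def downtime_per_year(nines: int) -> float:
    """Allowed downtime in minutes per year for a given number of nines."""
    availability = 1.0 - 10.0 ** (-nines)           # e.g. 3 nines -> 0.999
    return (1.0 - availability) * 365.25 * 24 * 60  # minutes in an average year

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_per_year(n):8.1f} minutes/year")
# 2 nines: ~5260 min (~3.7 days), 3 nines: ~526 min, 4 nines: ~53 min, 5 nines: ~5.3 min
```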

Obstacle: Network Partition

A network partition is a condition in which some network devices cannot reach each other. Partition tolerance is a system's ability to continue to function when network partitions occur.

Partition tolerance is not a choice; it is a fact of life in any distributed system and is outside of the designer's control.

CAP theorem (Eric Brewer)

Consistency / Availability / Partition tolerance (CAP):
● Only two out of the three can be satisfied simultaneously.
● Partition tolerance is a fact of life and can't be ignored.
● This means that when designing a distributed storage system one should favor either availability or consistency.
● Latency implications: consistency typically requires more round trips.

Conclusion
• A storage system design should choose between availability and consistency, or offer the choice as a deployment option.

High performance

● A very difficult task for a distributed storage system
● The system needs to be adaptive to ever-changing conditions
● Must scale with storage system size
● Sensitive to bottlenecks that change over time

Performance measurements

● Throughput
  o Large objects
  o Archiving systems
● IOPS
  o Small objects
● Latency
  o Interactive applications
● Time-To-First-Byte (TTFB)
  o Streaming

Self healing / Rebuilding
● Self-healing restores the original redundancy
● Creates additional traffic and system load
● Should be scalable with storage size
● Should be adaptable to storage state

Forms of data preservation

Rebuilding is the last and most expensive resort for data preservation. Others could be deployed first:
● Disk failure prediction
● Disk retirement
● Tolerance to partial failures

System health

The measure of data reliability:
● Visual representation
● System events to alert on
● Policy-based automatic or manual switching to degraded mode when reliability is low
● Policy-based automatic or manual system shutdown when reliability is dangerously low
● Feedback to the rebuilding component

Security aspects (CIA)

● Confidentiality
  o Only authorized entities can access data. Attackers should not be able to read, write, or modify it.
● Integrity
  o Data is what you expect it to be.
● Availability
  o Information has value only if the owner can actually obtain and use it.

Forms of data protection

● Data at rest
  o Unauthorized entities should not have access to recognizable data
● Data in motion
  o Primarily on the wire

Tiering

● Human-defined rules
● Automatic detection
● Ability to move data from faster (more expensive) to slower storage and vice versa in order to improve latency for 'hot' objects

Universal access

Applications must be able to store and retrieve data using their existing protocols, without requiring changes to the applications themselves.
● Support the existing de facto object standards
  o S3
  o OpenStack
  o CDMI
  o Open for inclusion and interoperability
● Storage systems must be able to interoperate with traditional file protocols
  o NFS
  o CIFS
  o Hadoop HDFS

GEO dispersal

● Reduce fault correlation
● Bring storage elements closer to the client
● Temporary loss of a single site should preserve both consistency and availability (a quorum sketch follows below)
  o R + W > IDA_Width
  o The 2-site problem for an efficient IDA
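A minimal sketch of the R + W > Width quorum condition and why single-site loss can keep both reads and writes going; the site layout and threshold values are illustrative assumptions, not a specific product configuration.

```python
def single_site_loss_ok(width: int, sites: int, ida_threshold: int,
                        write_quorum: int, read_quorum: int) -> bool:
    """True if reads and writes stay consistent and available after losing one site.

    Assumes slices are spread evenly across sites. Consistency needs overlapping
    read/write quorums (R + W > Width); availability needs both quorums to be
    reachable from the surviving sites; reads must also meet the IDA threshold.
    """
    per_site = width // sites
    surviving = width - per_site
    consistent = read_quorum + write_quorum > width
    available = write_quorum <= surviving and read_quorum <= surviving
    decodable = read_quorum >= ida_threshold
    return consistent and available and decodable

# Width 12 spread over 3 sites (4 slices each), IDA threshold 7:
print(single_site_loss_ok(12, 3, 7, write_quorum=8, read_quorum=7))  # True
# With only 2 sites (6 slices each) the same condition cannot hold:
print(single_site_loss_ok(12, 2, 7, write_quorum=8, read_quorum=7))  # False
```

The second call illustrates the "2-site problem": with half the slices in each site, no quorum choice can both overlap (R + W > Width) and fit inside a single surviving site.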

Computational compatibility

The stored data should be analyzable:
● Logs
● Intelligence data, pattern search
● User data (indexing, face recognition)

Hadoop is the most popular analytics system

● Hadoop support is essential
● Native HDFS support facilitates Hadoop deployment on storage nodes

Challenge: avoid data movement between storage nodes
● Apply IDA differently

Conclusions

In this lecture we discussed:
• Storage industry trends
• Existing storage system limitations
• Modern storage requirements
• The Information Dispersal Algorithm and its fit for solving storage problems
• Major modern storage system requirements
In the next hour we discuss:
• What it takes to build a practical modern storage system
• Major practical system characteristics
• Examples of how these principles can be applied