Analytics for Object Storage Simplified - Unified File and Object for Hadoop

Sandeep R Patil, STSM, Master Inventor, IBM Spectrum Scale

Smita Raut, Object Development Lead, IBM Spectrum Scale

Acknowledgement: Bill Owen, Tomer Perry, Dean Hildebrand, Piyush Chaudhary, Yong Zeng, Wei Gong, Theodore Hoover Jr, Muthuannamalai Muthiah.

Agenda

• Part 1: Need as well as Design Points for Unified File and Object
   Introduction to Object Storage
   Unified File & Object Access
   Use Cases Enabled by UFO

• Part 2: Analytics with Unified File and Object
   Big Data Analytics and Challenges
   Design Points, Approach and Solution
   Unified File & Object Store and Cognitive Computing

Part 1: Need as well as Design Points for Unified File and Object

 Object Storage Introduction

Introduction to Object Store

• Object storage is highly available, distributed, eventually consistent storage.
• Data is stored as individual objects, each with a unique identifier.
● A flat addressing scheme allows for greater scalability.
• Simpler data management and access.
• REST-based data access with simple atomic operations: PUT, POST, GET, DELETE (a minimal access sketch follows this list).
● Usually software-based, running on commodity hardware.
● Capable of scaling to 100s of petabytes.
● Uses replication and/or erasure coding for availability instead of RAID.
● Access over a RESTful API over HTTP, which is a great fit for cloud and mobile applications – e.g., the Swift and CDMI APIs.
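As a rough illustration of the REST data path, here is a minimal sketch that PUTs, GETs, and DELETEs an object over the Swift API using Python's requests library; the endpoint URL, account, container, and token below are placeholders, not values from this deck.

```python
import requests

# Placeholder Swift endpoint and auth token; in a real deployment these
# are issued by the auth service (e.g., Keystone) after authentication.
STORAGE_URL = "https://swift.example.com/v1/AUTH_acct"
HEADERS = {"X-Auth-Token": "<auth-token>"}

# PUT: create (or atomically overwrite) an object in a container.
with open("eiffel.jpg", "rb") as f:
    resp = requests.put(f"{STORAGE_URL}/photos/eiffel.jpg",
                        headers=HEADERS, data=f)
resp.raise_for_status()          # 201 Created on success

# GET: read the object back by its flat address (account/container/object).
resp = requests.get(f"{STORAGE_URL}/photos/eiffel.jpg", headers=HEADERS)
resp.raise_for_status()
print(len(resp.content), "bytes retrieved")

# DELETE: remove the object.
requests.delete(f"{STORAGE_URL}/photos/eiffel.jpg", headers=HEADERS)
```

Each call is a single atomic operation against the object's flat address; there is no directory hierarchy to traverse, which is what makes the scheme easy to scale.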

Object Storage Enables The Next Generation of Data Management

• Simple APIs/semantics (Swift/S3, versioning, whole-file updates)
• Scalable
• Ubiquitous access
• Multi-tenancy
• Metadata access
• Simpler management and a flatter namespace
• Scalable and highly available
• Multi-site cloud storage
• Cost savings

But does it create yet another storage island in your data center…??

 Unified File and Object Access

What is Unified File and Object Access?

• Accessing objects using file interfaces (SMB/NFS/POSIX) and accessing files using object interfaces (REST) helps legacy applications designed for files to seamlessly start integrating into the object world.
• It allows object data to be accessed using applications designed to process files, and it allows file data to be published as objects.
• Multi-protocol access for file and object in the same namespace (with common user ID management capability) allows supporting and hosting data oceans of different types of data with multiple access options.
• Optimizes various use cases and solution architectures, resulting in better efficiency as well as cost savings.
• File exports are created at the container level, or POSIX access is available from the container level; with Swift (plus Swift-on-File), data ingested as objects is accessible as files, and data ingested as files is accessible as objects (see the sketch below).
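To make the dual access concrete, here is a minimal sketch assuming a Swift-on-File style layout in which an object lands as a regular file under the fileset path; the endpoint, token, mount point, and fileset path are hypothetical placeholders.

```python
import requests

STORAGE_URL = "https://swift.example.com/v1/AUTH_acct"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<auth-token>"}               # placeholder token

# 1. Ingest data through the object interface (REST).
requests.put(f"{STORAGE_URL}/container1/report.csv",
             headers=HEADERS, data=b"id,value\n1,42\n")

# 2. Read the very same data through the file interface (POSIX).
# With unified file and object access the object is a plain file under the
# fileset; this path layout is illustrative, not exact.
path = "/ibm/gpfs0/obj_fileset/AUTH_acct/container1/report.csv"
with open(path) as f:
    print(f.read())      # same bytes as the PUT, with no copy in between
```

The same flow also runs in reverse: a file written over NFS/SMB/POSIX into the container path becomes GETtable as an object.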

Flexible Identity Management Modes

• Two identity management modes
• Administrators can choose based on their need and use case

Identity Management Modes

Local_Mode
• Suitable when the auth schemes for file and object are different and unified access is for applications.
• Objects created through the object interface are owned by the internal "swift" user (the backing file is owned by that user).
• Applications processing the object data from the file interface need the required file ACLs to access the data.
• The object authentication setup is independent of the file authentication setup.

Unified_Mode
• Suitable for unified file and object access for end users; leverages common ILM policies for file and object data based on data ownership.
• Objects created through the object interface are owned by the user doing the object PUT (i.e., the file is owned by that user's UID/GID).
• The owner of the object owns and has access to the data from the file interface (a small illustration follows).
• Object and file users are expected to share common auth and come from the same directory service (AD+RFC 2307 or LDAP only).
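As a small illustration of the unified_mode ownership semantics (all endpoints, paths, and names below are hypothetical): after an end user PUTs an object, the backing file should be owned by that user's UID rather than by the internal swift user.

```python
import os
import pwd
import requests

# End user "alice" PUTs an object using her own token (placeholder values).
requests.put("https://swift.example.com/v1/AUTH_acct/cont/data.txt",
             headers={"X-Auth-Token": "<alice-token>"}, data=b"hello")

# In unified_mode the backing file carries alice's UID/GID, so she can read
# it directly over NFS/SMB/POSIX without any extra file ACLs.
st = os.stat("/ibm/gpfs0/obj_fileset/AUTH_acct/cont/data.txt")  # illustrative path
print(pwd.getpwuid(st.st_uid).pw_name)   # expected: "alice", not "swift"
```

Under local_mode the same stat would report the internal "swift" owner, and alice would need explicit file ACLs for POSIX access.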

 Use Cases Enabled by Unified File and Object

Use Case: Process Object Data with File-Oriented Applications and Publish Outcomes as Objects

(Diagram) A media house runs an OpenStack cloud platform (tenant = media house), with a VM farm for video processing per subsidiary, all on a unified file & object store:
• Raw media content is ingested as objects into Container 1 and Container 2, one per subsidiary.
• Manila shares (NFS exports) on each container are exported only to the corresponding subsidiary's VM farm, so media processing happens over files (object-to-file access).
• The final processed videos are converted back into objects (file-to-object access) and land in a container used for external publishing, where the media objects are available for streaming to the publishing channels.

We have now understood Part 1: Need as well as Design Points for Unified File and Object.

….. Let us now deep dive into Part 2: Analytics with Unified File and Object

 Big Data Analytics and Challenges

Analytics – Broadly Categorized Into Two Sets

• Traditional Analytics
• Cognitive

Big Data
• Big data is a term for data sets that are so large or complex that traditional data processing applications (such as database management tools) are inadequate.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Characteristics
• Volume – the quantity of generated and stored data.
• Variety – the type and nature of the data.
• Velocity – the speed at which the data is generated and processed.
• Variability – inconsistency of data sets can hamper the processes that manage them.
• Veracity – the quality of captured data can vary greatly, affecting accuracy.

Challenges with the Early Big Data Models

! Key business processes now depend on analytics
! It's not just one type of analytics

Ingest data at various end points → Move data to the analytics engine → Perform analytics → Repeat!

! More data sources than ever before – not just data you own, but public or rented data
! It takes hours or days to move the data!
! You can't just throw away data, due to regulations or business requirements

 Design Points, Approach & Solution

What are the Solution Design Points that we came across?

1. A single name space to house all your data (files and objects)
2. Unified data access with file & object
3. Encryption for protection of your data
4. Optimized economics based on the value of the data
5. Geographically dispersed management of data, including disaster recovery
6. Bring analytics to the data

(Diagram: traditional applications, new-gen applications, and a compute farm all share the common namespace, powered by Spectrum Scale across flash, disk, tape, shared-nothing clusters, and off-premise storage.)

How Did We Approach The Solution & Address the Design Points?

– We took the data ocean approach.

(Architecture diagram) Users and applications – client workstations, traditional applications, new-gen applications, and a compute farm – access a single global namespace (meeting design point 1) served by a clustered Spectrum Scale deployment used by 4000+ customers:
• Meeting design point 2 – Unified file and object access, as explained previously: File (POSIX, NFS, SMB), Block (iSCSI, Cinder), Object (Swift, S3, Glance), OpenStack (Cinder, Manila, Glance, Swift), and Analytics (transparent HDFS, Spark).
• Meeting design point 3 – Encryption.
• Meeting design point 4 – Automated data placement and data migration across flash, disk, tape, shared-nothing clusters, and JBOD/JBOF (Spectrum Scale RAID), plus a transparent cloud tier.
• Meeting design point 5 – Worldwide data distribution across sites A, B, and C, including a DR site.
• Meeting design point 6 – Bring analytics to the data (expanded below).

Meeting Design Point 6 – Bring Analytics to Data

Hadoop – Key Platform for Big Data and Analytics
• An open-source software framework and the most popular BD&A platform.
• Designed for distributed storage and processing of very large data sets on computer clusters built from commodity hardware.
• The core of Hadoop consists of:
  o A processing part called MapReduce (a word-count sketch follows the distro list below)
  o A storage part, known as the Hadoop Distributed File System (HDFS)
  o Hadoop common libraries and components

• Leading Hadoop distros: Hortonworks, Cloudera, MapR, IBM IOP/BigInsights
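To ground the MapReduce part, below is a classic word count for Hadoop Streaming, which lets any executable act as mapper and reducer, so plain Python works; the file name and job invocation are illustrative.

```python
#!/usr/bin/env python3
# wordcount.py - a minimal Hadoop Streaming word count; run this same file
# as the mapper ("map" argument) and the reducer ("reduce" argument).
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming sorts mapper output by key, so equal words arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A typical submission uses the distro's streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" -reducer "wordcount.py reduce" -input /in -output /out.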

Meeting Design Point 6 – Bring Analytics to Data: HDFS Shortcomings
• HDFS is a shared-nothing architecture, which is very inefficient for high-throughput jobs (disks and cores must grow in the same ratio).
• Costly data protection: it uses 3-way replication, with limited RAID/erasure coding.
• It works only with Hadoop, i.e., weak support for file or object protocols.
• Clients have to copy data from enterprise storage to HDFS in order to run Hadoop jobs, which can result in running on stale data.

Meeting Design Point 6 – How to Bring Analytics to Data?

Desired solution: in-place analytics (no copies required); the clustered filesystem should support HDFS connectors.
How we designed around the inhibitors: by developing an HDFS Transparency connector together with Unified File and Object Access.

(Diagram) Applications address the cluster as hdfs://hostnameX:portnumber. Hadoop clients – including higher-level languages such as Hive, BigSQL, JAQL, and Pig, and the MapReduce API – use the standard Hadoop FileSystem API and HDFS client; HDFS RPC travels over the network to the Spectrum Scale connector server. The earlier GPFS connector service implemented the Hadoop FS APIs on each GPFS node over the libgpfs/POSIX APIs, whereas the HDFS Transparency connector accepts native HDFS RPC, so unmodified HDFS clients work directly against Spectrum Scale on Power/Linux, on commodity hardware or shared storage (a client-side sketch follows).
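Because the connector speaks the HDFS wire protocol, existing Hadoop tooling needs no code changes: jobs simply point their hdfs:// URI at the connector. A minimal PySpark sketch (host, port, and path are placeholders) might look like:

```python
from pyspark.sql import SparkSession

# The hdfs:// URI is served by the HDFS Transparency connector, which maps
# requests onto the Spectrum Scale filesystem; host/port/path are illustrative.
spark = SparkSession.builder.appName("ufo-demo").getOrCreate()

df = spark.read.text("hdfs://scale-node1:8020/obj_fileset/container1/logs.txt")
print(df.count(), "lines")   # the job reads the data in place, no copy to HDFS
spark.stop()
```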

Meeting Design Point 6 – "In-Place" Analytics for Unified File and Object Data Achieved

Analytics on a Traditional Object Store vs. Analytics with Unified File and Object Access
• Traditional object store: data has to be copied from the object store to a dedicated analytics cluster (Spark or Hadoop MapReduce), analyzed there, and the results copied back to the object store for publishing – explicit data movement in both directions. (Source: https://aws.amazon.com/elasticmapreduce/)
• Object store with Unified File and Object Access: object data ingested over HTTP is available as files on the same fileset, so analytics systems like Hadoop MapReduce or Spark can consume the data directly. Data is ingested as objects, analyzed in place, and the results are published as objects from the same Unified File and Object fileset on IBM Spectrum Scale – no data movement, i.e., immediate in-place analytics (an end-to-end sketch follows).
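Putting the pieces together, here is a hedged end-to-end sketch of the in-place loop, with every endpoint, token, and path a placeholder: ingest through Swift, analyze the backing file in place with Spark, and publish the result back as an object.

```python
import requests
from pyspark.sql import SparkSession, functions as F

SWIFT = "https://swift.example.com/v1/AUTH_acct"   # placeholder endpoint
HDRS = {"X-Auth-Token": "<auth-token>"}            # placeholder token

# 1. Ingest: the raw data arrives as an object.
with open("clicks.csv", "rb") as f:
    requests.put(f"{SWIFT}/events/clicks.csv", headers=HDRS, data=f)

# 2. Analyze in place: the same bytes are visible as a file on the fileset
# (illustrative path), so Spark reads them without copying into HDFS.
spark = SparkSession.builder.appName("inplace").getOrCreate()
df = spark.read.csv("/ibm/gpfs0/obj_fileset/AUTH_acct/events/clicks.csv",
                    header=True)
summary = df.groupBy("page").agg(F.count("*").alias("hits")).toPandas()
spark.stop()

# 3. Publish: the result goes back out as an object for downstream consumers.
requests.put(f"{SWIFT}/results/hits.csv", headers=HDRS,
             data=summary.to_csv(index=False).encode())
```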

24  Unified File & Object Store and Cognitive Computing

Cognitive Services

• Cognitive services are based on technology that encompasses machine learning, reasoning, natural language processing, speech, vision, and more.
• IBM Watson Developer Cloud enables cognitive computing features in your app using IBM Watson's Language, Vision, Speech, and Data APIs.

Examples of Watson Services: Visual Recognition, Speech to Text, Tone Analyzer (e.g., on input audio/voice)
Some more…
• AlchemyLanguage – text analysis to give the sentiment of a document
• Language Translation – translate and publish content in multiple languages
• Personality Insights – uncover a deeper understanding of people's personalities
• Retrieve and Rank – enhance information retrieval with machine learning
• Natural Language Classifier – interpret and classify natural language with confidence
• And many more…

Cognitive services help derive the meaning of data.

Cognitive Services with Unified File and Object – How it Works
We need a provision to cognitively auto-tag heterogeneous data stored on IBM Spectrum Scale in the form of objects, to leverage its object storage benefits….

(Diagram) A user clicks a photo from a phone; it lands in SwiftOnFile-based object storage, and a service feeds the object to Watson, which returns tags (a hedged sketch of such a service follows).
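The sketch below pulls the image object over Swift, sends it to the Watson Visual Recognition classify endpoint (shown with v3-era parameters; the URL, API key, and version string are placeholders to check against current docs), and writes the top tags back onto the object as standard Swift metadata.

```python
import requests

SWIFT = "https://swift.example.com/v1/AUTH_acct"   # placeholder endpoint
HDRS = {"X-Auth-Token": "<auth-token>"}            # placeholder token
WATSON = "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify"

# 1. Fetch the image object that was just ingested from the phone.
img = requests.get(f"{SWIFT}/photos/eiffel.jpg", headers=HDRS).content

# 2. Ask Watson Visual Recognition to classify it (v3-era request shape;
# api_key and version are placeholders).
resp = requests.post(WATSON,
                     params={"api_key": "<watson-key>",
                             "version": "2016-05-20"},
                     files={"images_file": ("eiffel.jpg", img)})
classes = resp.json()["images"][0]["classifiers"][0]["classes"]

# 3. Write the top tags back as object metadata; X-Object-Meta-* headers
# are standard Swift object metadata set with a POST.
tags = ", ".join(f"{c['class']}: {c['score']:.2%}" for c in classes[:3])
requests.post(f"{SWIFT}/photos/eiffel.jpg",
              headers={**HDRS, "X-Object-Meta-WatsonTags": tags})
```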

Solution: integration of cognitive computing services with object storage for auto-tagging of unstructured data in the form of objects. Example tags returned for a photo:
{ "WatsonTags": "Eiffel": 94.78%, "Tower": 85.81%, "Tour": 66.82% }
• IBM has open-sourced sample code that allows you to auto-tag objects in the form of images (hosted over IBM Spectrum Scale Object).
• The solution can be extended to other object types depending on business need.
• The code and instructions are available on GitHub under the Apache license: https://github.com/SpectrumScale/watson-spectrum-scale-object-integration

Thank You