iRODS and Spectrum Archive
Integration of iRODS with IBM Spectrum Archive™ Enterprise Edition
A flexible tiered storage archiving solution
Version 1.1 (02/17/2020)

Nils Haustein, IBM European Storage Competence Center
Mauro Tridici, Euro-Mediterranean Center on Climate Change (CMCC)

Contents

Introduction
  Executive summary
Tiered storage file systems
  IBM Spectrum Scale with IBM Spectrum Archive
  Challenges with tiered storage file systems
iRODS
Integration of iRODS with IBM Spectrum Archive
  Typical workflows
  Preventing transparent recalls
    Solution in action
  Further work
    Quota
    Automated file tagging
    Display file state
References
Disclaimer

© IBM Corporation 2020 and © CMCC Foundation 2020

Introduction

Archiving means storing data over long periods of time while providing search capabilities to find and access the data. One of the key requirements for archiving is cost efficiency, especially when the volume of archive data that must be kept for long periods of time grows into the petabyte range. Depending on the type of the archived data and its value, the requirements for performance and cost of storage change over time.

Different storage requirements can be addressed with different types of storage media, such as Flash and SSD, disk and tape. As Figure 1 shows, Flash and SSD media provide fast access and are the most expensive. Access to data on hard disk is slower, but disks are less expensive than Flash and SSD (considering enterprise-grade Flash and disk). Tape storage is suitable for storing large volumes of data over long periods of time at the lowest cost, but the access time to data on tape is significantly higher than to data on disk.

Figure 1: Storage media characteristics

Different storage media such as Flash and SSD, disk and tape can be combined in a tiered storage system. Tiered storage systems can automatically move archived data between the different storage tiers based on policies. This ensures that archived data is stored on the most appropriate storage media throughout its lifecycle.
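
The idea of policy-based movement between tiers can be sketched in a few lines of Python. This is only an illustration of the concept, not the policy language of any product: the tier names, the idle-time thresholds of 7 and 30 days, and the helpers `select_tier` and `plan_moves` are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
DISK_AFTER = timedelta(days=7)    # idle this long: move from flash to disk
TAPE_AFTER = timedelta(days=30)   # idle this long: move to tape


@dataclass
class ArchivedFile:
    path: str
    last_access: datetime


def select_tier(f: ArchivedFile, now: datetime) -> str:
    """Return the tier this illustrative policy assigns to a file."""
    idle = now - f.last_access
    if idle >= TAPE_AFTER:
        return "tape"
    if idle >= DISK_AFTER:
        return "disk"
    return "flash"


def plan_moves(files, current_tiers, now):
    """List (path, from_tier, to_tier) for every file whose tier should change."""
    moves = []
    for f in files:
        target = select_tier(f, now)
        source = current_tiers[f.path]
        if source != target:
            moves.append((f.path, source, target))
    return moves
```

Running such a plan periodically in the background mirrors how a tiered storage system keeps each file on the tier that matches its current access pattern, without the user having to move files manually.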

Another key component for archiving is a data management system that indexes archived data and allows users to search, share and access the data they are authorized for. Data management capabilities can be provided by the open source data management software iRODS [1].

In this paper we describe the challenges of tiered storage systems that use tape as a storage tier and how these challenges can be addressed by combining iRODS and IBM Spectrum Archive™ Enterprise Edition. We demonstrate how to prevent transparent recalls when migrated files are accessed through iRODS and instead perform tape optimized recalls with IBM Spectrum Archive. Furthermore, we demonstrate how to configure a quota in iRODS that applies to the disk and tape storage tiers. Finally, we demonstrate harvesting metadata from the file content and using it for subsequent searches.

Executive summary

The combination of iRODS and IBM Spectrum Archive Enterprise Edition makes it possible to establish a tiered storage environment in which transparent recalls, triggered when a user accesses a migrated file, are prevented. Instead, migrated files are recalled in a tape optimized manner, which is much faster than transparent recalls and uses fewer resources. The user can also check the state of a file to determine whether it resides on disk or on tape in the tiered storage file system.

Furthermore, this combination makes it possible to enforce a quota that applies to all user data stored in the tiered storage environment, regardless of whether the data is stored on the disk tier or the tape tier. This prevents users from storing an unlimited amount of data in a tiered storage environment.

The automatic data harvesting and indexing capabilities of iRODS make it possible to automatically extract information from files after they have been stored in the iRODS system and before these files are migrated to tape.
The harvested information can be used to discover and search for data based on metadata.

Finally, iRODS’ flexible plugin architecture, its data virtualization capabilities and its indexing and search capabilities, paired with the cost efficiency of tiered storage file systems provided by IBM Spectrum Scale™ with IBM Spectrum Archive™ Enterprise Edition, make this combination a perfect archiving solution for any type of data. There are many other use cases that can be addressed with the combination of iRODS and IBM Spectrum Archive Enterprise Edition.

Tiered storage file systems

A tiered storage file system (Figure 2) provides flash, disk and tape storage within a global file system namespace, where a space management component transparently moves files between the storage tiers. The file system namespace is accessible to users and applications through standard file system protocols such as NFS, SMB and POSIX.

Figure 2: Tiered storage file system

Tiered storage file systems with tape are primarily used for archiving large volumes of data for long periods of time. An example of a tiered storage file system is IBM Spectrum Scale with IBM Spectrum Archive Enterprise Edition, which is explained in the next section.

IBM Spectrum Scale with IBM Spectrum Archive

IBM Spectrum Scale™ [2] is a software-defined, scalable parallel file system providing tiered storage capabilities. IBM Spectrum Scale file systems are provided by servers that are configured in a cluster. The storage used for the file system can be located within the servers or connected to them externally. IBM Spectrum Scale can also export the file systems via standardized protocols such as NFS, SMB, HDFS and Object. IBM Spectrum Archive Enterprise Edition [3] provides the tape tier within an IBM Spectrum Scale file system.
IBM Spectrum Archive manages the tape library operations and stores the data on the tape tier in the Linear Tape File System (LTFS) format [4].

As shown in Figure 3, the combination of IBM Spectrum Scale with IBM Spectrum Archive provides a tiered storage file system with different storage media including Flash and SSD, disk and tape. While Flash and disk storage are managed by IBM Spectrum Scale directly, the tape storage is managed by IBM Spectrum Archive. IBM Spectrum Scale provides a policy engine that places files on a storage tier when they are created and migrates them to other storage tiers over the data lifecycle. Policies are defined and tested once and can then be configured to run automatically in the background. For example, a policy can place all new files on the disk storage tier of the IBM Spectrum Scale file system and migrate files to tape storage if they have not been accessed for 30 days.

Figure 3: Combination of IBM Spectrum Scale with IBM Spectrum Archive tape tier

After files have been migrated to tape, they are still visible in the IBM Spectrum Scale file system and can be readily accessed. Upon access, the files are automatically copied (recalled) from tape back to disk storage. For this purpose, the tape ID is stored in the file metadata.

There are two kinds of recalls: transparent recall and tape optimized recall. A transparent recall is triggered automatically when a migrated file is accessed in the file system. IBM Spectrum Archive loads the tape identified by the tape ID referenced in the file stub, positions to the file on tape and reads the file back to disk. When multiple migrated files are accessed at the same time, multiple independent transparent recalls are triggered, each requiring a tape to be loaded, positioned and a file