Loss at BNL (1/2)

• What: Storage controller in front of a 64 TB disk array became unresponsive
  – Backend hardware RAID controller suffered a hardware problem leading to memory cache corruption, which resulted in corruption of the ZFS filesystem
  – Impact: Led to corruption of labels on each of the vdevs (virtual devices) in the data pool. Importing and recovering the ZFS data pool without knowing the root MOS (Meta Object Set) location is not possible
• When: June 20, 2013
  – Storage controller hung and failed to restart
    • Was replaced by on-site spare, detected metadata corruption
    • Opened case with Oracle Support
    • Informed ADC and DDM Operations about the potential loss
    • Prepared list of potentially affected files and copied it to the DDM AFS area
  – Unsuccessful attempt to "roll back" using the ZFS debugger
  – Scanned LUNs to find traces of valid ZFS uberblocks (see the sketch below)
    • ZFS labels of each raw device were dumped and analyzed
    • Only 1 valid uberblock was found; areas where other uberblocks should have been were overwritten by random data
      – Rootbp data virtual addresses pointing to 512- range with all zeros
  – June 21: recovery attempts continued, adding experts from companies
    • Concluded recovery is impossible
    • Notified ADC and DDM Operations that files are lost
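
The LUN scan mentioned above essentially means walking the known uberblock locations on each raw device and looking for the uberblock magic number. The sketch below is a hypothetical Python illustration of that idea, not the tool actually used at BNL (the label dump in the report would normally be done with the ZFS debugger, zdb). The geometry assumed here follows the ZFS on-disk format specification: four 256 KiB labels per vdev (two at the front, two at the end), each with a 128 KiB uberblock ring at offset 128 KiB, and 1 KiB uberblock slots (ashift = 9). The device path is a placeholder.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: scan a raw device (or dd image) for valid-looking
ZFS uberblocks and report their transaction groups and timestamps."""
import struct
import sys

UB_MAGIC = 0x00BAB10C          # uberblock magic ("oo-ba-bloc")
LABEL_SIZE = 256 * 1024        # each vdev label is 256 KiB
UB_ARRAY_OFFSET = 128 * 1024   # uberblock ring starts 128 KiB into the label
UB_SLOT = 1024                 # one uberblock slot, assuming ashift = 9

def label_offsets(dev_size):
    """Offsets of the four labels: L0, L1 at the front, L2, L3 at the end."""
    return [0, LABEL_SIZE, dev_size - 2 * LABEL_SIZE, dev_size - LABEL_SIZE]

def scan(path):
    with open(path, "rb") as f:
        f.seek(0, 2)                 # seek to end to learn the device size
        size = f.tell()
        for label, base in enumerate(label_offsets(size)):
            for slot in range(128):  # up to 128 generations per label
                f.seek(base + UB_ARRAY_OFFSET + slot * UB_SLOT)
                raw = f.read(40)     # ub_magic, ub_version, ub_txg,
                if len(raw) < 40:    # ub_guid_sum, ub_timestamp (5 x uint64)
                    continue
                for endian in ("<", ">"):   # magic is stored in native order
                    magic, ver, txg, _guid, ts = struct.unpack(endian + "5Q", raw)
                    if magic == UB_MAGIC:
                        print(f"label L{label} slot {slot:3d}: "
                              f"txg={txg} version={ver} timestamp={ts}")
                        break

if __name__ == "__main__":
    scan(sys.argv[1])   # e.g. a raw LUN device node or an image copied with dd
```

A healthy pool would show many such hits per label; in the incident described above, only a single valid uberblock turned up and the remaining slots contained random data.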

Loss at BNL (2/2)

• Impact to ATLAS: 618,915 files were lost
  – These were disk-resident files (T0D1) belonging to datadisk, userdisk, mcdisk, groupdisk, atlasproddisk and scratchdisk
• The loss affects ATLAS-managed production and user analysis

BNL T1 uses ZFS Filesystem

• ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include protection against data corruption, support for high storage capacities, integration of the concepts of file system and volume management, snapshots and copy-on-write clones (see the toy sketch below), continuous integrity checking and automatic repair
  – Protection against "silent data corruption"
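
As a toy illustration of the copy-on-write snapshot idea listed above (this is not ZFS code, and it ignores the real block-tree mechanics): a snapshot merely freezes the current set of block references, and later writes allocate new blocks instead of overwriting old ones, so the snapshot keeps seeing the old data at essentially zero cost until the blocks diverge.

```python
#!/usr/bin/env python3
"""Toy sketch of copy-on-write snapshots (illustrative only)."""

class CowFile:
    def __init__(self, blocks):
        self.blocks = list(blocks)      # the "live" block references

    def snapshot(self):
        # A snapshot copies only the block references, not the data itself.
        return tuple(self.blocks)

    def write(self, index, data):
        # Copy-on-write: the old block is never modified in place; the live
        # view simply points at a newly allocated block.
        self.blocks[index] = data

f = CowFile([b"block0", b"block1"])
snap = f.snapshot()

f.write(1, b"block1-modified")
print([b.decode() for b in f.blocks])   # live view:    ['block0', 'block1-modified']
print([b.decode() for b in snap])       # snapshot view: ['block0', 'block1']
```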

ZFS Filesystem and Protection against Data Corruption

• One major feature that distinguishes ZFS from other file systems is that ZFS is designed from the ground up with a focus on data integrity. That is, it is designed to protect the user's data on disk against silent data corruption caused by bit rot, current spikes, bugs in disk firmware, phantom writes (the write is dropped on the floor), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc. (see the toy checksum sketch below)
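
A minimal toy sketch of the end-to-end checksum idea behind this protection (assumptions: not ZFS code; SHA-256 stands in for ZFS's fletcher4/sha256 checksums; the "disk" is a Python list): the checksum of every block is stored in its parent block pointer, so a silently corrupted block is detected when it is read back rather than trusted simply because the disk returned something.

```python
#!/usr/bin/env python3
"""Toy sketch: checksums stored in the parent pointer catch silent corruption."""
import hashlib

class Disk:
    """Fake block device."""
    def __init__(self):
        self.blocks = []
    def write(self, data):
        self.blocks.append(bytes(data))
        return len(self.blocks) - 1
    def read(self, addr):
        return self.blocks[addr]

class BlockPtr:
    """Parent-held pointer: records where the child block lives and what its
    checksum must be."""
    def __init__(self, disk, data):
        self.addr = disk.write(data)
        self.checksum = hashlib.sha256(data).digest()

    def read(self, disk):
        data = disk.read(self.addr)
        if hashlib.sha256(data).digest() != self.checksum:
            raise IOError(f"checksum mismatch at block {self.addr}: "
                          "silent corruption detected")
        return data

disk = Disk()
bp = BlockPtr(disk, b"user data that must survive intact")

# A firmware bug / phantom write flips bits behind the filesystem's back:
disk.blocks[bp.addr] = b"user data that must survive intacT"

try:
    bp.read(disk)
except IOError as e:
    print(e)   # corruption is caught on read, not silently returned to the user
```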

ZFS and RAID

• ZFS can be used with and without H/W RAID controllers
• ZFS can be/is used at BNL with H/W RAID
  – Improves READ performance and increases density
  – RAID controllers have sophisticated drive diagnostics
• ZFS offers software RAID through its RAID-Z and mirroring organization schemes. RAID-Z is invulnerable to the write-hole error, which other types of RAID suffer from

ZFS Structure

• Binary tree structure
  – Multiple (4) copies of uberblocks from 128 generations at different physical locations of the disk
    • Where the Root Block Pointer (rootbp) pointing to the rest of the tree is stored
  – A valid rootbp is required to reconstruct the rest of the tree
• ZFS is based on transactions
  – Keeping transaction logs
  – Allowing transactional groups (txgs) to be "rolled back" in order to salvage the data (see the sketch after the Outlook slide)

Summary

• We regret the disruption to production and analysis activities the data loss has caused, and the additional effort necessary to recover from the loss
  – A rare incident, despite the T1 having chosen a proven technical solution and following best practices
• To minimize the risk, frequent backups are taken of the uberblock array/labels to locate the meta object set (MOS) to salvage the pool
  – Though rare, data will likely be lost in the future
    • More components, fs/storage pools growing
      – In this case we've lost 0.6% relative to the total storage capacity
      – Potential for loss due to disk controller failure: 18% per system (2 PB)
    • Questionable whether we can afford linear growth based on today's solutions

Outlook

• Started a while back to look into ways to make the disk storage solution a better fit to the program
  – Reliability, performance and cost
  – Considering incremental and "disruptive" approaches
• At this point we don't believe incremental changes are sufficient to cope with future needs (as far as we understand them)
  – To prevent facilities from losing data due to H/W, fs, OS, SPoF, we need to incorporate enough redundancy to tolerate component failures throughout the system, e.g.
    • Distributed and redundant (fs) metadata (storage and service)
    • Data duplication
  – Aim at having an implementation ready and a fraction of the T1 disk storage inventory on the new system by the end of LS1
    • All equipment replacements and capacity additions will be made based on the new system
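
To illustrate the "roll back" idea from the ZFS Structure slide (a hypothetical sketch, not the actual recovery procedure; uberblocks are reduced to small dictionaries with made-up values): a normal import uses the valid uberblock with the highest txg, while a salvage attempt deliberately caps the txg so the import lands on an older, hopefully intact, rootbp. This is also why the incident above was unrecoverable: with only one valid uberblock left, there was no older generation to fall back to.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of choosing an uberblock generation for import/roll-back."""

def pick_uberblock(uberblocks, max_txg=None):
    """Return the newest valid uberblock, optionally capped at max_txg
    (the moral equivalent of importing the pool at an older txg)."""
    candidates = [ub for ub in uberblocks
                  if ub["valid"] and (max_txg is None or ub["txg"] <= max_txg)]
    return max(candidates, key=lambda ub: ub["txg"], default=None)

# Example ring of generations (illustrative values only):
ring = [
    {"txg": 100, "valid": True,  "rootbp": "dva:0x1000"},
    {"txg": 101, "valid": True,  "rootbp": "dva:0x2000"},
    {"txg": 102, "valid": False, "rootbp": "dva:0x0000"},   # overwritten / garbage
]

print(pick_uberblock(ring))               # normal import: picks txg 101
print(pick_uberblock(ring, max_txg=100))  # "rolled back" import: picks txg 100
```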