Proactively Preventing Data Corruption
Introduction

Data corruption is an insidious problem in storage. There are many types of corruption and many means to prevent them.

Enterprise class servers use error checking and correcting caches and memory to protect against single and double bit errors. System buses have similar protective measures such as parity. Communications going over the network are protected by checksums. On the storage side many installations employ RAID (Redundant Array of Inexpensive Disks) technology to protect against disk failure. In the case of hardware RAID, the array firmware will often use advanced checksumming techniques and media scrubbing to detect and potentially correct errors. The disk drives themselves also feature sophisticated error corrective measures, and storage protocols such as Fibre Channel and iSCSI feature a Cyclic Redundancy Check (CRC) that guards against data corruption on the wire.

At the top of the I/O stack, modern filesystems such as Oracle's btrfs use checksumming techniques on both data and filesystem metadata. This allows the filesystem code to detect data that has gone bad either on disk or in transit. The filesystem can then take corrective action, fail the I/O request or notify the user.

A common trait in most of the existing protective measures is that they work in their own isolated domains, or at best between two adjacent nodes in the I/O path. There has been no common method for ensuring true end-to-end data integrity... until now. Before describing this new technology in detail, let us take a look at how data corruption is handled by currently shipping products.

1 Data Corruption

Corruption can occur as a result of bugs in both software and hardware. A common failure scenario involves incorrect buffers being written to disk, often clobbering good data.

This latent type of corruption can go undetected for a long period of time. It may take months before the application attempts to reread the data from disk, at which point the good data may have been lost forever. Short backup cycles may even have caused all intact copies of the data to be overwritten.

A crucial weapon in preventing this type of error is proactive data integrity protection: a method that prevents corrupted I/O requests from being written to disk.

For several years Oracle has offered a technology called HARD (Hardware Assisted Resilient Data), which allows storage systems to verify the integrity of an Oracle database logical block before it is committed to stable storage. Though the level of protection offered by HARD is mandatory in numerous enterprise and government deployments, adoption outside the mission-critical business segment has been slow. The disk array vendors that license and implement the HARD technology only offer it in their very high end products. As a result, Oracle has been looking to provide a comparable level of resiliency using an open and standards-based approach.

A recent extension to the SCSI family of protocols allows extra protective measures, including a checksum, to be included in an I/O request. This appended data is referred to as integrity metadata or protection information. Unfortunately, the SCSI protection envelope only covers the path between the I/O controller and the storage device. To remedy this, Oracle and a few select industry partners have collaborated to design a method of exposing the data integrity features to the operating system. This technology, known as the Data Integrity Extensions, allows the operating system, and even applications such as the Oracle Database, to generate protection data that will be verified as the request goes through the entire I/O stack. See Figure 1 for an illustration of the integrity coverage provided by the technologies described above.

Figure 1: Five different protection scenarios. Starting at the bottom, the "Normal I/O" line illustrates the disjoint integrity coverage offered using a current operating system and standard hardware. The "HARD" line shows the protection envelope offered by the Oracle Database accessing a disk array with HARD capability. "DIF" shows coverage using the integrity portions of the SCSI protocol. "DIX" shows the coverage offered by the Data Integrity Extensions. And finally at the top, "DIX + DIF" illustrates the full coverage provided by the Data Integrity Extensions in combination with T10 DIF.

2 T10 Data Integrity Field

T10 is the INCITS standards body responsible for the SCSI family of protocols. Data corruption has been a known problem in the storage industry for years, and T10 has provided the means to prevent it by extending the SCSI protocol to allow integrity metadata to be included in an I/O request. The extension to the SCSI block device protocol is called the Data Integrity Field, or DIF.

- Allows I/O controller and storage device to exchange protection information.
- Each data sector is protected by an 8-byte integrity tuple.
- The contents of this tuple include a checksum and an incrementing counter that ensures the I/O is intact.

Normal SCSI disks use a hardware sector size of 512 bytes. However, when used inside disk arrays the drives are often reformatted to a bigger sector size of 520 or 528 bytes. The operating system is only exposed to the usual 512 bytes of data. The extra 8 or 16 bytes in each sector are used internally by the array firmware for integrity checks.

DIF is similar in the sense that the storage device must be reformatted to 520 byte sectors. The main difference between DIF and proprietary array firmware is that the format of the extra 8 bytes of information per sector is well defined as well as being an open standard. This means that every node in the I/O path can participate in generating and verifying the integrity metadata.

Each DIF tuple is split up into three sections called tags. There is a 16-bit guard tag, a 16-bit application tag and a 32-bit reference tag.

Figure 2: DIF tuple contents.

The DIF specification lists several types of protection. Each of these protection types defines the contents of the three tag fields in the DIF tuple. The guard tag contains a 16-bit CRC of the 512 bytes of data in the sector. The application tag is for use by the application or operating system, and finally the reference tag is used to ensure ordering of the individual portions of the I/O request. The reference tag varies depending on protection type. The most common of these is Type 1, in which the reference tag needs to match the 32 lower bits of the target sector's logical block address. This helps prevent misdirected writes, a common corruption error where data is written to the wrong place on disk.

If the storage device detects a mismatch between the data and the integrity metadata, the I/O will be rejected before it is written to disk. Also, since each node in the I/O path is free to inspect and verify the integrity metadata, it is possible to isolate points of error. For instance, it is conceivable that in the future advanced fabric switches will be able to verify the integrity as data flows through the Storage Area Network.

The fact that a storage device is formatted using the DIF protection scheme is transparent to the operating system. In the case of a write request, the I/O controller will receive a number of 512-byte buffers from the operating system and proceed to generate and append the appropriate 8 bytes of protection information to each sector. Upon receiving the request, the SCSI disk will verify that the data matches the included integrity metadata. In the case of a mismatch, the I/O will be rejected and an error returned to the operating system.

Similarly, in the case of a read request the storage device will include the protection information and send 520 byte sectors to the I/O controller. The controller will verify the integrity of the I/O, strip off the protection data and return 512 byte data buffers to the operating system.

In other words, the added level of protection between controller and storage device is completely transparent to the operating system. Unfortunately, this also means the operating system is unable to participate in the integrity verification process. This is where the Data Integrity Extensions come in.

3 Data Integrity Extensions

While increased resilience against errors between controller and storage device is an improvement, the design goal was to enable true end-to-end data integrity protection. An obvious approach was to expose the DIF information above the I/O controller level and let the operating system gain access to the integrity metadata.

3.1 Buffer Separation

Exposing the 520-byte sectors to the operating system is problematic, however. Internally, operating systems generally work with sizes that are multiples of 512. On X86 and X86_64 hardware the system page size is 4096 bytes, which means that 8 sectors fit nicely in a page. It is extremely inconvenient for the operating system to deal with buffers that are multiples of 520 bytes.

The Data Integrity Extensions allow the operating system to gain access to the DIF contents without changing its internal buffer size. This is achieved by separating the data buffers and the integrity metadata buffers. The controller firmware will interleave the data and integrity buffers on write
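The buffer separation scheme can be sketched as follows. This is a rough illustration with hypothetical helper names, not controller firmware: the operating system keeps ordinary 512-byte data buffers plus a separate integrity buffer of 8 bytes per sector, and the controller merges them into 520-byte sectors on write and splits them apart again on read.

```python
SECTOR = 512  # data bytes per sector, as seen by the operating system
PI = 8        # bytes of protection information appended per sector

def interleave(data: bytes, pi: bytes) -> bytes:
    """Merge separate data and integrity buffers into 520-byte sectors (write path)."""
    assert len(data) % SECTOR == 0
    sectors = len(data) // SECTOR
    assert len(pi) == sectors * PI
    out = bytearray()
    for i in range(sectors):
        out += data[i * SECTOR:(i + 1) * SECTOR]
        out += pi[i * PI:(i + 1) * PI]
    return bytes(out)

def split(stream: bytes) -> tuple[bytes, bytes]:
    """Strip protection information back out of 520-byte sectors (read path)."""
    assert len(stream) % (SECTOR + PI) == 0
    data, pi = bytearray(), bytearray()
    for i in range(len(stream) // (SECTOR + PI)):
        base = i * (SECTOR + PI)
        data += stream[base:base + SECTOR]
        pi += stream[base + SECTOR:base + SECTOR + PI]
    return bytes(data), bytes(pi)

# One 4096-byte page holds 8 sectors, so it pairs with a 64-byte integrity buffer.
page = bytes(range(256)) * 16
meta = bytes(64)
assert split(interleave(page, meta)) == (page, meta)
```

Keeping the two buffers separate is what lets the operating system continue to work in multiples of 512 bytes while still handing protection information down the stack.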
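Stepping back to the DIF tuple itself, the following sketch illustrates Type 1 generation and verification. The helper names are invented for illustration, and the details assumed here (big-endian field layout; the commonly cited guard-tag CRC polynomial 0x8BB7) should be checked against the T10 specification; real implementations live in controller firmware and kernel code.

```python
import struct

T10DIF_POLY = 0x8BB7  # generator polynomial commonly cited for the DIF guard CRC

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 (poly 0x8BB7, init 0, no reflection) over a sector's data."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ T10DIF_POLY) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def make_dif_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: 16-bit guard, 16-bit application tag, 32-bit reference tag."""
    assert len(sector) == 512
    guard = crc16_t10dif(sector)
    ref = lba & 0xFFFFFFFF  # Type 1: low 32 bits of the target logical block address
    return struct.pack(">HHI", guard, app_tag, ref)

def verify_dif_tuple(sector: bytes, pi: bytes, lba: int) -> bool:
    """Re-check guard and reference tags, as any node in the I/O path may do."""
    guard, _app, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)

# A misdirected write: the same sector checked against the wrong LBA is rejected.
sector = bytes(512)
pi = make_dif_tuple(sector, lba=1000)
assert verify_dif_tuple(sector, pi, lba=1000)
assert not verify_dif_tuple(sector, pi, lba=2000)
```

The reference-tag check at the end is the mechanism behind the misdirected-write protection described in section 2: a sector that lands at the wrong logical block address fails verification even though its data and guard tag are internally consistent.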