Proactively Preventing Data Corruption

Introduction

Data corruption is an insidious problem in storage. There are many types of corruption and many means to prevent them.

Enterprise class servers use error checking and correcting caches and memory to protect against single and double bit errors. System buses have similar protective measures such as parity. Communications going over the network are protected by checksums. On the storage side many installations employ RAID (Redundant Array of Inexpensive Disks) technology to protect against disk failure. In the case of hardware RAID, the array firmware will often use advanced checksumming techniques and media scrubbing to detect and potentially correct errors. The disk drives themselves also feature sophisticated error corrective measures, and storage protocols such as Fibre Channel and iSCSI feature a Cyclic Redundancy Check (CRC) that guards against data corruption on the wire.

At the top of the I/O stack, modern filesystems such as Oracle’s btrfs use checksumming techniques on both data and filesystem metadata. This allows the filesystem code to detect data that has gone bad either on disk or in transit. The filesystem can then take corrective action, fail the I/O request, or notify the user.

A common trait in most of the existing protective measures is that they work in their own isolated domains, or at best between two adjacent nodes in the I/O path. There has been no common method for ensuring true end-to-end data integrity... until now. Before describing this new technology in detail, let us take a look at how data corruption is handled by currently shipping products.

1 Data Corruption

Corruption can occur as a result of bugs in both software and hardware. A common failure scenario involves incorrect buffers being written to disk, often clobbering good data.

This latent type of corruption can go undetected for a long period of time. It may take months before the application attempts to reread the data from disk, at which point the good data may have been lost forever. Short backup cycles may even have caused all intact copies of the data to be overwritten.

A crucial weapon in preventing this type of error is proactive data integrity protection: a method that prevents corrupted I/O requests from being written to disk.

For several years Oracle has offered a technology called HARD (Hardware Assisted

Figure 1: Five different protection scenarios. Starting at the bottom, the “Normal I/O” line illustrates the disjoint integrity coverage offered using a current operating system and standard hardware. The “HARD” line shows the protection envelope offered by the Oracle database accessing a disk array with HARD capability. “DIF” shows coverage using the integrity portions of the SCSI protocol. “DIX” shows the coverage offered by the Data Integrity Extensions. And finally, at the top, “DIX + DIF” illustrates the full coverage provided by the Data Integrity Extensions in combination with T10 DIF.

Resilient Data), which allows storage systems to verify the integrity of an Oracle database logical block before it is committed to stable storage. Though the level of protection offered by HARD is mandatory in numerous enterprise and government deployments, adoption outside the mission-critical business segment has been slow. The disk array vendors that license and implement the HARD technology only offer it in their very high end products. As a result, Oracle has been looking to provide a comparable level of resiliency using an open and standards-based approach.

A recent extension to the SCSI family of protocols allows extra protective measures, including a checksum, to be included in an I/O request. This appended data is referred to as integrity metadata or protection information.

Unfortunately, the SCSI protection envelope only covers the path between the I/O controller and the storage device. To remedy this, Oracle and a few select industry partners have collaborated to design a method of exposing the data integrity features to the operating system. This technology, known as the Data Integrity Extensions, allows the operating system, and even applications such as the Oracle Database, to generate protection data that will be verified as the request goes through the entire I/O stack.

See Figure 1 for an illustration of the integrity coverage provided by the technologies described above.

Figure 2: DIF tuple contents

2 T10 Data Integrity Field

T10 is the INCITS standards body responsible for the SCSI family of protocols. Data corruption has been a known problem in the storage industry for years, and T10 has provided the means to prevent it by extending the SCSI protocol to allow integrity metadata to be included in an I/O request. The extension to the SCSI block device protocol is called the Data Integrity Field, or DIF.

Normal SCSI disks use a hardware sector size of 512 bytes. However, when used inside disk arrays the drives are often reformatted to a bigger sector size of 520 or 528 bytes. The operating system is only exposed to the usual 512 bytes of data. The extra 8 or 16 bytes in each sector are used internally by the array firmware for integrity checks.

DIF is similar in the sense that the storage device must be reformatted to 520-byte sectors. The main difference between DIF and proprietary array firmware is that the format of the extra 8 bytes of information per sector is well defined as well as being an open standard. This means that every node in the I/O path can participate in generating and verifying the integrity metadata.

Each DIF tuple is split up into three sections called tags. There is a 16-bit guard tag, a 16-bit application tag and a 32-bit reference tag.

The DIF specification lists several types of protection. Each of these protection types defines the contents of the three tag fields in the DIF tuple. The guard tag contains a 16-bit CRC of the 512 bytes of data in the sector. The application tag is for use by the application or operating system. Finally, the reference tag is used to ensure ordering of the individual portions of the I/O request; its contents vary depending on protection type. The most common of these is Type 1, in which the reference tag needs to match the 32 lower bits of the target sector's logical block address. This helps prevent misdirected writes, a common corruption error where data is written to the wrong place on disk.

If the storage device detects a mismatch between the data and the integrity metadata, the I/O will be rejected before it is written to disk. Also, since each node in the I/O path is free to inspect and verify the integrity metadata, it is possible to isolate points of error. For instance, it is conceivable that in the future advanced fabric switches will be able to verify the integrity as data flows through the Storage Area Network.

The fact that a storage device is formatted using the DIF protection scheme is transparent to the operating system. In the case of a write request, the I/O controller will receive a number of 512-byte buffers from the operating system and proceed to generate and append the appropriate 8 bytes of protection information to each sector.
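To make the tuple layout and the Type 1 checks concrete, here is a minimal sketch in C. The struct mirrors the three tags described above (the on-the-wire fields are big-endian, which the sketch glosses over), the guard calculation uses the T10 DIF CRC-16 polynomial 0x8BB7, and the verification function is hypothetical, illustrating the checks a node in the I/O path might perform rather than code from the specification.

    #include <stdint.h>
    #include <stddef.h>

    /* The 8-byte DIF tuple appended to each 512-byte sector. */
    struct dif_tuple {
            uint16_t guard_tag;   /* CRC of the sector data        */
            uint16_t app_tag;     /* application/OS defined        */
            uint32_t ref_tag;     /* Type 1: low 32 bits of LBA    */
    };

    /* Bitwise CRC-16 using the T10 DIF polynomial 0x8BB7.
     * Real implementations use lookup tables or hardware. */
    static uint16_t dif_crc16(const uint8_t *buf, size_t len)
    {
            uint16_t crc = 0;

            while (len--) {
                    crc ^= (uint16_t)(*buf++) << 8;
                    for (int bit = 0; bit < 8; bit++)
                            crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
                                                 : (crc << 1);
            }
            return crc;
    }

    /* Hypothetical Type 1 check: reject the sector unless both the
     * guard tag and the reference tag match. A disk performing this
     * on a write refuses to commit corrupted or misdirected data. */
    static int dif_type1_verify(const uint8_t *data, uint64_t lba,
                                const struct dif_tuple *t)
    {
            if (t->guard_tag != dif_crc16(data, 512))
                    return -1;            /* corrupted data     */
            if (t->ref_tag != (uint32_t)lba)
                    return -1;            /* misdirected write  */
            return 0;
    }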

Upon receiving the request, the SCSI disk will verify that the data matches the included integrity metadata. In the case of a mismatch, the I/O will be rejected and an error returned to the operating system.

Similarly, in the case of a read request the storage device will include the protection information and send 520-byte sectors to the I/O controller. The controller will verify the integrity of the I/O, strip off the protection data and return 512-byte data buffers to the operating system.

In other words, the added level of protection between controller and storage device is completely transparent to the operating system. Unfortunately, this also means the operating system is unable to participate in the integrity verification process. This is where the Data Integrity Extensions come in.

- Allows I/O controller and storage device to exchange protection information.
- Each data sector is protected by an 8-byte integrity tuple.
- The contents of this tuple include a checksum and an incrementing counter that ensures the I/O is intact.
- Both I/O controller and storage device can detect and reject corrupted requests.

Figure 3: T10 Data Integrity Field

3 Data Integrity Extensions

While increased resilience against errors between controller and storage device is an improvement, the design goal was to enable true end-to-end data integrity protection. An obvious approach was to expose the DIF information above the I/O controller level and let the operating system gain access to the integrity metadata.

3.1 Buffer Separation

Exposing the 520-byte sectors to the operating system is problematic, however. Internally, operating systems generally work with sizes that are multiples of 512 bytes. On X86 and X86_64 hardware the system page size is 4096 bytes, which means that 8 sectors fit nicely in a page. It is extremely inconvenient for the operating system to deal with buffers that are multiples of 520 bytes.

The Data Integrity Extensions allow the operating system to gain access to the DIF contents without changing its internal buffer size. This is achieved by separating the data buffers and the integrity metadata buffers. The controller firmware will interleave the data and integrity buffers on write and split them on read.

Separating the data from the integrity metadata in the operating system also reduces the risk of data corruption: now two buffers in different locations need to match up for an I/O request to be successfully completed.
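The firmware's merge and split operations amount to the following sketch. The function names and the flat buffers are invented for illustration; real controllers operate on scatter-gather lists rather than contiguous memory.

    #include <stdint.h>
    #include <string.h>

    #define SECTOR_DATA 512                          /* data per sector      */
    #define TUPLE_SIZE  8                            /* DIF tuple per sector */
    #define SECTOR_WIRE (SECTOR_DATA + TUPLE_SIZE)   /* 520 bytes on disk    */

    /* Write path: merge separate data and integrity buffers into
     * the 520-byte sectors a DIF-formatted device expects. */
    static void dix_interleave(const uint8_t *data, const uint8_t *pi,
                               uint8_t *wire, size_t sectors)
    {
            for (size_t s = 0; s < sectors; s++) {
                    memcpy(wire + s * SECTOR_WIRE,
                           data + s * SECTOR_DATA, SECTOR_DATA);
                    memcpy(wire + s * SECTOR_WIRE + SECTOR_DATA,
                           pi + s * TUPLE_SIZE, TUPLE_SIZE);
            }
    }

    /* Read path: split 520-byte sectors back out so the operating
     * system only ever sees 512-byte multiples. */
    static void dix_split(const uint8_t *wire, uint8_t *data,
                          uint8_t *pi, size_t sectors)
    {
            for (size_t s = 0; s < sectors; s++) {
                    memcpy(data + s * SECTOR_DATA,
                           wire + s * SECTOR_WIRE, SECTOR_DATA);
                    memcpy(pi + s * TUPLE_SIZE,
                           wire + s * SECTOR_WIRE + SECTOR_DATA, TUPLE_SIZE);
            }
    }

Note how small the second buffer is: for a 4096-byte page holding eight sectors, the matching integrity buffer is only 64 bytes.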

3.2 Performance Implications

DIF protection between I/O controller and disk is handled by custom hardware at near zero performance penalty. For true end-to-end data integrity, however, the application or the operating system needs to generate the protection information. Calculating the checksum in software obviously comes at a performance penalty, and the T10 DIF standard mandates a heavyweight 16-bit CRC algorithm for the guard tag.

This CRC is quite expensive to calculate compared to other commonly used checksums. To alleviate the impact on system performance, the Data Integrity Extensions allow an alternate checksum type to be used by the operating system: the TCP/IP checksum algorithm, which has an almost negligible cost. The I/O controller will convert the IP checksum to the DIF CRC when sending a request to the storage device, and vice versa.

The net result is that a full end-to-end protection envelope can be provided at a very low cost in terms of processing overhead.

- Allow transfer of data integrity information to and from the host operating system.
- Allow separation of data and integrity metadata buffers.
- Allow a lightweight checksum algorithm to limit impact on operating system performance.

Figure 4: Data Integrity Extensions
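The lightweight algorithm in question is the standard 16-bit one's complement sum from TCP/IP. A minimal sketch of it, for reference (the function name is mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Classic 16-bit one's complement Internet checksum. Far cheaper
     * to compute in software than the T10 CRC; the I/O controller
     * converts between the two formats in hardware. */
    static uint16_t ip_checksum(const void *buf, size_t len)
    {
            const uint16_t *p = buf;
            uint32_t sum = 0;

            while (len > 1) {
                    sum += *p++;
                    len -= 2;
            }
            if (len)                      /* trailing odd byte        */
                    sum += *(const uint8_t *)p;
            while (sum >> 16)             /* fold the carries back in */
                    sum = (sum & 0xffff) + (sum >> 16);
            return (uint16_t)~sum;
    }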

4 Linux Data Integrity Framework

Oracle has implemented support for DIF and the I/O Controller Data Integrity Extensions in the Linux kernel. The changes are released under the GNU General Public License and have been submitted for inclusion in the official kernel tree. With this, Linux becomes the first operating system to gain true end-to-end data integrity protection.

The Linux changes allow integrity metadata to be generated and passed through the I/O stack. Currently the extensions are only accessible from within the kernel, but a userland API is in development. The goal is for all applications to be able to benefit from the extra data protection features.
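Inside the kernel, the framework attaches the protection buffer to the bio describing an I/O. The fragment below is a rough sketch of how a kernel subsystem might do so by hand; bio_integrity_alloc() and bio_integrity_add_page() are the framework's calls, but their exact signatures and error conventions have shifted between kernel releases, so treat the details as illustrative.

    /* Sketch: attach one page of protection tuples to a bio before
     * submission. Error handling and API details vary by kernel
     * version; this only shows the shape of the interface. */
    struct bio_integrity_payload *bip;

    bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
    if (!bip)
            return -ENOMEM;

    /* prot_page holds one 8-byte tuple per 512-byte data sector. */
    bio_integrity_add_page(bio, prot_page, prot_len, 0);

    /* The bio is then submitted as usual; the block layer and the
     * HBA driver carry the integrity buffer alongside the data. */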

- Allows integrity metadata to be attached to an I/O request.
- Allows integrity metadata to be generated automatically for unmodified applications.
- Will allow advanced applications to manually send and receive integrity metadata.

Figure 5: Linux Data Integrity Framework

5 Future Developments

At a recent storage networking industry conference, Oracle and its partners demonstrated an unmodified Oracle Database running on Linux using the data integrity framework. The server used a prototype Emulex Fibre Channel controller, a disk tray from LSI and disk drives from Seagate. We demonstrated how errors could be injected into the system, identified, isolated and remedied without causing downtime or on-disk corruption.

The SCSI standard only governs communications between I/O controller and storage device, and as such the interface between the I/O controller and the operating system is outside the scope of the T10 organization. Consequently, Oracle and its partners have approached the Storage Networking Industry Association and set up the Data Integrity Task Force with the intent to standardize the data integrity interfaces for applications, operating systems and I/O controllers.

Hardware products supporting DIF and the Data Integrity Extensions are scheduled for release in 2008.

Author

Martin K. Petersen has been involved in Linux development since the early nineties. He has worked on PA-RISC and IA-64 Linux ports for HP as well as the XFS filesystem and the Altix kernel for SGI. Martin works in Oracle’s Linux Kernel Engineering group where he focuses on enterprise storage technologies. He can be reached at [email protected].
