
Experiments on Disk Write Back Caches

Write back caches are implemented on hard disks to enhance write performance. ATA drives, in particular, rely on write back caches to compensate for their longer seek times and lower RPM compared to their SCSI / Fibre Channel counterparts. Some RAID controllers add a further write cache on the controller to enhance the overall performance of the system.

With write back caching turned on by default, an ATA drive can signal the completion of writes more quickly than if it had to wait until the data was completely transferred to the disk media. However, in the event of a failure (such as a power failure or a hardware failure), data corruption may occur if the data in the disk cache has not been flushed out to the disk media. Another problem with ATA write back cache is that data may be flushed to the media out of order: if block A arrives in the cache before block B, block B may still be written to the media before block A. Turning the write back cache off for ATA disks avoids the data corruption problem, but performance degrades. In addition, the drives will be used in a less reliable mode, since ATA vendors do not certify the recovery of drives that deactivate write-back caching.
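The reordering problem can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the behavior of any particular drive firmware: the class `WriteBackDisk`, its methods, and the block names are hypothetical and exist only for this illustration.

```python
class WriteBackDisk:
    """Toy model of a drive with a volatile write-back cache (hypothetical)."""

    def __init__(self):
        self.media = {}  # block name -> data that survives a power failure
        self.cache = {}  # block name -> data acknowledged but not yet on media

    def write(self, block, data):
        # The drive signals completion as soon as the data is in its cache.
        self.cache[block] = data
        return "ack"

    def flush(self, block):
        # The drive, not the host, decides which cached block goes out next.
        self.media[block] = self.cache.pop(block)

    def power_fail(self):
        # Everything still in the volatile cache is lost.
        lost = dict(self.cache)
        self.cache.clear()
        return lost


def demo_reordering():
    disk = WriteBackDisk()
    disk.write("A", "journal data")   # arrives first
    disk.write("B", "commit record")  # arrives second
    disk.flush("B")                   # drive happens to write B before A
    lost = disk.power_fail()          # power is cut before A is flushed
    return disk.media, lost


if __name__ == "__main__":
    media, lost = demo_reordering()
    print("on media:", media)
    print("lost:", lost)
```

In this sketch both writes were acknowledged to the host, yet after the power failure the media holds the later block without the earlier one it depends on.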

The chance of data corruption increases with RAID systems that leave write back cache on. A RAID system writes stripes that span multiple disks. Since there is no guarantee that all data blocks in a stripe will be flushed out to the disk media, the stripe may not be consistent. In the ATA world, hardware RAID vendors typically leave disk write back cache on by default. Some provide options to turn write back cache off. For example, the user manual for the 3ware RAID controller warns users that "there may be instances when you always want the computer to wait for the drive to write all the data to disk before going to its next task ... you must disable the write cache." (pages 54-55 of the 3ware RAID controller user manual).
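To make the stripe consistency argument concrete, here is a small Python sketch of a RAID5-style stripe with XOR parity. It is an illustration under assumptions: the block contents, the three-data-disk layout, and the helper names `xor_blocks` and `stripe_is_consistent` are hypothetical and do not describe any specific controller.

```python
import operator
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks, as used for RAID5 parity."""
    return bytes(reduce(operator.xor, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """A stripe is consistent when its parity matches its data blocks."""
    return xor_blocks(data_blocks) == parity


# Old and new contents of one stripe spread over three data disks.
old = [b"AAAA", b"BBBB", b"CCCC"]
new = [b"aaaa", b"bbbb", b"cccc"]
new_parity = xor_blocks(new)

# Assumed failure: power is lost after two data blocks and the parity block
# reached the media, while the third data block was still in a drive cache.
on_media = [new[0], new[1], old[2]]

# Rebuilding the first block from the torn stripe, as a RAID5 array would
# after losing that disk, does not give back the data that was written.
rebuilt = xor_blocks([on_media[1], on_media[2], new_parity])

if __name__ == "__main__":
    print("stripe consistent:", stripe_is_consistent(on_media, new_parity))
    print("rebuilt block matches new data:", rebuilt == new[0])
```

The point of the sketch is that a single unflushed block makes the parity disagree with the data, so a later rebuild from that stripe silently returns wrong contents.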

The questions are: what are such instances, and are they common? Also, what happens when the cache has not been flushed properly and the power goes out?

In this section, we report the experiments we conducted and our findings.

Experimental Setup

We use test servers with drives connected in different configurations:

- single disk with disk cache turned on
- multiple disks connected in RAID5 / RAID10 with disk cache turned on
- multiple disks connected in RAID5 / RAID10 with both controller cache and disk cache turned on (some RAID controllers include an additional write cache on the controller to enhance performance)

We tested ATA RAID as well as SCSI RAID controllers. We used EXT3 and ReiserFS as the test file systems.

The test servers are connected to a Network Power Switch, which can be programmed to automatically turn power supplies on or off. The servers run the test code automatically upon boot up. In each run, we leave the power on for each server for about 5 minutes before powering the server down and then up again.

The test code consists of file system operations that are metadata intensive, i.e., operations that constantly update the file system state. We run two sets of programs simultaneously: a dbench session with 64 clients and a simple script that constantly creates directories. These operations put a load on the file system similar to that of a file server.
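A directory-creation workload of the kind described above might look like the following Python sketch. This is not the script used in the experiments; the function name `create_directories`, the directory naming scheme, and the `limit` parameter are assumptions made for illustration.

```python
import itertools
import os
import tempfile


def create_directories(root, limit=None):
    """Create directories under `root` one after another.

    With limit=None the loop runs until the process is stopped (for example
    by a power cut); a finite limit makes the sketch easy to try out.
    Returns the number of directories created.
    """
    made = 0
    for i in itertools.count():
        if limit is not None and i >= limit:
            break
        # Each mkdir updates directory and inode metadata on the file system.
        os.mkdir(os.path.join(root, f"d{i:08d}"))
        made += 1
    return made


if __name__ == "__main__":
    # Try the workload in a scratch directory instead of a test file system.
    with tempfile.TemporaryDirectory() as scratch:
        print("created", create_directories(scratch, limit=100), "directories")
```

On a test server, `root` would point at the file system under test and `limit` would be left unset, so that the power switch interrupts the loop mid-update.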

We watch for errors reported by the file systems. Write cache problems will typically leave the metadata of the file system in a corrupted or inconsistent state.
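Watching for such errors can be automated by scanning the kernel log after each power cycle. The Python sketch below is one possible way to do it, not the tooling used in the experiments: the pattern list is an assumption modeled on the sample messages in Figures 1 to 3, and `find_fs_errors` is a hypothetical helper name.

```python
import re

# Patterns modeled on the sample kernel messages in this report; a real
# deployment would likely need a longer list.
ERROR_PATTERNS = [
    re.compile(r"EXT3-fs error"),
    re.compile(r"attempt to access beyond end of device"),
    re.compile(r"\b(?:vs|PAP)-\d+: reiserfs_"),
    re.compile(r"Assertion failure|kernel BUG"),
]


def find_fs_errors(lines):
    """Return the log lines that match any known file system error pattern."""
    return [line for line in lines
            if any(pattern.search(line) for pattern in ERROR_PATTERNS)]


if __name__ == "__main__":
    sample = [
        "Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) "
        "in ext3_new_inode: IO failure",
        "Jan 28 11:04:41 localhost sshd: session opened",
    ]
    for line in find_fs_errors(sample):
        print(line)
```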

Experimental Results

The results are very consistent. With write back cache enabled on the disks or the controllers, power failures produced file system errors that left the file systems either corrupted or inconsistent. In many cases, the file systems could no longer be mounted. Problems typically appeared within 50 power cycles, and they appeared sooner (typically in fewer than 10 power cycles) when the total cache was larger, such as when the RAID controller also had a write-back cache.

Below we show some file system error messages we have observed:

Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)): read_inode_bitmap: Cannot read inode bitmap - block_group = 576, inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)): read_inode_bitmap: Cannot read inode bitmap - block_group = 576, inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)): read_inode_bitmap: Cannot read inode bitmap - block_group = 576, inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)): read_inode_bitmap: Cannot read inode bitmap - block_group = 576, inode_bitmap = 4294967295

Figure 1: Sample errors from EXT3 file system

Jan 25 00:20:20 localhost last message repeated 6 times
Jan 25 00:20:20 localhost kernel: vs-13060: reiserfs_update_sd: stat data of object [137100 137105 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:20 localhost last message repeated 5 times
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of object [137100 137101 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost last message repeated 7 times
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of object [137100 137102 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of object [137100 137104 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of object [137100 137105 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost last message repeated 6 times
Jan 25 00:20:21 localhost kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for [137100 137101 0xfffffffffffffff DIRECT]
Jan 25 00:20:21 localhost kernel: vs-5355: reiserfs_delete_solid_item: [137100 137101 0x0 SD] not found<4>PAP-5660: reiserfs_do_truncate: wrong result -1 of search for [137100 137102 0xfffffffffffffff DIRECT]
Jan 25 00:20:21 localhost kernel: vs-5355: reiserfs_delete_solid_item: [137100 137102 0x0 SD] not found<4>PAP-5660: reiserfs_do_truncate: wrong result -1 of search for [137100 137104 0xfffffffffffffff DIRECT]
Jan 25 00:20:24 localhost kernel: vs-5355: reiserfs_delete_solid_item: [137100 137104 0x0 SD] not found<4>vs-5355: reiserfs_delete_solid_item: [137100 137105 0x0 SD] not found<4>vs-13060: reiserfs_update_sd: stat data of object [137072 137074 0x0 SD] (nlink == 1) not found (pos 9)

Figure 2: Sample errors from Reiserfs file system

Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)): ext3_free_blocks: bit already cleared for block 260096
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)): ext3_free_blocks: Freeing blocks not in datazone - block = 1043443757, count = 1
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)): ext3_free_blocks: Freeing blocks not in datazone - block = 1043443756, count = 1
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)): ext3_free_blocks: Freeing blocks not in datazone - block = 1043443756, count = 1
Jan 24 14:00:15 localhost kernel: Assertion failure in journal_forget_R80fce437() at transaction.c:1208: "!jh->b_committed_data"
Jan 24 14:00:15 localhost kernel: ------------[ cut here ]------------
Jan 24 14:00:15 localhost kernel: kernel BUG at transaction.c:1208!
 

Figure 3: Assertion errors from EXT3 file system

We have reported the file system errors to the Linux community and are working with them to better understand the nature of the problems. In our communications, we have received confirmation that the errors we have seen are due to disk write back cache.

The Lesson

The lesson is that, with write back cache turned on, it is not difficult to create metadata inconsistency or corruption in the file system upon power failure. Using existing RAID solutions with write back cache turned on may lead to irrecoverable corruption of the file system.

References

- Definition of Write Back Cache at SNIA Dictionary site:
"A caching technique in which the completion of a write request is signaled as soon as the data is in cache, and actual writing to non-volatile media occurs at a later time. Write-back cache includes an inherent risk that an application will take some action predicated on the write completion signal, and a system failure before the data is written to non-volatile media will cause media contents to be inconsistent with that subsequent action. For this reason, good write-back cache implementations include mechanisms to preserve cache contents across system failures (including power failures) and to flush the cache at system restart time. "
 
- Hdparm is a Linux utility for accessing and controlling the parameters of ATA disks. It can be used to turn write caching off.
Also, different perspectives on Write-Back Caches.
 
- Discussions of Write Back Cache for hardware RAID products
  - On Write Back Caches at Compaq Smart Array 5i Controller Q&A section
"Q12. Does the Smart Array 5i support write-back cache?
A12.   No, Compaq believes that data integrity is the most important feature of any of our array controller products. Write-back cache is vulnerable to power drops. With higher-end Compaq array controllers (e.g., the Smart Array 5300 family), write-back cache is protected by a unique removable battery backed cache daughter board. Since this is a costly feature to implement, standard with higher end array controllers, the Smart Array 5i does not support battery backed write-back cache."
 
  - On disabling Write Back Cache for 3ware RAID controller (pg. 54-55).
"The Escalade ATA RAID Controller gives you a choice of disabling the write cache for your disk arrays. Write cache is used to store data locally on the drive before it is written to the disk, allowing the computer to continue with its next task. Enabling the write cache results in the most efficient access times for your computer system. There may be instances when you always want the computer to wait for the drive to write all the data to disk before going on to its next task. For this case, you must disable the write cache."
 
- Postings on the Internet on write back caches:
  - Discussion of write back caches at netbsd.org.
"if you want *real* protection (that is, metadata consistency) you must (on netbsd and linux) disable write cache. using writeback cache on the drive, you're only protected from some things (accidently hit reset, kernel panic). you're not protected from power failure. i have a ups, but i still disable write cache. a ups can fail, and a machine's psu can fail as well."
  - Article on ReiserFS tuning and how to work with disk write back cache.
"If you have an UPS, enable write caching by default, and configure your UPS daemon to automatically disable write caching when a power failure occurs. "
  - Article on ReiserFS by Chris Mason.
"For performance benchmarks, some of the new drives have write-back caching by default. This means the drive reports a write is completed before it is actually on the media. The block is still in the drive's cache, where the writes can be reordered. If this happens, metadata changes might be written before the log commit blocks, leading to corruption if the machine loses power. It is very important to disable write-back caching on both IDE and SCSI drives.

Some hardware RAID controllers provide a battery-backed write-back cache that preserves the cache contents if the system loses power. These should be safe to use, but the cache battery should be checked often. A dramatic performance increase can be seen with these write caches, especially for log intensive applications like mail servers. "

  - Post from Echostar on disk write back cache for set top device.
"... when talking to drive manufacturers, we are told that if the write cache is disabled, the life of the drive is substantially reduced... In our application, (consumer set top box) we cannot always cleanly shut down the system."

 

Acknowledgements

We would like to thank members of the Linux community, in particular Hans Reiser and Stephen Tweedie, for helping us understand write back cache issues and file system corruption.

 

Last update: October 27, 2003. Copyright © 2003 Boon Storage Technologies, Inc. All Rights Reserved.