Filesystems: Data Preservation, fsync, and Benchmarks Pt. 1

The risk of data loss after a system crash on the ext4 filesystem has recently received wide coverage (LWN, Theodore Ts’o’s blog, Slashdot). Many users expressed the opinion that ext4 should be no more prone to losing data on a system crash than ext3 was.

The opposing opinion is that developers and users should not rely on ext3 as the sole benchmark of filesystem semantics, highlighting that users might be running XFS, JFS or any number of the filesystems that ship with the Linux kernel — choice is good. It has also been highlighted that even if all of these filesystems in the Linux kernel implement the rename semantics that are widely desired, it does not help applications work correctly on other platforms like MacOS, BSD and Solaris.

A patch is queued for kernel 2.6.30 that gives ext4 similar semantics to ext3 on system crash. But the general issue of fsync before close, and how user space applications can achieve data consistency in the face of power loss without sacrificing performance, is still open for discussion. After all, as much as many love the extX series of filesystems, there are those who would prefer that applications work properly on many filesystems and platforms.

The core issue, as discussed on Theodore Ts’o’s blog, is a method that many applications use to ensure that a file update is atomic. The common solution is to create a new file, write the new byte contents to it, close it, and rename it over the old file. The expected outcome is that if the system crashes during this procedure, either the old file remains completely intact, or the new file in its entirety has replaced the old file, including the complete byte contents of the new file.
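The sequence above can be sketched in C roughly as follows. This is a minimal illustration, not code from any particular application: the function name replace_file and the tmp_path argument are made up for the example, and the temporary file is assumed to live on the same filesystem as the target. Note that there is no fsync anywhere in it, which is exactly the situation the debate is about.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Replace `path` using create / write / close / rename.
 * `tmp_path` is a hypothetical scratch name on the same filesystem.
 * No fsync is issued, so the new bytes may still sit only in the
 * page cache when rename() returns.
 */
int replace_file(const char *path, const char *tmp_path,
                 const char *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, data, len) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }
    /* rename(2) swaps the name atomically as seen by running processes. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A caller would use it as replace_file("settings.conf", "settings.conf.tmp", buf, len), where both file names are placeholders.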

The thorn in the side of this expectation of rename is the caching that is employed to obtain better performance. A filesystem normally caches updates and writes to files and then performs a group of these writes at once. If a file is being written to in little chunks by an application it is much faster to cache that data in RAM and write it out in one go every X seconds. Once you’ve paid the price of a disk head seek you might as well try to write as much data sequentially as you can.

The problem with ext4 that caused the whole fsync debate was that it would cache the byte contents of the new file and might perform the updates to the file name metadata before the new byte contents were committed to disk. While this may seem counterintuitive to application developers, there are performance gains to be had by writing out the data in an order that suits the filesystem. And as the application has not requested that the byte contents of the new file be written to disk (using fsync), it is perfectly valid for the filesystem implementation to decide in what order it wants to commit writes to disk. As ext4 is extent based, it tries to delay not only the writing of data but the very allocation of disk space to write that data into. By delaying allocation, ext4 can allocate a single extent when it writes the entire file to disk, thus reducing file fragmentation.

The implicit expectation of application developers is that when a rename(2) call is made to rename the file at oldpath to have newpath, the kernel will make sure that the file at oldpath will have its contents completely written to disk before making the name change permanent on disk. Numerous readers have pointed out that this is effectively a write barrier: do not update the filename metadata until all the bytes that were written to oldpath have been safely delivered to disk.

Assume again a rename call that moves oldpath to newpath. If a filesystem does not implement rename by flushing out the contents of the file at oldpath before updating the file name information then, after a crash, you might see the file at newpath as an empty file. The data your application wrote to oldpath was cached and didn’t get a chance to be written to disk before the power loss, but the metadata was updated so that oldpath replaced newpath. You no longer have the contents of either file, just a zero-byte file at newpath.

There are two sides to the system call interface: the developers of applications and the developers of kernel filesystems. Looking at the man page of fsync(2) you’ll notice that the sole parameter to the system call is the file descriptor you want to ensure is synced to disk. Throwing around rough numbers, you might expect a disk to have a sequential write speed of 100 megabytes per second (0.1 megabytes per millisecond) and a seek time of maybe 10 milliseconds. So if you as an application developer have updated 100 kilobytes of data, you might expect an fsync(2) to take about 11 milliseconds (one seek plus 1 millisecond of transfer time) to complete. Even if you wrote a few megabytes of data, you would hope that the fsync would return to you within about 30-40 milliseconds. Both of these are hardly a noticeable delay even if the fsync is done in the main GUI thread of your application.
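The back-of-the-envelope estimate can be written down as a tiny function. The constants are the rough figures quoted above, not measurements, and the name estimated_fsync_ms is made up for this sketch; it also assumes decimal units (1 megabyte = 1,000,000 bytes) and a single seek.

```c
/* Rough figures from the text; illustrative, not measured. */
#define SEEK_MS       10.0      /* one disk head seek, in milliseconds */
#define BYTES_PER_MS  100000.0  /* 100 MB/s sequential = 0.1 MB per ms */

/* Estimated fsync time: one seek plus sequential transfer time. */
double estimated_fsync_ms(double bytes)
{
    return SEEK_MS + bytes / BYTES_PER_MS;
}
```

With these numbers, 100 kilobytes comes out at 11 ms, 2 megabytes at 30 ms and 3 megabytes at 40 ms, which matches the ranges given above.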

One issue that complicates the calculations above is that a SATA disk may cache the writes that a kernel filesystem tells it to do. In order to ensure that the disk has transferred the data to the spinning platter instead of just its internal 8 or 16 MB on-disk cache, the kernel has to issue a special SATA command to flush all the updated data in the disk memory cache to the platters. From reading over the comments on Ted’s blog, it appears that the filesystem cannot just tell the disk to ensure a subset of bytes is on the platters.

Because flushing at the SATA level is such a broad “flush it all”, the filesystem might as well write out some other data at the same time as the fsync(2) is being done. After all, there is a latency to be paid for ensuring that data is committed, and if you have to ensure that the entire disk cache is flushed to the platters you might as well take advantage of the SATA flush to put out some other data.

One of the other points of contention about using fsync is that on an ext3 filesystem, data other than that belonging to your file descriptor is likely to be written to disk too. This is not just a little bit of data: an fsync(2) effectively has the price of a generic sync(2) call on ext3, which is not really what the developer expects. This caused things like the infamous Firefox issue where it would call fsync frequently, but when you were compiling a kernel on the same ext3 filesystem, that fsync also flushed out cached file writes for the compilation process, which caused Firefox to wait for long periods of time for fsync(2) to complete.
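One way to see how far a real fsync strays from the rough estimate is simply to time it. The sketch below is an assumption-laden probe, not a benchmark: the function names are invented for the example, it measures wall-clock time with CLOCK_MONOTONIC, and the result depends heavily on what else is dirty in the cache at that moment.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Time a single fsync() on an already-open descriptor.
 * Returns elapsed wall-clock milliseconds, or -1.0 on error.
 */
double timed_fsync_ms(int fd)
{
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    if (fsync(fd) != 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

/*
 * Append len bytes to `path` (a hypothetical scratch file) and time
 * the fsync that follows. Returns milliseconds, or -1.0 on error.
 */
double timed_append_fsync_ms(const char *path, const char *data, size_t len)
{
    double ms = -1.0;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1.0;

    if (write(fd, data, len) == (ssize_t)len)
        ms = timed_fsync_ms(fd);

    close(fd);
    return ms;
}
```

Running timed_append_fsync_ms on an idle machine and again during a large build on the same filesystem is one way to observe the effect described above.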

The other big issue of adding an fsync into the open, write, close, rename chain is that as a developer you don’t really care if the data is written to disk right away. So you are calling a function that does not return until it explicitly sends your data to disk even though you don’t really need it there right now. All you want to tell the filesystem is you want it to be there before the file rename is made permanent. But there is currently no easy way to convey this message to the kernel. There are many who wince at the liberal addition of fsync calls to force the filesystem to write the data before the metadata instead of because they absolutely need the data on the disk platter right now.
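For reference, this is roughly what the chain looks like with the fsync added between the write and the close. It is a sketch under the same assumptions as any such example: replace_file_fsync and tmp_path are names invented here, and the temporary file must be on the same filesystem as the target. The fsync orders the file data ahead of the rename; it does not by itself guarantee that the rename has reached the disk.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` using open / write / fsync / close / rename.
 * `tmp_path` is a hypothetical scratch name on the same filesystem.
 * Returns 0 on success, -1 on failure (the scratch file is removed).
 */
int replace_file_fsync(const char *path, const char *tmp_path,
                       const char *data, size_t len)
{
    int ok = 1;
    size_t done = 0;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write everything, retrying on short writes. */
    while (ok && done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0)
            ok = 0;
        else
            done += (size_t)n;
    }

    /* Block until the new contents have been handed to the disk. */
    if (ok && fsync(fd) != 0)
        ok = 0;
    if (close(fd) != 0)
        ok = 0;

    /* Only now make the new contents visible under the old name. */
    if (ok && rename(tmp_path, path) != 0)
        ok = 0;

    if (!ok) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The cost is the one described above: the caller waits for the fsync to finish even though all it wanted was ordering.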

This perceived fsync misuse raised the desire for a new fbarrier() call to tell the filesystem you want a write barrier inserted, so everything before the barrier is written to disk before anything after it. The downside of this is that a new system call would have to be exposed not only to C programs but also to everything else, like Perl, shell scripts, emacs lisp etc.

Ensuring that the data of a file is written to disk before the metadata of a rename is sometimes not even a concern. If you are executing a build procedure and the system crashes, you are likely to be able to just “clean” the build environment and rebuild everything once the system is brought up again. Therefore it is argued that you don’t really want rename to implicitly flush data to disk and slow down other currently executing filesystem operations in doing so.

Two ext4 mount options came up in the discussion. auto_da_alloc=0 turns off the implicit flushing of file data before a rename, the behavior that was queued for 2.6.30 to fix the zero-byte file after crash problem. It was mentioned that you might consider auto_da_alloc=0 on laptops without SSDs to prevent a rename from causing the disk to spin up for the immediate flush. The nodelalloc option turns off delayed allocation, giving the same semantics after a crash as ext3’s data=ordered.
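As an illustration only, the two choices might appear in /etc/fstab like this. The device names and mount points are placeholders, and the option that disables the flush on rename is written noauto_da_alloc here, which is the spelling used in the kernel’s ext4 documentation for turning that behavior off.

```
# <device>   <mount point>  <type>  <options>                  <dump> <pass>
/dev/sdXN    /home          ext4    defaults,nodelalloc        0      2
/dev/sdXM    /scratch       ext4    defaults,noauto_da_alloc   0      2
```

The first line trades some performance for ext3-like behavior after a crash; the second accepts the zero-byte file risk on a filesystem whose contents can be regenerated.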

In Part 2 of this series, we’ll examine benchmarks and the price of data consistency.