Inspecting disk IO performance with fio

36909

 

Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. The potential for your hard drive to be your system’s performance bottleneck makes knowing how fast your disks and filesystems are and getting quantitative measurements on any improvements you can make to the disk subsystem important. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration.

 

 

To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.

The hdparm tool isn’t the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.

fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.

fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile from source with the make; sudo make install combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks can be issued in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files with an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don’t specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation is shown below.


$ cat random-read-test.fio 
; random read of 128mb of data

[random-read]
rw=randread
size=128m
directory=/tmp/fio-testing/data

$ fio random-read-test.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MiB)
Jobs: 1 (f=1): [r] [100.0% done] [  3588/     0 kb/s] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=30598
  read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec
    clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02
    bw (KiB/s) : min=  227, max= 5275, per=100.12%, avg=865.00, stdev=362.99
  cpu          : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=32768/0, short=0/0
     lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05%
     lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94%
     lat (msec): 100=0.37%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec

Disk stats (read/write):
  dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%

 

fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.

 

The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and it being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The CPU line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds between when the IO request was issued and when the result of that request was reported. The latencies are reported as intervals, so the 4=12.80%, 10=44.96% section reports that 44.96% of requests took more than 4 (the previous reported value) and up to 10 milliseconds to complete.

 

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes to all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of output shows, the single thread has an 864Kbps aggregate bandwidth (aggrb) which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is not friendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should be getting much closer to the speeds that hdparm reports (typically around 40-60Mbps).

 

I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode with the possibility, based on the iodepth parameter, of eight requests for asynchronous IO being issued and not fulfilled because the system had to wait for disk IO at any point in time. The choice of allowing up to only eight IO requests in the queue was arbitrary, but typically an application will limit the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not returned data to the application during the course of execution. The figures are reported for intervals from the previous figure; for example, the 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while, based on 4=4.0%, 4% of the time there were only three or four requests in the queue.

 


$ cat random-read-test-aio.fio 
; same as random-read-test.fio 
; ...
ioengine=libaio
iodepth=8
direct=1
invalidate=1

$ fio random-read-test-aio.fio
random-read: (groupid=0, jobs=1): err= 0: pid=31318
  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
    bw (KiB/s) : min=    0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
...
  IO depths    : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
...
Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec

 

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.

 

The HOWTO file from the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and or writes in many combinations. The ioengine parameter can select how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache to be invalidated for a file before beginning the benchmark. The runtime specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on data that is being read from disk. fsync=n can be used to issue a sync call after every n writes issued. write_iolog and read_iolog cause fio to write or read a log of all the IO requests issued. With these commands you can capture a log of the exact IO commands issued, edit that log to give exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.

 

Simulating servers

 

You can also specify multiple threads or processes to all submit IO work at the same time to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing their own IO loads to the system, all running at the same time. I’ve based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread is to simulate writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.

 


$ cat four-threads-randio.fio
; Four threads, two query, two writers.

[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1

[bgwriter]
rw=randwrite
iodepth=32

[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3

[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5

[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m

$ fio four-threads-randio.fio
bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
Starting 4 processes

bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
    bw (KiB/s) : min=    0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
  cpu          : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
     issued r/w: total=0/65536, short=0/0
     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
     lat (msec): 100=4.74%, 250=1.86%
queryA: (groupid=0, jobs=1): err= 0: pid=3242
  read : io=256MiB, bw=589MiB/s, iops=147K, runt=   445msec
    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
  cpu          : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
     lat (usec): 250=0.01%

queryB: (groupid=0, jobs=1): err= 0: pid=3243
  read : io=256MiB, bw=455MiB/s, iops=114K, runt=   576msec
    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
  cpu          : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
     lat (usec): 250=0.01%, 500=0.01%

bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
    bw (KiB/s) : min=    0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
    bw (KiB/s) : min=    0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
  cpu          : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
     issued r/w: total=4087/4105, short=0/0
     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
     lat (msec): 100=19.07%, 250=4.72%

Run status group 0 (all jobs):
   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec

Disk stats (read/write):
  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%

 

As one would expect, the bandwidth the array achieved in the query and writer processes was vastly different. Queries are performed at about 500Mbps while writing comes in at 1Mbps or 7.5Mbps depending on whether it is read/write or purely write performance respectively. The IO depths show the number of pending IO requests that are queued when an IO request is issued. For example, for the bgupdater process, nearly 1/4 of the async IO requests are being fulfilled with eight or less requests in the queue of a potential 16. In contrast, the bgwriter has more than half of its requests performed with 16 or less pending requests in the queue.

 

To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.

 


bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
queryA: (groupid=0, jobs=1): err= 0: pid=14964
  read : io=256MiB, bw=160MiB/s, iops=39,888, runt=  1643msec
queryB: (groupid=0, jobs=1): err= 0: pid=14965
  read : io=256MiB, bw=163MiB/s, iops=40,680, runt=  1611msec
bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec
   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec

 

The vast array of ways that fio can issue its IO requests lends it to benchmarking IO patterns and the use of various APIs to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level will make to performance.

 

Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.