Testing storage for corruption bugs
If you are suffering from more than your fair share of silent corruption, then you might have a buggy storage system. Here are a couple of tools that exercise your storage with workloads that are simple and transparent whilst also being quite effective at triggering corruption bugs.
A successful run of either of these tools does not prove the absence of bugs, of course, nor does it say anything about corruptions that only occur when the data has been sitting on disk for an extended period of time. However a failure definitely indicates that your storage is not working as required. It’s always worth running tests like these when commissioning a new system.
Corruption on power loss
System calls like Linux’s fsync() are supposed to guarantee that some data has genuinely been written to durable storage and will still be there even if power is lost and subsequently restored. Durable writes can be expensive, so it’s pretty common to encounter systems which are configured to ignore fsync() calls, which gives the impression of better performance in benchmarks.

Ignoring fsync() calls is obviously very dangerous and will lead to data loss or corruption in a power outage. It’s also a very subtle misconfiguration, since you probably can’t tell whether fsync() is really working or not without a genuine power outage. In particular, if you’re testing VMs it’s unlikely to be enough to “power down” the VM and keep the host running: you really need to abruptly remove power from the host to find bugs of this nature.
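Short of actually cutting the power, you can at least confirm that your application is issuing fsync() calls in the first place. A minimal sketch using strace on Linux, where $PID is assumed to be the process you care about:

$ sudo strace -f -e trace=fsync,fdatasync,sync_file_range -p $PID

Bear in mind this only shows that the calls are being made; it tells you nothing about whether the layers underneath actually honour them.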
The venerable diskchecker.pl script is pretty good at shaking out cases where fsync() is not working as it’s supposed to.
This script does quite a bit of I/O and involves pulling power cords out of things while they’re running, so it’s not a great idea to run it on systems while they’re in production.
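For reference, a run looks roughly like this, assuming a second machine (here called $SERVER) that stays powered up throughout to record what the machine under test claims to have synced; check the script’s own usage message for the exact arguments, and note that test_file and the size are placeholders:

$ diskchecker.pl -l                                # on $SERVER, which stays powered
$ diskchecker.pl -s $SERVER create test_file 500   # on the machine under test
  ... cut power to the machine under test, restore it, reboot ...
$ diskchecker.pl -s $SERVER verify test_file       # on the machine under test again

If the verify step reports errors then writes which fsync() claimed were durable did not survive the power cycle.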
Write-time corruption
The excellent stress-ng tool recently merged a patch which improves its ability to detect corruption introduced at write time. I suggest invoking it something like this:
$ sudo stress-ng --hdd 32 \
    --hdd-opts wr-seq,rd-rnd \
    --hdd-write-size 8k \
    --hdd-bytes 30g \
    --temp-path $MOUNT_POINT \
    --verify
Some notes on the options:
--hdd 32: This sets the number of concurrent processes writing to disk. I suggest setting it higher than the number of cores to force some context-switching, which may help surface more bugs.

--hdd-opts wr-seq,rd-rnd: This says to write each file sequentially from start to finish and then perform random reads during verification. This most closely matches Lucene’s access pattern, although Lucene may use mmap() for reads instead of read() and friends. It’s also reasonable to use wr-seq,rd-seq.

--hdd-write-size 8k: Lucene writes in 8kB chunks by default.

--hdd-bytes 30g: This says how large a file each process should write. The total size of all the files written must of course not exceed the space on the filesystem, but should be large enough that the reads cannot all be served from pagecache. In this example there are 32 processes each writing 30GB of data, which adds up to about 1TB. You can also try dropping the pagecache every now and then during the test with echo 3 | sudo tee /proc/sys/vm/drop_caches, as sketched after this list.

--temp-path $MOUNT_POINT: This says where to create the files used for the test, which must of course be on the filesystem you suspect to be buggy.

--verify: This says to check that the data read back matches what was written, which is required to detect silent corruption. Without this option the test still goes through the motions of reading and writing data but only reports a failure if any of the I/O actually fails.
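If you do want to drop the pagecache periodically while the test is running, a simple loop in a second terminal is enough; the five-minute interval here is an arbitrary choice:

$ while true; do echo 3 | sudo tee /proc/sys/vm/drop_caches; sleep 300; done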
This is a pretty aggressive test, so it’s not a good idea to run it on systems while they’re in production. By default it will run for 24 hours, which I think is reasonable.
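If 24 hours doesn’t fit your schedule, the duration can be changed with stress-ng’s --timeout option, which accepts suffixes like m, h and d; the four hours here is just an example:

$ sudo stress-ng --hdd 32 --hdd-opts wr-seq,rd-rnd --hdd-write-size 8k \
    --hdd-bytes 30g --temp-path $MOUNT_POINT --verify --timeout 4h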
Make sure that you’re using at least version 0.12.01 of stress-ng too: older versions did not test things so thoroughly.
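You can check which version you have installed with:

$ stress-ng --version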