Friday, December 03, 2010

Religion in IT

Joerg posted:
Interesting statement in a searchdatacenter article about the IDC numbers
“When you sell against Dell, you sell against price. When you sell against HP, you sell against technical stuff -- the feeds and speeds. When you're up against IBM, you're not selling against boxes but against solutions or business outcomes that happen to include hardware. But, when you get to the Sun guys, it's about religion. You can't get to those guys. One guy told me last year that he would get off his Sun box when he dies.”

Thursday, December 02, 2010

Linux, O_SYNC and Write Barriers

We all love Linux... though sometimes it is better not to look under the hood, as you never know what you might find.

I stumbled across a very interesting discussion on a Linux kernel mailing list. It is dated August 2009 so you may have already read it.

There is a related RH bug.

I'm a little surprised by RH's attitude in this ticket. IMHO they should have fixed it, perhaps providing a tunable to enable or disable the new behavior, instead of keeping the broken implementation. But at least recent man pages clarify it in the Notes section of open(2):
"POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."

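To make the distinction concrete, here is a minimal sketch of my own (it is not from the man page or the bug report). The helper names write_synchronized and dsync_aliases_sync are made up for illustration. It only shows how an application would request each flavour of synchronized write and how to check whether the libc in use gives O_DSYNC and O_SYNC the same numerical value; what actually reaches the platter still depends on the kernel, the file system and the drive.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose O_DSYNC and the POSIX I/O calls */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if this libc maps O_DSYNC to the same value as O_SYNC
 * (the situation the man page note describes), 0 otherwise. */
int dsync_aliases_sync(void)
{
    return O_DSYNC == O_SYNC;
}

/* Hypothetical helper: writes len bytes to path with the given
 * synchronization flag (O_SYNC or O_DSYNC) set at open time.
 * Returns 0 on success, -1 on any error or short write. */
int write_synchronized(const char *path, const void *buf, size_t len,
                       int sync_flag)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | sync_flag, 0644);
    if (fd < 0)
        return -1;

    /* With O_SYNC/O_DSYNC the write is supposed to return only after
     * the data (and the relevant metadata) is on stable storage. */
    ssize_t written = write(fd, buf, len);
    int rc = (written == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling write_synchronized("test_file", "abc", 3, O_DSYNC) asks only for data integrity, while passing O_SYNC asks for full file integrity; if dsync_aliases_sync() returns 1, the two calls are indistinguishable to the kernel.
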
Then there is another even more interesting discussion about write barriers:
"All of them fail to commit drive caches under some circumstances;
even fsync on ext3 with barriers enabled (because it doesn't
commit a journal record if there were writes but no inode change
with data=ordered)."
and also this one:
"No, fsync() doesn't always flush the drive's write cache. It often
does, and I think many people are under the impression it always does, but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);

while (1) {
    char byte = 0;
    usleep (100000);
    pwrite (fd, &byte, 1, 0);
    fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more. That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals. A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance. I'm not sure if ordered requests are actually implemented
by any drivers at the moment. If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually need."

This is really scary. I wonder how many developers knew about this, especially those writing Linux applications where data safety is paramount. Sometimes it feels as though some Linux developers code to win benchmarks and do not necessarily care about data safety, correctness, or standards like POSIX. Worse, some of them don't even bother to mention it in the official documentation (at least the O_SYNC/O_DSYNC issue is now documented in the man page).
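
If you want to repeat the quoted experiment yourself, a bounded version of the loop may be handier than the endless one. The following is only a sketch under my own assumptions: run_sync_loop, its parameters and the sync_mode enum are names I made up, and the function does not count flushes itself. You still need to watch the device with a separate I/O tracing tool to see how many cache flushes are issued.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose pwrite, fdatasync, nanosleep */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

enum sync_mode { SYNC_FSYNC, SYNC_FDATASYNC };

/* Hypothetical helper: repeats "sleep, write one byte at offset 0,
 * optionally dirty the inode, sync" for a fixed number of iterations.
 * Returns the number of iterations that completed, or -1 if the file
 * could not be opened. */
int run_sync_loop(const char *path, int iterations, long delay_us,
                  int dirty_inode, enum sync_mode mode)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    const char byte = 0;
    int completed = 0;

    for (int i = 0; i < iterations; i++) {
        if (delay_us > 0) {
            struct timespec ts;
            ts.tv_sec = delay_us / 1000000L;
            ts.tv_nsec = (delay_us % 1000000L) * 1000L;
            nanosleep(&ts, NULL);
        }

        if (pwrite(fd, &byte, 1, 0) != 1)
            break;

        /* The fchmod pair from the quoted message: it changes the
         * inode so that fsync() has something to journal. */
        if (dirty_inode &&
            (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0))
            break;

        int rc = (mode == SYNC_FDATASYNC) ? fdatasync(fd) : fsync(fd);
        if (rc != 0)
            break;

        completed++;
    }

    close(fd);
    return completed;
}
```

Something like run_sync_loop("test_file", 100, 100000, 0, SYNC_FSYNC) roughly corresponds to the loop in the quoted message, passing 1 for dirty_inode gives the fchmod variant, and SYNC_FDATASYNC lets you compare against fdatasync() on the same file system.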