Attached are four PDF files showing graphs of the output of vmstat 1. During these tests the machine's in init 3, with a minimal install of RHEL AS 4 update 5 using default partitioning (small /boot, 2GB swap, large / on LVM, ext3) as suggested by the installer. There are test runs for both the SMP and the nonSMP kernels.
Write cache on the card is switched on, the drives are in 3Gb/s SATA II mode with NCQ enabled and queuing enabled on the card itself. The firmware's the latest (codeset 220.127.116.11 from 3Ware, fw FE9X 3.08.02.005), as is the driver (2.26.05.007) and the card's got 128MB installed on it. The RAID 1 array has been initialised and verified with 3dm2.
The test commands are:
Read test: sync; time -p `dd if=/dev/sda of=/dev/null bs=1M count=X`
... where X is 3072, 4096, 6162 and 20480 ie approx 3G, 4G, 6G and 20G of data
Write test: sync; time -p `dd if=/dev/zero of=[filename] bs=1M count=X`
... where X is as above, and [filename] is on / partition.
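For anyone wanting to turn the timed runs into throughput figures, here's a rough Python sketch. It assumes the POSIX-style output of `time -p` (a line starting with "real" followed by seconds) and bs=1M as in the commands above. The function names are my own and the numbers in the usage lines are made up for illustration, not taken from my test runs.

```python
def parse_time_p(output):
    """Return the 'real' (wall-clock) seconds from `time -p` output."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "real":
            return float(fields[1])
    raise ValueError("no 'real' line found in time -p output")


def throughput_mib_s(count, real_seconds, bs_mib=1):
    """MiB/s for a dd run of `count` blocks of `bs_mib` MiB each."""
    return count * bs_mib / real_seconds


# Illustrative only: a made-up timing for a count=4096 run.
example_output = "real 100.00\nuser 0.02\nsys 9.81\n"
seconds = parse_time_p(example_output)
print("%.2f MiB/s" % throughput_mib_s(4096, seconds))
```

Bear in mind the write figure only covers what dd has handed to the page cache by the time it exits, unless a sync is included in the timed section.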
vmstat's "blocks in" comes in typically at around 80000 blocks/s, but take a look at the "blocks out" graphs in the attached PDFs - they're extremely 'bursty' - anything up to 1,000,000+ blocks/s at times, followed by long periods of nothing at all.
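If you'd rather not graph this in a spreadsheet, here's a minimal Python sketch that pulls the bo column out of saved `vmstat 1` output and reports the silent stretches between bursts. It assumes the usual two-line vmstat header with short column names (r, b, ..., bi, bo, ...) and looks columns up by name rather than position. The sample data is invented to show the shape of the problem; it is not copied from my runs.

```python
def parse_vmstat(text):
    """Return one dict per sample line, keyed by vmstat's own column names."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Column-name header; vmstat repeats it periodically.
            columns = fields
            continue
        if columns is None or len(fields) != len(columns):
            continue  # e.g. the "procs ---memory---" banner line
        if not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(columns, map(int, fields))))
    return rows


def stalls(rows, column="bo", min_len=5):
    """Return (start_index, length) for runs of zeros at least min_len long."""
    runs = []
    start = None
    for i, row in enumerate(rows):
        if row[column] == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    if start is not None and len(rows) - start >= min_len:
        runs.append((start, len(rows) - start))
    return runs


# Invented sample: one big burst, three seconds of nothing, another burst.
SAMPLE = """procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 100000  20000 300000    0    0     0 900000 1200  300  1 20 50 29
 0  4      0  90000  20000 310000    0    0     0     0 1100  200  0  1  0 99
 0  4      0  90000  20000 310000    0    0     0     0 1100  200  0  1  0 99
 0  5      0  90000  20000 310000    0    0     0     0 1100  200  0  1  0 99
 1  1      0  95000  20000 305000    0    0     0 850000 1300  350  1 22 40 37
"""

rows = parse_vmstat(SAMPLE)
print("peak bo:", max(r["bo"] for r in rows))
print("stalls (start, length):", stalls(rows, "bo", min_len=3))
```

With a one-second interval the run length is roughly the stall in seconds, so a min_len of 30 or 60 would pick out the really painful gaps.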
Well, that's fine, you might think - all that IO's been handed off to the 3Ware card for processing so it's only to be expected that enormous chunks get thrown at it periodically and it just munches through it for a bit, updating the RAID 1 array, before asking for some more.
Trouble is, during these periods of not much going on the system performance in terms of responsiveness - 'feel', as it were - goes off a cliff with as much as a minute between typing 'ls' in a small directory and getting back any output.
Loadave heads ever upwards - reaching 12+ in some instances (and before anyone starts telling me loadave is calculated differently in 2.6 kernels, I've read that debate and understand it's not the best indicator of what's going on), with processes like pdflush, kjournald and kswapd hanging around for ages in D state (uninterruptible sleep).
I've seen up to 8 pdflush processes like this during the 20G run, and believe me this has a major impact on things.
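To keep an eye on this without squinting at ps, something along these lines lists whatever is currently in D state. It's only a sketch, assuming a Linux /proc where each /proc/<pid>/stat line has the form "pid (comm) state ...". The helper names are my own, and the sample line in the usage is invented.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm sits in parentheses and may itself contain spaces or ')',
    # so find the LAST ')' instead of splitting on whitespace.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes():
    """Return a sorted list of (pid, comm) for processes in D state."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away between listing and reading
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


# Invented example line, just to show the parsing.
print(parse_stat("153 (pdflush) D 1 0 0 0 -1 8388672 0 0"))

for pid, comm in blocked_processes():
    print(pid, comm)
```

Run in a loop once a second alongside the write test, this should show pdflush, kjournald and friends piling up during the quiet periods.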
Where this really starts to bite you is when doing intensive IO with files greater than available RAM. There are some impressive throughput figures for filesizes that fit easily in RAM - naturally, there's plenty of room for the OS to work out its own async IO if it's got room to breathe. Here's a table of the timed tests for the various sizes:
... yes, single processor 3G write throughput is greater than with SMP. Also the 20G writes - though 35MB/s is nothing to write home about. "WD" in the header is a mistake, these are Seagate disks, not Western Digital ones.
Before anyone asks, tweaking setra=16384, nr_requests=512 or the scheduler (deadline vs cfq) in line with 3Ware's tuning hints has no impact on the specific problem - ie the collapse in responsiveness of the entire system under certain intensive IO operations. Neither does jumpering the card slot down to 66MHz from 133MHz, nor does taking LVM out of the picture have any effect.
I've read a ton of stuff over the last week, building up ever more interesting Google searches as I learn more about this - the most fruitful being today, when, having spent a load of time graphing vmstat's output in Excel, I finally Googled 3ware vmstats and came across this very recent post, which I suspect has hit the nail right on the head:
Too many years of awful 3ware performance. - Discussion@SR
I'll probably add to this thread as and when I discover other things, but in the meanwhile if anyone else has come across this performance/responsiveness cliff and worked out how to bypass it, please get in touch.
On with the vmstat graphs - download all the PDFs below and look at what's going on for yourself. Note all those processes in 'b' and the 100% iowait. At one stage in the 20G write test the machine actually runs out of memory and sendmail gets automatically killed (I'm in init 3, remember).
Oh, and finally, why 20G? Well, 3Ware's own "Benchmarking the 3ware 9000 Controller Using Linux Kernel 2.6" document says to use 40x installed RAM - but their test machine's only got 512MB installed. I've got 4GB installed and haven't the patience to wait for a 160GB file to be written out in million block bursts separated by minute-long gaps. I reckon 5x installed RAM's achieving the same (exhausting any cache behaviour).
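The sizing arithmetic, spelled out as a small Python sketch (the helper name is mine, and it assumes bs=1M as in the test commands):

```python
def dd_count_for(ram_gib, multiple, bs_mib=1):
    """dd count= value needed to write `multiple` x RAM in bs_mib-sized blocks."""
    return int(ram_gib * multiple * 1024 / bs_mib)


print(dd_count_for(0.5, 40))  # 3Ware's 512MB box at 40x RAM
print(dd_count_for(4, 40))    # my 4GB box at 40x RAM (the 160GB file)
print(dd_count_for(4, 5))     # my 4GB box at 5x RAM
```

The first and last lines both come out at count=20480, so in absolute terms my 20G run is the same size as 3Ware's own 40x test on their 512MB machine.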
Right, now, really, those graphs... (see below)