Monday, December 31, 2007

Give me my 2nd CPU back!

I logged into a server that was put into production recently and has been up for over 100 days now. It's an HP DL380 G4 running Solaris 10 11/06. The graphs showed 50% CPU utilization in system (kernel) time for the last 2 weeks, and mpstat showed CPU 1 at 100% in SYS (kernel). The result was that CPU 1 was not available to applications. As there's not much load yet, prstat didn't show anything consuming CPU. I quickly checked a different server with exactly the same spec and everything was fine there. Nothing in the logs either.

So let's dtrace! :)

# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 419 218 191 0 0 24 0 25 0 0 0 100
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 10 0 0 649 309 347 0 0 25 0 174 0 2 0 98
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 2 0 0 651 313 366 11 0 22 0 193 8 2 0 90
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 870 453 563 0 0 26 0 174 0 2 0 98
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0

All we can see on cpu 1 is that there are some interrupts but no csw/icsw/syscl/migr at all.
It looks like it's stuck entirely in kernel.
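
The same check can be automated. Below is a minimal sketch, not something I ran back then: a hypothetical helper `stuck_cpus` that reads mpstat output and reports CPUs that sit at 100% sys with no context switches and no system calls in every sample. The function name and the threshold are my own assumptions.

```python
def stuck_cpus(mpstat_text, sys_threshold=100):
    """Return CPU ids with sys >= threshold and zero csw/syscl in every sample."""
    samples = {}   # cpu id -> list of parsed rows
    header = None
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":          # mpstat repeats the header per interval
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue                    # skip anything that is not a data row
        row = dict(zip(header, (int(f) for f in fields)))
        samples.setdefault(row["CPU"], []).append(row)
    return sorted(
        cpu for cpu, rows in samples.items()
        if all(r["sys"] >= sys_threshold and r["csw"] == 0 and r["syscl"] == 0
               for r in rows)
    )


# Two intervals copied from the mpstat output above.
SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 419 218 191 0 0 24 0 25 0 0 0 100
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 10 0 0 649 309 347 0 0 25 0 174 0 2 0 98
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0
"""

print(stuck_cpus(SAMPLE))
```

Requiring the condition in every sample avoids flagging a CPU that is merely busy in the kernel for a single interval.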
To confirm that nothing from user space is actually calling some system calls:

# dtrace -n sched:::on-cpu'/cpu == 1/{self->t=timestamp;}' \
-n sched:::off-cpu'/self->t/{@[execname,pid]=sum(timestamp-self->t);self->t=0;}' \
-n tick-2s'{printa(@);}'
dtrace: description 'sched:::on-cpu' matched 3 probes
dtrace: description 'sched:::off-cpu' matched 3 probes
dtrace: description 'tick-2s' matched 1 probe
0 42146 :tick-2s

0 42146 :tick-2s



So the scheduler put nothing on CPU 1.

Let's check what's running in the kernel on CPU 0 first (for comparison with CPU 1 later):

# dtrace -n profile-101'/cpu == 0/{@[stack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
dtrace: description 'profile-101' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
0 42148 :tick-5s







So 99% of the time it's idle.
Now let's do the same for CPU 1.

# dtrace -n profile-101'/cpu == 1/{@[stack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
dtrace: description 'profile-101' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
0 42148 :tick-5s







The output above confirms we're stuck in some interrupt, probably related to USB.

[checking for usba_vlog function]

# dtrace -n fbt:usba:usba_vlog:entry'{@[stringof arg3]=count();}' -n tick-5s'{exit(0);}'
dtrace: description 'fbt:usba:usba_vlog:entry' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
0 42148 :tick-5s

uhci_intr: Controller halted 720725

    /*
     * This should not occur. It occurs only if a HC controller
     * experiences internal problem.
     */
    if (intr_status & USBSTS_REG_HC_HALTED) {
            USB_DPRINTF_L2(PRINT_MASK_INTR, uhcip->uhci_log_hdl,
                "uhci_intr: Controller halted");
            cmd_reg = Get_OpReg16(USBCMD);
            Set_OpReg16(USBCMD, (cmd_reg | USBCMD_REG_HC_RUN));
    }

I'm not sure how bad it is but it doesn't look good.

Let's look around for USB.

# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
usb0/1 unknown empty unconfigured ok
usb0/2 unknown empty unconfigured ok
usb1/1 unknown empty unconfigured ok
usb1/2 unknown empty unconfigured ok
usb2/1 unknown empty unconfigured ok
usb2/2 unknown empty unconfigured ok
usb3/1 unknown empty unconfigured ok
usb3/2 unknown empty unconfigured ok
usb4/1 unknown empty unconfigured ok
usb4/2 unknown empty unconfigured ok
usb4/3 unknown empty unconfigured ok
usb4/4 unknown empty unconfigured ok
usb4/5 unknown empty unconfigured ok
usb4/6 unknown empty unconfigured ok
usb4/7 unknown empty unconfigured ok
usb4/8 unknown empty unconfigured ok

# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 419 218 191 0 0 24 0 25 0 0 0 100
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 10 0 0 666 311 363 3 0 17 0 176 0 2 0 98
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 678 315 360 0 0 23 0 159 0 0 0 100
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0

Didn't help.

# modinfo |grep -i usb
83 fffffffff01c1000 14000 56 1 ehci (USB EHCI Driver 1.14)
84 fffffffff01d5000 28dd8 - 1 usba (USBA: USB Architecture 2.0 1.59)
85 fffffffff01fd000 c6d8 58 1 uhci (USB UHCI Controller Driver 1.47)

# devfsadm -vC
# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 419 218 191 0 0 24 0 25 0 0 0 100
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 10 0 0 664 315 363 0 0 23 0 181 0 0 0 100
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 947 501 694 1 0 26 0 279 0 2 0 98
1 0 0 0 2 0 0 0 0 51 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 650 307 356 0 0 25 0 175 1 1 0 98
1 0 0 0 2 0 0 0 0 50 0 0 0 100 0 0

Still didn't help.

# modunload -i 84
can't unload the module: Device busy
# modunload -i 85
can't unload the module: Device busy
# modunload -i 83
can't unload the module: Device busy
# modunload -i 84
can't unload the module: Device busy
# modunload -i 85
can't unload the module: Device busy
# cfgadm -al
cfgadm: Configuration administration not supported

# modinfo |grep -i usb
83 fffffffff01c1000 14000 56 1 ehci (USB EHCI Driver 1.14)
84 fffffffff01d5000 28dd8 - 1 usba (USBA: USB Architecture 2.0 1.59)
85 fffffffff01fd000 c6d8 58 1 uhci (USB UHCI Controller Driver 1.47)

# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 419 218 191 0 0 24 0 25 0 0 0 100
1 0 0 0 2 0 0 0 0 49 0 0 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 737 453 337 1 40 5 0 187 0 1 0 99
1 10 0 0 164 0 323 1 36 6 0 110 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 526 310 221 0 26 0 0 102 0 0 0 100
1 0 0 0 92 0 149 0 27 0 0 64 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 520 308 203 0 21 0 0 93 1 0 0 99
1 0 0 0 100 0 168 1 29 3 0 90 0 1 0 99

# dtrace -n fbt:usba:usba_vlog:entry'{@[stringof arg3]=count();}' -n tick-5s'{exit(0);}'
dtrace: description 'fbt:usba:usba_vlog:entry' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
1 42148 :tick-5s


That's much better - both CPUs are IDLE now.
Looks like some problem with USB device/driver.

ps. I wonder what intrstat(1M) would show - "unfortunately" the problem is gone...

Wednesday, December 12, 2007

Simple Java Math Performance on Niagara 2

Yet another test. This time I ran a simple Java program which does very basic arithmetic calculations - see details at the bottom of this post. Nothing really sophisticated.

javac -g:none ; java -server Benchmark

v440 4x1GHz USIIIi, Solaris 10U4, Java 1.5.0_12
T2000 8-core 1.2GHz Niagara-1, Solaris 10U4, Java 1.5.0_12
Niagara-2 8-core 1.4GHz, Solaris 10U4, Java 1.5.0_12

Sysbench CPU test: gcc vs. cc

Yet another test with Sysbench I did some time ago. This time please notice how big the difference is on both platforms when Sysbench is compiled with the Sun Studio 12 compiler (cc) vs. gcc. I guess that tweaking gcc parameters could get it closer...
Anyway, here is a graph.

Please notice that Niagara-2 cc results are there for all N values. With N=32,64,128 the total time is about 0.2-0.3s so you can't see it on the graph.

In the previous test (Sysbench's memory test) there's basically no difference between gcc and cc.

./sysbench --test=cpu --cpu-max-prime=2000 --num-threads=N run

v440 4x1GHz USIIIi
Niagara-2 8-core 1.4GHz
gcc -O2 or -O3 or defaults from configure - the same results
cc -fast

Sysbench Memory Test on Niagara

Another test I did some time ago...

./sysbench --test=memory --memory-block-size=16K --memory-scope=global --memory-total-size=100G --memory-oper=write --num-threads=N run

v440 4x1GHz USIIIi
Niagara-2 8-core 1.4GHz

Tuesday, December 11, 2007

OpenSSL RSA1024 performance on Niagara-2

Niagara-2 8-core 1.4GHz
v440 4x 1GHz USIIIi

Now let's look at CPU usage on both hosts.

As expected - on a 4 CPU v440 with N=4 we're utilizing 100% CPU, while on Niagara-2 we're getting 7x the performance with about 10% CPU utilization.
With N=32 we're getting over 30x the performance of v440 while still utilizing only about 50% of CPU.

Please also note that while the v440 does not have an SSL accelerator I still used PKCS11 - that's because the algorithms used in the PKCS11 libraries produce much better results than the built-in OpenSSL ones (thanks to tuning done by Sun).

Monday, November 12, 2007

Niagara-2 and Oracle

I guess I'm not the only one who asked himself (and Oracle) - what's the factor for Niagara-2 CPU? Is it 0.25 as for Niagara-1 (except 1.4GHz?!) or is it 0.75? The first Oracle guy said it's 0.25... unfortunately they called back and said it is actually 0.75.

That's bad...
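
To see why that hurts, here is a rough sketch of the arithmetic. It assumes that the number of processor licences is cores multiplied by the factor and rounded up, which is how I understand Oracle's per-processor licensing; `licenses_needed` is a hypothetical helper, not anything from Oracle.

```python
import math


def licenses_needed(cores, core_factor):
    """Processor licences for one server: cores x factor, rounded up (assumed rule)."""
    return math.ceil(cores * core_factor)


CORES = 8  # one 8-core Niagara-2 CPU

with_factor_025 = licenses_needed(CORES, 0.25)
with_factor_075 = licenses_needed(CORES, 0.75)

print("factor 0.25:", with_factor_025, "licences")
print("factor 0.75:", with_factor_075, "licences")
print("ratio:", with_factor_075 / with_factor_025)
```

Under that assumption the same 8-core box needs three times as many licences with the 0.75 factor.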

Thursday, November 01, 2007

IRON file systems

A very interesting and informative paper on the reliability (or rather lack of it) of commodity file systems (ext2/3, reiserfs, jfs, xfs, ntfs). Of course we already knew the answer.

Wednesday, October 10, 2007

Evil Marketing

As a techie I'm always stunned by IBM marketing... How many times have I read in IBM's manuals that Solaris supports only 32 LUNs, or why Opterons are bad and Xeons are better (2 years ago), just because they didn't sell them back then? I don't know about you, but I always value honesty. While marketing has its own rules, IBM often crosses the line. Too often for me.
Maybe it's just that 'nobody got fired for buying from IBM' attitude in companies where techies have no say at all in what the company is buying, and those managers are buying IBM marketing.

I don't want to say their HW is crap - honestly, quite often it's good. It's just the most over-hyped HW on the planet, I guess, and their technical documentation is hard to distinguish from marketing material. So you should definitely check all IBM's claims regarding that HW yourself, as you will quite often find it doesn't actually do what they said it would...

I totally agree with the statement below:

Moral: Be VERY VERY CAREFUL when you read big blue.

Tuesday, October 09, 2007

Niagara-2 Servers

Finally Niagara-2 servers have arrived. Check T5120, T5220 and T6320. What I really like about those systems is (except T2 cpu of course) the number of disks you can put as internal disks.

The architecture document is also helpful.


I've been playing with sysbench on Solaris SPARC lately and... it surprised me how big a difference a compiler can make - ok, I haven't tried all the options, but still...
So let's look at the sysbench/cpu test - what is the difference if sysbench is compiled with gcc version 3.4.3 (-O2 and -O3 produce basically similar results) vs. Sun Studio 12 (-fast)?

It is about a 7x performance difference!

But let's look at the memory test, where I get mixed results - with one thread cc produced faster code, with more threads it produced slower code.

It's possible that after tweaking gcc parameters one could get similar performance to the Sun Studio compilers (or vice versa in the memory test); however, people usually tend to compile applications with basic parameters. I wouldn't expect such a big difference between these two compilers with larger applications... but you never know.

I'm not going into any holy compiler wars - it's just that sometimes one can be surprised.

Friday, October 05, 2007

Encrypting ZFS

Thanks to Darren J Moffat we now have alpha support for cryptography in ZFS.
Only bfu archives are provided for now - so if you want something more polished you'll have to wait a bit longer. Now is a good time to provide feedback, comments, etc.

Phase 1 Functionality implemented

  • Per pool wrapping key (DSKEK)
  • per dataset keytype.
    • pwrap: Randomly generated per dataset key wrapped by DSKEK
    • pool: Use DSKEK directly. Will likely NOT be supported in final release.
  • zpool keymgr load|unload|status
    • passphrase & key in file only
  • Per dataset encryption
    • NOTE: use only aes-128-cbc, aes-256-cbc
    • aes-192-cbc is broken
  • Encrypted snapshots
  • Clones "inherit" crypto properties regardless of path hierarchy; clone promotion also works
  • Encryption is a create time only property
  • Encrypted datasets don't mount with 'zfs mount -a' unless key is present
  • pool history records key creation/clone.

Friday, September 28, 2007

Software Crypto Performance

Interesting results - Solaris 10 08/07 (update 4) on v440, no crypto card.

openssl speed rsa1024 -engine pkcs11 -multi N
openssl speed rsa1024 -multi N

Where N was 1 2 4 8

As you can see, even without a crypto card there's a huge performance increase from using pkcs11 - different algorithm implementation and different compiler?

Data Corruption

CERN has published a paper on data corruption in their data center.
Here you can find some commentary and a nice summary of those findings.


Tuesday, September 18, 2007

Friday, September 14, 2007


Build 74 of Nevada integrates support for 4-core SPARC64 CPU:

Solaris Nevada supports Jupiter CPU for OPL platform.
Jupiter CPU is a 4 core variant of the current shipping
Olympus-C CPU, which has 2 cores. Both Olympus-C and
Jupiter has 2 CPU strands per CPU core.

Official name for Jupiter CPU is "SPARC64-VII".
Official name for current Olympus-C CPU is "SPARC64-VI".
Official name for OPL platform is "Sun SPARC Enterprise Mx000".

Sun to Acquire Lustre

That is interesting:

Sun Microsystems Expands High Performance Computing Portfolio With Definitive Agreement to Acquire Assets of Cluster File Systems, Including the Lustre File System

Sunday, September 09, 2007

Solaris 10 Update 4

After some delays it's finally here - Solaris 10 08/07 (update 4).
Some new features in update 4 below - check for all of them in What's New.

  • NSS & nscd enhancements (see Sparks project)
  • iSCSI Target support
    • ZFS built-in iSCSI target support (similar to sharenfs)
  • SATA tagged queuing
  • IP Instances for Zones (separate IP stack for zones)
    • ability to modify routing, packet filtering, network interfaces within a zone
  • Zone's resource limits enhancements
  • Dtrace support in a zone
  • Compact Flash (CF) support - ability to boot Solaris from CF
  • NVidia accelerated gfx driver included out of the box

Monday, August 20, 2007

IBM to distribute Solaris

That's interesting news. First IBM and Sun started to support Solaris x86 on IBM's x86 blade servers, then HP again started to support Solaris x86 on most of its x86 servers, then early this year Intel joined OpenSolaris, and now:

IBM and Sun announced that IBM will distribute the Solaris operating system (OS) and Solaris Subscriptions for select x86-based IBM System x servers and Blade Center servers. Sun President and CEO, Jonathan Schwartz, and IBM Senior Vice President and Group Executive, Bill Zeitler, jointly presented the news during a press conference on August 16, 2007. This news follows Intel's endorsement of Solaris in January. Today, the Solaris OS is supported on more than 820 x86-based platforms and more than 3000 x86-based applications.
Read more.

Also see Jonathan's blog entry on the announcement.

Friday, August 03, 2007

Trusted System with one click?

I've been going through the latest heads-ups for Nevada and found this. In the past we had Trusted Systems, which were entirely separate system installs. Then with Trusted Extensions in Solaris 10U3 you could get a Trusted System by just installing software. Now all that is required is one command... I'm oversimplifying a little here of course, but it still looks like Trusted Systems features are becoming more and more a part of the standard operating system.

DTrace vs. SystemTap

No, no more comparisons - everyone who has actually tried both, even just a little, knows it doesn't really make sense to compare them. SystemTap is just another toy no one is using in production, and for good reasons.
Nevertheless if you are interested in some background of DTrace vs. SystemTap read this blog entry.

Tuesday, July 24, 2007

Recent ZFS enhancements

Some really nice changes have been integrated in the last few builds. Hotplug support is interesting - it will greatly simplify JBOD management with ZFS.

ZFS delegated administration

ZFS makes it practical to give each user a file system. Now what if you want to delegate some ZFS administrative tasks to that user so that, for example, he/she will be able to create sub-filesystems? That's been integrated into b70.
Check this blog entry for more examples.

Wednesday, July 18, 2007

Separate Intent Log for ZFS

The ability to put the ZIL on a separate device was integrated into b68. So if you care about ZFS performance see this blog entry.

I wonder if Sun is going to sell x4500 with some kind of nvram or SSD...

Monday, May 21, 2007

Unix Days - Gdansk 2007

Next week there's going to be the 3rd edition of the Unix Days conference in Poland. We've just opened public registration. Last year we had to close public registration in less than 24 hours, as there are only a little over 200 seats available - not all people attend all days and all lectures, so we can allow about 300 people to register. We were very happy with the last two editions (and thanks to the questionnaires we know you were too) and we hope this one will be even better. This year we decided to extend the conference to 3 days, compared to 2 days last time. This means more presentations to attend. It is also a great opportunity for sysadmins to get to know each other better in real life - especially during the evening party for everyone attending the conference.

See you there!


The conference went well - over 260 people were there - good. You can find presentations here - some of them are in English.

Wednesday, May 09, 2007

NPort ID Virtualization

Aaron wrote:

What do I do all day at the office? Lately, I've been working on adding NPort ID Virtualization (NPIV) to our Leadville FibreChannel stack.

At a high level, you can think of NPIV as allowing one physical FibreChannel HBA to log in multiple times to the SAN, and so you have many virtual HBAs.

Why is this interesting?

The first thing I thought of when I heard about this is hypervisor applications, like Xen. If you have one world wide name per DOMU (in Xen terminology), you can do the same zoning/lun masking that you've always done per server, but this time it's per DOMU.

Another use is that if your HBA breaks, and you have to replace it, you can use the old WWN on the new HBA, and you won't have to rezone your SAN.

More details on NPIV.

Tuesday, April 24, 2007

Windows on Thumper

Sun Microsystems loaned an X4500 to the Johns Hopkins University Physics Department in Baltimore, Maryland to do Windows-SQLserver performance experiments and to be a public resource for services like,, and

This is the fastest Intel/AMD system we have ever benchmarked. The 6+ GB/s memory system (4.5GB/s copy) is very promising.

Read full report.

HW RAID vs. ZFS software RAID - part III

This time I wanted to test software RAID-10 vs. hardware RAID-10 on the same hardware. I used an x4100 server with a dual-ported 4Gb Qlogic HBA directly connected to an EMC Clariion CX3-40 array (both links, each link connected to a different storage processor). The operating system was Solaris 10U3 + patches. For hardware RAID, 4 RAID-10 groups were created, each made of 10 disks (40 disks in total), and each group was presented as a single LUN. So there were 4 LUNs, 2 on one storage processor and 2 on the other, and a ZFS striped pool was created over all 4 LUNs for better performance. For software RAID the same disks were used, but each disk was presented by the Clariion as an individual disk - 20 disks from one storage processor and 20 from the other. Then one large RAID-10 ZFS pool was created (with the same disk pairs as with HW RAID). In both cases MPxIO was also enabled.

Additionally I included results for x4500 (Thumper) for comparison.

Before I go to the results, some explanation of the system names in the graphs is needed.

x4100 HW - hardware RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4100 SW - software RAID as described above
x4100 SW/Q - software RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4500 - software RAID-10 pool made of 44 7200 RPM 500GB disks (+2 hot spares +2 root disks)

As you'll see across all results, setting pci-max-read-request=2048 helps to boost results.
Also please notice that doing RAID-10 completely in software means that the host has to write twice as much data to the array as when doing RAID-10 on the array. If enough disks are used and the array itself is not a bottleneck then we'll saturate the links, meaning we should get about half the application streaming write performance with software RAID. In real life, when your application doesn't issue that many writes, it won't be a problem. Of course we're talking only about writes.

Keep in mind that workload parameters were chosen so that the actual workload was much larger than the server's installed memory, to minimize file system caching.

We can observe it in the first graph - HW RAID gives about 450MB/s for sequential writes and software RAID gives about 270MB/s, which is ~60%. Sequential reading, on the other hand, is a little bit better with software RAID.
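
The "about half" expectation is easy to put into numbers. This is only a back-of-the-envelope sketch: it assumes roughly 400MB/s of usable payload per 4Gb FC link (a nominal figure, not something I measured) and that the links are the only bottleneck. `app_write_ceiling` is a hypothetical helper.

```python
def app_write_ceiling(link_mb_s, links, copies_written_by_host):
    """Upper bound on application write throughput when the links are the bottleneck."""
    return link_mb_s * links / copies_written_by_host


LINK_MB_S = 400  # assumption: nominal payload rate of one 4Gb FC link
LINKS = 2        # dual-ported HBA, both links in use

hw_ceiling = app_write_ceiling(LINK_MB_S, LINKS, 1)  # array does the mirroring
sw_ceiling = app_write_ceiling(LINK_MB_S, LINKS, 2)  # host sends both mirror copies

observed_ratio = 270 / 450  # SW vs. HW sequential write results quoted above

print(f"HW RAID-10 ceiling: {hw_ceiling:.0f} MB/s")
print(f"SW RAID-10 ceiling: {sw_ceiling:.0f} MB/s")
print(f"observed SW/HW sequential write ratio: {observed_ratio:.0%}")
```

The hardware RAID result is well below its link ceiling, so something other than the links limits it there, which would explain why software RAID reaches ~60% of it rather than exactly half.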

1. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Notice how good x4500 platform is for sequential reading/writing (locally of course, you won't be able to push it thru network). Additionally if you calculate total storage capacity and price x4500 is a winner here without any doubt by a very long margin.

Now let's see what results we get with a more common mixed workload - lots of files, 32 threads reading, writing, creating, deleting files, etc.

2. filebench - varmail workload, nfiles=100000, nthreads=32, meandirwidth=10000, meaniosize=16384, run 600
zfs set atime=off, recordsize=128k (default)

ZFS software RAID turned out to be the fastest by a small margin. This time x4500 is about 30% slower, which is actually quite impressive (44x 7200 RPM SATA disks vs. 40x 15000 RPM FC disks).

I haven't used IOzone much in the past so I thought it might be a good idea to try it.

3. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Well, x4500 is too good in the above benchmark - I mean it looks like IOzone isn't issuing as random a workload as one might have expected. Software RAID results are also too optimistic. Part of the problem could be that IOzone creates only a few files, but large ones. It behaves more like a database than a file server. That is especially important with ZFS. So let's see what happens if we match ZFS's recordsize to the IOzone record size.

4. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=16k

Now we get much better throughput across all tests. It shows how important it is to match ZFS's recordsize to the db record size in database environments. I was expecting writes to give less throughput with software RAID. A little bit unexpected is how much better the results are with stride reads. This time x4500's results are as expected - worst performer on reads, and great numbers on writes (this is due to ZFS, which transforms most random writes into sequential writes, which is very good for x4500 as we've seen in the first graph).

Let's see what happens if we increase both IOzone record size and ZFS record size to 128k.
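
Why does recordsize matter so much? ZFS checksums whole records, so a small uncached read has to fetch the entire record it falls into. A minimal sketch of the resulting read amplification, assuming aligned, uncached reads from a file larger than one record (`read_amplification` is a hypothetical helper):

```python
def read_amplification(io_size_kb, recordsize_kb):
    """Data fetched from disk per unit of data the application asked for."""
    # A read smaller than the record still pulls in the whole record;
    # a read at least as large as the record fetches only what it needs.
    return max(recordsize_kb, io_size_kb) / io_size_kb


for io_kb, rec_kb in [(16, 128), (16, 16), (128, 128)]:
    factor = read_amplification(io_kb, rec_kb)
    print(f"{io_kb}k reads, recordsize={rec_kb}k: {factor:.0f}x data read from disk")
```

With 16k IOzone records on the default 128k recordsize every random read moves eight times more data than the application asked for, which fits the improvement seen after setting recordsize=16k.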

5. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

It helped a lot as expected. Software RAID-10 was able to practically saturate both FC links.
Increasing block size was also very good for x4500's SATA disks, which excel at sequential reading/writing.

In all IOzone "Random mix" tests we observe x4500 performing suspiciously well - it means that there are probably many more random writes than reads, and because ZFS transforms random writes into sequential writes we got such good results. But it's a good result for ZFS - it means that thanks to ZFS being much less dependent on seek time for random writes we can get some great performance characteristics out of SATA disks.

I believe that if I had been able to connect the disks directly to the host as a JBOD, entirely eliminating the Clariion's storage processors, I would have got even better results, especially in terms of random read/write IOPS. Under heavy load all those storage processors actually do is introduce some additional latency.

In different workloads, especially when you write only from time to time and you're saturating neither your disks nor the array cache, you can potentially benefit from large non-volatile caches in an array. But then with sporadic writes you probably won't notice anyway. In most environments you won't notice an array read cache - you've probably got more memory in your server (or it's much cheaper to add memory to a server than to an array) or your active data set is so much bigger than any array cache that it really doesn't matter whether you have the cache or not.

When you think about it - most entry-level and mid-range arrays are just x86 servers inside. CX3-40 is a 2x 2.8GHz Xeon server...

See also my previous similar tests here, here and here.

So the question is - if you need dedicated storage for a given workload, does it make sense to buy mid-range arrays with storage processors, caches, etc.? Or maybe it's not only cheaper but also better in terms of raw performance to buy FC, SAS or SATA (?) JBODs? I'm serious.

It looks like in many workloads HW RAID will in real life give you less performance for a higher price... but you get all the other features, right? Well, you get clones, snapshots... but you get them built-in with ZFS, and for most workloads ZFS clones and snapshots not only give much better performance but also don't need dedicated disks. Then management is much easier with ZFS than with arrays - especially when you think about the different software needed to manage arrays from different vendors. With ZFS it's all the same - just give it disks... Then ZFS is open source, is already ported to FreeBSD, is being ported to OS X and is free. When was the last time you had to call EMC to reconfigure your array and pay for it too?

Some people are concerned about having enough bus bandwidth when doing SW RAID. First, it's not an issue for RAID-5, RAID-6 and RAID-0, as you have to push through about the same volume of data regardless of where the RAID is actually done. Then look at modern x86, x64 or RISC servers, even low-end ones, see how much IO bandwidth they have and compare it to your actual environment. In most cases you don't have to worry about it. It was a problem 10-15 years ago, not now.
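
To put rough numbers on "about the same volume of data": with software RAID the host sends data plus redundancy over the bus. A small sketch for full-stripe writes, where `host_write_multiplier` is a hypothetical helper and the 10-disk group size is just an example matching the groups used above:

```python
def host_write_multiplier(level, disks):
    """Bytes the host pushes over the bus per byte of application data
    (software RAID, full-stripe writes)."""
    if level == "raid0":
        return 1.0                    # no redundancy at all
    if level == "raid10":
        return 2.0                    # every block is written twice
    if level == "raid5":
        return disks / (disks - 1)    # one parity block per stripe
    if level == "raid6":
        return disks / (disks - 2)    # two parity blocks per stripe
    raise ValueError(f"unknown RAID level: {level}")


for level in ("raid0", "raid5", "raid6", "raid10"):
    print(f"{level:7s} {host_write_multiplier(level, 10):.2f}x")
```

Only RAID-10 doubles the traffic; for parity RAID over 10 disks the overhead is in the 11-25% range.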

When doing RAID in ZFS you also get end-to-end checksumming and self-healing for all of your data.

Now there are workloads where HW RAID actually makes sense. First, RAID-5 for a random-read workload will generally work better on the array than RAID-Z. But it's still worth considering just buying more disks in a JBOD rather than spending all the money on an array.

There are also some features, like remote synchronous replication, which some arrays offer and which are needed in some environments.

The real issue with ZFS right now is its hot spare support and disk failure recovery. Right now it's barely working and it's nothing like what you are accustomed to in arrays. It's being worked on by the ZFS team so I expect it to improve quickly. But right now, if you are afraid of disk failures and you can't afford any downtime due to a disk failure, you should go with HW RAID, possibly with ZFS as a file system. In such a scenario I also encourage you to expose to ZFS at least 3 LUNs made of different disks/RAID groups and do dynamic striping in ZFS - that way ZFS's metadata will be protected.

The other problem is that it is hard to find good JBODs, especially from tier-1 vendors.
Then it's even harder to find large JBODs (in terms of # of disks). It would be really nice to be able to buy a SAS/SATA JBOD with the ability to add many expansion units, with 4-8 ports to servers, supported in cluster configs, etc. Maybe a JBOD with 2.5" SAS disks packed similarly to x4500 - this would give enormous IOPS/capacity per 1U...

ps. and remember - even with ZFS there's no one hammer...

HW details:
System : Solaris 10 11/06 (kernel Generic_125101-04)
Server : x4100M2 2xOpteron 2218 2.6GHz dual-core, 16GB RAM, dual ported 4Gb Qlogic HBA
Array : Clariion CX3-40, 73GB 15K 4Gb disks
X4500 : 500GB 7200 RPM, 2x Opteron 285 (2.6GHz dual-core), 16GB RAM

New SPARC Servers

Sun announced new servers based on new SPARC CPU designed in co-operation with Fujitsu. Those new models are: M4000, M5000, M8000, M9000.

See Richard Elling's blog entry about RAS features of this architecture.
Also see Jonathan Schwartz's post.

Thursday, April 19, 2007

NFS server - file stats

Have you ever wondered which files are accessed most on your NFS server? How well are those files cached? You've got many NFS clients...

We've put new nfs server on Solaris 10, Opteron server, Sun Cluster 3.2, ZFS, etc.
So far only part of production data are served and we see somewhat surprising numbers.

bash-3.00# /usr/local/sbin/ 10 3
[omitting first output]
Time Int rKb/s wKb/s rPk/s wPk/s rAvs wAvs %Util Sat
03:04:20 nge1 0.07 0.05 1.20 1.20 61.50 46.67 0.00 0.00
03:04:20 nge0 0.07 0.05 1.20 1.20 61.50 46.67 0.00 0.00
03:04:20 e1000g1 71.87 0.13 446.22 1.20 164.92 114.83 0.06 0.00
03:04:20 e1000g0 0.34 10117.91 5.40 7120.07 64.00 1455.15 8.29 0.00
Time Int rKb/s wKb/s rPk/s wPk/s rAvs wAvs %Util Sat
03:04:30 nge1 0.08 0.06 1.30 1.30 62.77 47.54 0.00 0.00
03:04:30 nge0 0.08 0.06 1.30 1.30 62.77 47.54 0.00 0.00
03:04:30 e1000g1 69.13 0.14 430.27 1.30 164.53 110.92 0.06 0.00
03:04:30 e1000g0 0.43 9827.54 6.79 6914.19 64.29 1455.47 8.05 0.00

So we have 9-10MB/s being served.

bash-3.00# iostat -xnz 1
[omitting first output]
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device

Well but we do not touch disks at all.
'zpool iostat 1' also confirms that.

Now I wonder what files are we actually serving right now.

bash-3.00# ./rfileio.d 10s

Read IOPS, top 20 (count)
/media/d001/a/nfs.wsx logical 101
/media/d001/a/0410_komentarz_walutowy.wmv logical 712
/media/d001/a/0410_komentarz_gieldowy.wmv logical 3654

Read Bandwidth, top 20 (bytes)
/media/d001/a/nfs.wsx logical 188264
/media/d001/a/0410_komentarz_walutowy.wmv logical 1016832
/media/d001/a/0410_komentarz_gieldowy.wmv logical 96774144

Total File System miss-rate: 0%

In 10 seconds we read ~95MB, which agrees with the 9-10MB/s that nicstat reported. Everything is read as "logical" - that agrees too.
And most important - we now know which files are served!
So it's time to tune nfs clients... :)
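
For the record, here is the arithmetic behind that comparison, using the byte counts from the rfileio.d output above. Just a quick sketch; the only inputs are the three numbers and the 10 second interval.

```python
# Bytes read per file during the 10 second rfileio.d run shown above.
read_bytes = {
    "nfs.wsx": 188264,
    "0410_komentarz_walutowy.wmv": 1016832,
    "0410_komentarz_gieldowy.wmv": 96774144,
}
interval_s = 10

total = sum(read_bytes.values())
mib_per_s = total / interval_s / 2**20
kb_per_s = total / interval_s / 1024   # same unit as nicstat's wKb/s column
top_file, top_bytes = max(read_bytes.items(), key=lambda kv: kv[1])

print(f"total read: {total} bytes in {interval_s}s")
print(f"throughput: {mib_per_s:.2f} MiB/s ({kb_per_s:.0f} KB/s)")
print(f"{top_file} accounts for {top_bytes / total:.1%} of the bytes read")
```

The file-level figure comes out slightly below what the network interface reports, which is what you would expect since the wire also carries protocol overhead.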

You can find rfileio.d script in the DTraceToolkit (although I modified it slightly).

Now imagine what you can do with such possibilities on busier servers. You don't have to guess which files are served most and how well they cache. Using another script, 'rfsio.d', you can break down the statistics by file system. And if you want to customize them you can easily and safely do so, as those scripts are written in DTrace.

Of course all of the above is safe to run in production - that's the most important thing.

Additionally to put it clearly - I did it on nfs server, not nfs clients so it doesn't matter if your clients are *BSD, Linux, Windows, Solaris, ... as long as your nfs server is running Solaris.

Friday, April 06, 2007


Pawel Dawidek has been working on a ZFS port to FreeBSD for some time now. He's just announced that ZFS is integrated into FreeBSD. Congratulations on the hard work!

See threads on Open Solaris and FreeBSD lists.

I'm also very glad that Pawel will be one of the presenters at Unix Days '07. He'll be talking about FreeBSD.

Wednesday, March 28, 2007

Latest ZFS add-ons

ZFS boot support was just integrated (x86/x64 platform for now). It will be available in SXCE build 62.
Yes, we'll be able to boot directly from ZFS - that will definitely make life easier: no more hassle with separate partitions and their sizing, snapshots and clones for /, and much easier live upgrade - those are just some examples. In b62 the installer won't know about ZFS (yet), so some manual fiddling will be required to install the system on ZFS.

Also in b62 gzip compression was integrated into ZFS (in addition to lzjb) thanks to Adam Leventhal. It not only can save you a lot of space transparently to applications but in some workloads it can actually speed up disk access (if there's free CPU, disk IO is the bottleneck and the data are a good candidate for compression). We've been using zfs built-in compression (lzjb) for quite some time on LDAP servers - the on-disk database size was reduced 2x and we've also gained some performance. It would be interesting to try ZFS/gzip.

Ditto block support for data blocks was integrated in b61. It means that we can set a new property on a per-fs basis (zfs set copies=N fs, N=1 by default) to instruct zfs to write N (1-3) copies of data regardless of the pool's protection. As with ditto blocks for metadata, if your pool has several vdevs each copy will be on a different disk.

ZFS support for iSCSI was integrated in b54. It greatly simplifies exposing ZVOLs via iSCSI in the same way sharenfs simplifies sharing file systems over nfs.

In case you haven't noticed, the 'zpool history' feature was integrated in b51. It stores the zfs command history in the pool itself so you can see what was happening.

Of course lots of bug and performance fixes were also integrated recently.

Monday, March 19, 2007

ZFS online replication

During last Christmas I was playing with ZFS code again and I figured out that adding online replication of ZFS file systems should be quite easy to implement. By online replication I mean a one-to-one relation between two file systems, potentially on different servers, where all modifications done to one file system are asynchronously replicated to the other one with a small delay (a few seconds). Additionally one should be able to snapshot the remote file system independently to get point-in-time copies, and to resume replication from snapshots created automatically on both ends at given intervals. The good thing is that once you're just a few seconds behind you should get all transactions from memory, so you get a remote copy of your file system without generating any additional IOs on the one being backed up.

For various reasons I haven't done it myself; instead I asked one of my developers to actually implement such a tool, and here we are :)

bash-3.00# zfs list
solaris 5.13G 11.6G 24.5K /solaris
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# zfs create solaris/d100
bash-3.00# zfs list
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/testws 5.13G 11.6G 5.13G /export/testws/

Now in another terminal:

bash-3.00# ./zreplicate send solaris/d100 | ./zreplicate receive solaris/d100-copy

Back to original terminal:

bash-3.00# zfs list
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 24.5K 11.6G 24.5K /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
bash-3.00# zfs list
solaris 5.15G 11.6G 26.5K /solaris
solaris/d100 12.0M 11.6G 12.0M /solaris/d100
solaris/d100-copy 12.0M 11.6G 12.0M /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# rm /solaris/d100/boot_archive
bash-3.00# zfs list
solaris 5.14G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 12.0M 11.6G 12.0M /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# zfs list
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 24.5K 11.6G 24.5K /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/

bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
[stop replication in another terminal]
bash-3.00# zfs mount -a
bash-3.00# digest -a md5 /solaris/d100/boot_archive
bash-3.00# digest -a md5 /solaris/d100-copy/boot_archive

Bingo! All modifications to solaris/d100 are automatically replicated to solaris/d100-copy. Of course you can replicate over the network to a remote server using ssh.

There're still some minor problems but generally the tool works as expected.

Once the first phase is implemented we will probably start the second one - to implement a tool to manage replication between servers (like automatic replication setup when a new file system is created, replication resume in case of a problem, etc.).

There are other approaches which create snapshots and then incrementally replicate them to the remote side at given intervals. While our approach is very similar, it's more elegant and gives you almost on-line replication. What do you think?
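
For comparison, one cycle of the snapshot-based approach mentioned above can be sketched with the stock zfs commands (zfs snapshot, zfs send -i, zfs receive). This is not how zreplicate works internally. The helper name, the snapshot names t1/t2 and the optional ssh host are all made up for illustration. The sketch only builds and prints the command lines and does not run them.

```python
import shlex


def incremental_replication_cmds(src_fs, dst_fs, prev_snap, new_snap, remote_host=None):
    """Return the shell commands for one snapshot-based replication cycle.

    Assumes an initial full 'zfs send' of src_fs@prev_snap was already
    received into dst_fs, and that dst_fs was not modified since then.
    """
    # 1. Take a new point-in-time snapshot on the source.
    snapshot = ["zfs", "snapshot", f"{src_fs}@{new_snap}"]
    # 2. Send only the blocks changed between the two snapshots.
    send = ["zfs", "send", "-i", f"{src_fs}@{prev_snap}", f"{src_fs}@{new_snap}"]
    # 3. Apply the stream on the target, optionally on another host via ssh.
    receive = ["zfs", "receive", dst_fs]
    if remote_host:
        receive = ["ssh", remote_host] + receive
    pipeline = " | ".join(shlex.join(cmd) for cmd in (send, receive))
    return [shlex.join(snapshot), pipeline]


# Hypothetical snapshot names; the file system names are the ones used above.
for cmd in incremental_replication_cmds("solaris/d100", "solaris/d100-copy", "t1", "t2"):
    print(cmd)
```

The replication delay of this scheme is bounded by the snapshot interval, which is why a tool that streams changes continuously gets closer to on-line replication.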

Friday, March 16, 2007

IPMP - Next

If you're interested in what IPMP is going to look like in the near future, check this blog entry. I really like this.

ps. don't miss this document.

Tuesday, March 13, 2007

Open Solaris Starter Kit

Do you want to try Open Solaris? Go and get the free Starter Kit.

ps. Jim posted that most orders of the kit are from Russia and Poland!

Tuesday, March 06, 2007

Tuesday, February 27, 2007

Data corruption on SATA array

Below is today's Security Alert - another good example that even with currently shipping arrays data corruption still happens in array firmware, and ZFS is the only file system which is able to detect such corruption and correct it.

That alert explains some issues here...

Sun Alert ID: 102815
Synopsis: SE3310/SE3320/SE3510/SE3511 Storage Arrays May
Experience Data Integrity Events
Product: Sun StorageTek 3510 FC Array, Sun StorEdge 3310 NAS
Array, Sun StorageTek 3320 SCSI Array, Sun
StorageTek 3511 SATA Array
Category: Data Loss, Availability
Date Released: 22-Feb-2007

To view this Sun Alert document please go to the following URL:

Sun(sm) Alert Notification

* Sun Alert ID: 102815
* Synopsis: SE3310/SE3320/SE3510/SE3511 Storage Arrays May Experience Data Integrity Events
* Category: Data Loss, Availability
* Product: Sun StorageTek 3510 FC Array, , Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
* BugIDs: 6511494
* Avoidance: Workaround
* State: Workaround
* Date Released: 22-Feb-2007
* Date Closed:
* Date Modified:

1. Impact

System panics and warning messages on the host Operating System may occur due to a filesystem reading and acting on incorrect data from the disk or a user application reading and acting on incorrect data from the array.

2. Contributing Factors

This issue can occur on the following platforms:

* Sun StorEdge 3310 (SCSI) Array with firmware version 4.11K/4.13B/4.15F (as delivered in patch 113722-10/113722-11/113722-15)
* Sun StorageTek 3320 (SCSI) Array with firmware version 4.15G (as delivered in patch 113730-01)
* Sun StorageTek 3510 (FC) Array with firmware version 4.11I/4.13C/4.15F (as delivered in patch 113723-10/113723-11/113723-15)
* Sun StorageTek 3511 (FC) Array with firmware version 4.11I/4.13C/4.15F (as delivered in patch 113724-04/113724-05/113724-09)

The above raid arrays (single or double controller) with "Write Behind Caching" enabled on Raid 5 LUNs (or other raid level LUNs and an array disk administration action occurs), can return stale data when the i/o contains writes and reads in a very specific pattern. This pattern has only been observed in UFS metadata updates but could be seen in other situations.

3. Symptoms

Filesystem warnings and panics occur and with no indication of an underlying storage issue. For UFS these messages could include:

"panic: Freeing Free Frag"
WARNING: /: unexpected allocated inode XXXXXX, run fsck(1M) -o f
WARNING: /: unexpected free inode XXXXXX, run fsck(1M) -o f

This list is not exhaustive and other symptoms of stale data read might be seen.

Solution Summary
4. Relief/Workaround

Disable the "write behind" caching option inside the array using your preferred array administration tool (sccli(1M) or telnet). This workaround can be removed on final resolution.

Use ZFS to detect (and correct if configured) the Data Integrity Events.

If not using a filesystem make sure your application has checksums and identity information embedded in its disk data so it can detect Data Integrity Events.

Migrating back to 3.X firmware is a major task and is not recommended.

5. Resolution

A final resolution is pending completion.

Thursday, February 22, 2007

ldapsearch story

Recently we've migrated a system with a lot of small scripts, programs, etc. running on Linux to Solaris. The problem with such systems, running a lot of different relatively small scripts and programs written by different people over a long period of time, is that it's really hard to tell which of those programs eat the most CPU or which of them generate the most IOs, etc. Ok, it is hard to almost impossible to tell on Linux - more like guessing than real data.

But once we migrated to Solaris we quickly tried dtrace. And the result was really surprising - of all those things, the application (by name) which eats most of the cpu is the ldapsearch utility. It's been used by some scripts but no one expected it to be the top application. As many of those scripts are written in Perl we tried to use the ldap perl module instead, and once we did it for some scripts their CPU usage dropped considerably, ending up somewhere in the noise of all the other applications.

How hard was it to reach that conclusion?

dtrace -n sched:::on-cpu'{self->t=timestamp;}' \
-n sched:::off-cpu'/self->t/{@[execname]=sum(timestamp-self->t);self->t=0;}'
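
When interrupted, the one-liner prints one line per execname with the summed on-CPU time in nanoseconds. A small post-processing sketch can turn that into percentages. The function name and the sample numbers below are made up for illustration and are not measurements from the system described here.

```python
def cpu_share(dtrace_output):
    """Turn 'execname  nanoseconds' lines into (execname, percent) pairs, largest first."""
    totals = {}
    for line in dtrace_output.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank lines and dtrace status messages
        name, ns = parts[0].strip(), int(parts[1])
        totals[name] = totals.get(name, 0) + ns
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100.0 * ns / grand_total) for name, ns in ranked]


# Made-up sample in the shape of dtrace's default aggregation output.
sample = """
dtrace: description 'sched:::on-cpu' matched 3 probes
  perl                                                     1000000000
  sched                                                    4000000000
  ldapsearch                                               5000000000
"""
for name, pct in cpu_share(sample):
    print(f"{name:<12} {pct:5.1f}%")
```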

Another interesting thing is that when we use built-in zfs compression (lzjb) we get about a 3.6 compression ratio for all our collected data - yes, that means the data are reduced over 3 times in size on disk without changing any applications.
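
To put the 3.6 ratio into numbers, here is a short worked example. It assumes the ratio is logical size divided by on-disk size, which is how ZFS defines its compressratio property. The 100 GB data set is a hypothetical figure chosen only to make the arithmetic concrete.

```python
def on_disk_size(logical_size, compress_ratio):
    """On-disk size for a given logical size, with ratio = logical / on-disk."""
    return logical_size / compress_ratio


def space_saved_fraction(compress_ratio):
    """Fraction of the logical size that does not need to be stored."""
    return 1.0 - 1.0 / compress_ratio


RATIO = 3.6          # compression ratio quoted above
LOGICAL_GB = 100.0   # hypothetical data set size, for illustration only

print(f"on disk: {on_disk_size(LOGICAL_GB, RATIO):.1f} GB")
print(f"saved:   {space_saved_fraction(RATIO) * 100:.1f}%")
```

So a ratio of 3.6 means roughly 72% of the space is saved, with no changes to the applications writing the data.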

Disk failures in data center

A really interesting paper from Google about disk failures in a large disk population (over 100k). It also covers whether SMART can be reliably used for disk failure prediction.

There's also another paper presented by people from Carnegie Mellon University.

Some observations are really surprising - like you get a higher probability of disk failure if the disk is running below 20C than if it's running at about 40C. Another interesting thing is that SATA disks seem to have a similar ARR to FC and SCSI disks.

Anyone managing "disks" should read those two papers.

update: NetApp response to above papers.
Also interesting paper from Seagate and Microsoft on SATA disks.

This time IBM responded.

Thursday, January 25, 2007


I've been watching a Physics course taught by Richard Muller at the University of California at Berkeley. The course is called Physics for future Presidents (PffP) or Physics10. If you want to refresh your knowledge about atom bombs, relativity or quantum physics without going into much detail and math, but rather to understand the ideas behind them, PffP is for you. I promise you will learn something even if you've studied physics - actually you'll probably learn more than during your studies :) The most important thing about this course is that you'll learn a lot about real world problems and get answers to questions you've always had in your mind.

I must say that Richard Muller is a GREAT teacher, one in a million, I wish I had such teachers during my studies. He's one of those guys who are really passionate about what they are doing and at the same time they have that gift to share their knowledge and passion in such a way you just learn and enjoy the whole experience.

There are 26 lectures during a semester, each about 1h long. All lectures are available online on the PffP page or you can watch them directly on Berkeley pages here. Right now a new semester has started, so if you want to watch a whole semester quicker just see the last one here.

Just give it a try, I'm sure you won't regret it.

update: why

Wednesday, January 24, 2007

IP Instances

The IP Instances project has just been integrated. Basically it lets you set a special parameter for a local zone so the IP stack will be virtualized for it and you will be able to configure network interfaces from inside the zone, configure routing, ipfilter, etc. See the latest changelog.

ps. it also looks like some code for Xen was integrated - so perhaps we'll see Xen integrated soon

Tuesday, January 23, 2007

Sun sells Intel CPUs again

Sun will sell servers and workstations with Intel CPUs in addition to AMD and SPARC chips. Generally more choice given to customers is good. Official announcement here.

Monday, January 15, 2007

Solaris 10 on IBM LS21

We did install Solaris 10 on LS20 blades some time ago without any problems - it just works. However when LS21 blades arrived later, despite IBM listing Solaris 10 as a supported system, we couldn't get it working - Solaris didn't even want to boot a kernel. That was bad. So we asked IBM how to install Solaris 10 on its blades as it's supported by them. Just in case I also asked on the Solaris-x86 mailing list. Almost immediately Mike Riley from Sun offered his help - he asked internal Sun people who were testing Solaris on IBM's blades how they did it. Next day I got the solution forwarded by Mike and it worked! Thank you very much.

Only two weeks later I got an answer from IBM that although Solaris 10 is listed as supported it's not yet supported on LS21 and it should be in early 2007. Well, I had it already working for two weeks by then.

So in order to install Solaris 10 on LS21 you need to go to BIOS and:
1) Select "Advanced Setup", and then "PCI Bus Control" and then select the
"PCI Enhanced Configuration Access" option.
2) Press the <> or to select "Enabled".
3) Select "Save Settings"

You also need the bnx driver to get the network working. The driver is provided by IBM on a CD or you can download it directly from Broadcom here. I added it to the jumpstart install and then we could install the system using jumpstart over the network.

U_MTTDL vs Space

Another great blog entry from Richard. A must-read for all x4500 users.

Friday, January 12, 2007

MTTDL vs Space

Really nice and informative blog entry about Mean Time To Data Loss (MTTDL) for different RAID levels.

Saturday, January 06, 2007

Funny Sun Commercials

Episode 1

Episode 2

Episode 3

Episode 4

Is EMC afraid of ZFS?

While adding ZFS support to VCS wouldn't be hard it's not there yet and I guess it won't be there for some time. I don't think it's a good strategy to lock-in customers. Ok, this entry was supposed to be about EMC. I was looking into the EMC Support Matrix today and was surprised. Look at page #698 and reference #22 (document dated 12/29/2006) or just search for ZFS and you will find: "EMC supports ZFS Version 3 or higher without Snapshot and Clone features.". Now that's really interesting. In what sense do they support ZFS on DMX or Clariion boxes? Since when do they care about which file system you're going to put on their arrays? Not to mention specific features of those file systems. Or maybe it's just that in many situations ZFS's snapshots and clones make similar functionality on their hardware obsolete? And of course when using ZFS you don't have to pay additional license fees for Time Finder, etc. and buy more disks.

I can understand they are afraid that because of ZFS in some cases people will spend less money on buying hardware and some proprietary software, and they are right. To their hardware it really doesn't matter what file system you put on it or which features of that file system you are going to use.

Don't get me wrong - EMC makes really good hardware and software to manage it, and they also have excellent support. But imposing such restrictions is just plain stupid - I understand they want to force people to pay for Time Finder even if ZFS would be good enough or better, not to mention cheaper (ZFS is free).

On the other hand, if vendors like EMC start to notice ZFS just 6 months after it went into stable Solaris, that is good and probably means customers are asking for it.

ps. of course ZFS's snapshots and clones work perfectly with Symmetrix boxes and Clariions

update: in latest EMC's support matrix EMC now says:

EMC supports ZFS in Solaris 10 11/06 or later. Snapshot and Clone features are supported only through Sun.
That's good news.

Friday, January 05, 2007


Dominic Kay has posted some nice benchmarks for ZFS vs VxFS (1 2). He also showed the difference between managing VxVM/VxFS and ZFS. Believe me, ZFS is the easiest enterprise Volume Manager + File System on the market. When you've got to create a few dozen file systems on an array, or you've got to create some file systems but you're really not sure how much space to assign to each one, then ZFS is your only friend.

ps. I don't know which Solaris version was used - and it can make some difference when it comes to ZFS