Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB uses mmap() to access all of its files. It also has a special thread which, by default, wakes up every 60s and calls msync() for all the mmap'ed files, one at a time. Initially, when you start MongoDB, all these msync()s are fast, assuming there are no modifications. However, if your server has hundreds of GBs of RAM (or more) and the database is also that big, the msync()s get slower over time - they take longer the more data is cached in RAM. Eventually it can take 50s or more for the thread to finish syncing all of the files, even if there is nothing to write to disk. The actual problem is that while the thread is syncing all the files it holds a global lock, and until it finishes the database is almost useless. If it takes 50s to sync all the files, then the database can process requests for only 10s out of each 60s window...
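
For reference, the access pattern in question boils down to something like the sketch below (not MongoDB's actual code; the file name and interval are just examples):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical data file; MongoDB maps many files like this. */
        int fd = open("/data/db/test.0", O_RDWR);
        struct stat st;

        if (fd == -1 || fstat(fd, &st) == -1) {
                perror("open/fstat");
                return 1;
        }

        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Background flusher: every 60s sync the entire mapping.
         * This whole-file msync() is the call that used to crawl on ZFS
         * when most of the file was cached in RAM. */
        for (;;) {
                sleep(60);
                if (msync(p, st.st_size, MS_SYNC) == -1)
                        perror("msync");
        }
}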

If you have logging enabled in MongoDB you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files

On Linux this is much faster, as it has a special optimization for this case which Solaris didn't have.
However, Oracle fixed the bug some time ago and now the same database reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

This is over a 3000x improvement!
 
The Solaris Bug ID is: 18658199 "Speed up msync() on ZFS by 90000x with this one weird trick",
which is fixed in Solaris 11.1 SRU21 and also in Solaris 11.2.

Note that the fix only improves the case where an entire file is msync'ed and the underlying file system is ZFS. Any application with similar behavior would benefit. For some large mappings, like 1TB, the improvement can be as high as 178,000x.


Wednesday, April 09, 2014

Slow process core dump

Solaris limits the rate at which a process core dump is generated. There are two tunables with default values: core_delay_usec=10000 and core_chunk=32 (in pages). This essentially means that a core dump is limited to 100 writes of 128KB per second (32x 4KB pages on x86), which is the equivalent of 12.5MB/s - for large processes this will take quite a long time.
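
A quick back-of-the-envelope check of those numbers (assuming 4KB pages and ignoring the time spent in the writes themselves):

#include <stdio.h>

/* Models the throttle: after every chunk of core_chunk pages written,
 * the kernel pauses for core_delay_usec microseconds. */
static double dump_rate_mb_s(long chunk_pages, long delay_usec, long pagesize)
{
        double bytes_per_chunk = (double)chunk_pages * pagesize;
        double chunks_per_sec = 1e6 / delay_usec;
        return bytes_per_chunk * chunks_per_sec / (1024 * 1024);
}

int main(void)
{
        /* Defaults: 32 pages x 4KB = 128KB per chunk, 100 chunks/s = 12.5MB/s */
        printf("default: %.1f MB/s\n", dump_rate_mb_s(32, 10000, 4096));
        /* Note: core_delay_usec=0 disables the pause entirely, so the rate
         * is then bounded by the storage, not by this formula. */
        return 0;
}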

I set core_chunk=128 and core_delay_usec=0 via mdb, which increased the core dump write rate from 12.5MB/s to about 300-600MB/s.

Long core dump generation can cause issues like a delay in restarting a service under SMF. On the other hand, making it too fast might overwhelm the OS and impact other processes running there. Still, I think the defaults are probably too conservative.

update: I just noticed that this has been fixed in Illumos, see https://illumos.org/issues/3673
 

Thursday, April 03, 2014

Home NAS - Data Corruption

I value my personal pictures, so I store them on a home NAS running Solaris/ZFS. The server has two data disks, mirrored. It had been running perfectly fine for the last two years, but recently one of the disk drives returned corrupted data during a ZFS scrub. If it weren't for ZFS I probably wouldn't even know that some pixels are wrong... or perhaps that some pictures are missing... This is exactly why I've been using ZFS on my home NAS.

# zpool status vault
  pool: vault
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 3h54m with 0 errors on Tue Mar 11 17:34:22 2014
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c5t0d0  ONLINE       0     0     0

errors: No known data errors

Friday, January 31, 2014

SMF: svcbundle

One of the things missing in SMF, for me, is the ability to mark a service to be disabled or enabled after reboot during the postinstall phase of OS installation. One way of doing it is to generate a site profile, but this is rather cumbersome to do from a postinstaller. There is a better way, though - svcbundle(1M). For example, to generate a profile which would enable ssh after reboot (or after the manifest-import service is restarted):
svcbundle -s bundle-type=profile -s service-name=network/ssh \
          -s enabled=true >/etc/svc/profile/site/postinstall-ssh.xml
You can modify any parameter you want. For example, to modify the sendmail service to use /etc/mail/cf/cf/nullclient.mc to generate its configuration each time it starts:
svcbundle -s bundle-type=profile -s service-name=network/smtp -s instance-name=sendmail \
          -s instance-property-config:path_to_sendmail_mc:astring:/etc/mail/cf/cf/nullclient.mc \
          >/etc/svc/profile/site/postinstall-sendmail-nullclient.xml

Thursday, January 23, 2014

mkdir() performance

Update: the fix is in Solaris 11.1 SRU17 and should be in Solaris 11.2 once it is out. Solaris now has an optimization similar to the Linux one. Network-based file systems like AFS or NFS benefit most from it.

Recently I came across an issue where 'make install' on a Solaris server was taking *much* more time than on Linux. Files were being installed into an AFS file system. After some debugging I found that GNU install calls mkdir() for every directory in a specified path and relies on EEXIST if a given directory already exists. For example:

$ truss -D -t mkdir /usr/bin/ginstall -c -d \
 /ms/dev/openafs/core/1.6.5-c3/compile/x86_64.sunos64.5.11/sunx86_511/dest/bin
0.0026 mkdir("/ms", 0755)                              Err#17 EEXIST
0.0003 mkdir("dev", 0755)                              Err#30 EROFS
0.0002 mkdir("openafs", 0755)                          Err#30 EROFS
0.0002 mkdir("core", 0755)                             Err#30 EROFS
0.0083 mkdir("1.6.5-c3", 0755)                         Err#17 EEXIST
3.0085 mkdir("compile", 0755)                          Err#17 EEXIST
3.0089 mkdir("x86_64.sunos64.5.11", 0755)              Err#17 EEXIST
0.0005 mkdir("sunx86_511", 0755)                       Err#17 EEXIST
0.0002 mkdir("dest", 0755)                             Err#17 EEXIST
0.0065 mkdir("bin", 0755)                              Err#17 EEXIST
$

Notice that two of the mkdir()s took about 3s each! Now, if there are lots of directories to be ginstall'ed, this will take a very long time... I couldn't reproduce it on Linux, though. What actually happens is that Linux checks at the VFS layer whether there is a valid dentry with an inode allocated, and if there is, it returns EEXIST without calling the file-system-specific VOP_MKDIR. The relevant code is:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/namei.c
…
int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
{
        int error = may_create(dir, dentry);
        unsigned max_links = dir->i_sb->s_max_links;

        if (error)
               return error;

        if (!dir->i_op->mkdir)
               return -EPERM;
…
static inline int may_create(struct inode *dir, struct dentry *child)
{
        audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
        if (child->d_inode)
               return -EEXIST;
…

Unfortunately, Solaris doesn't have this optimization (though it does optimize a couple of other cases, for example EROFS), so each mkdir() results in VOP_MKDIR being called, and for AFS that means sending a request over the network to a file server and waiting for a reply. That alone makes it slower than on Linux, but it still doesn't explain the 3s.
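
For illustration, what GNU install effectively does boils down to something like this (a simplified sketch, not the actual coreutils code; the target path is hypothetical):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create every component of 'path', treating "already exists" (and, to keep
 * the sketch close to the truss output above, EROFS) as success. On Solaris
 * each of these mkdir() calls reaches VOP_MKDIR - a network round trip on
 * AFS/NFS - while on Linux the VFS returns EEXIST from the dentry cache
 * without ever asking the file system. */
static int mkdir_components(const char *path, mode_t mode)
{
        char buf[1024];
        size_t len = strlen(path);

        if (len >= sizeof(buf)) {
                errno = ENAMETOOLONG;
                return -1;
        }
        for (size_t i = 1; i <= len; i++) {
                if (path[i] != '/' && path[i] != '\0')
                        continue;
                memcpy(buf, path, i);
                buf[i] = '\0';
                if (mkdir(buf, mode) == -1 &&
                    errno != EEXIST && errno != EROFS) {
                        perror(buf);
                        return -1;
                }
        }
        return 0;
}

int main(void)
{
        /* Hypothetical target, similar in spirit to the ginstall example. */
        return mkdir_components("/tmp/a/b/c/d", 0755) == 0 ? 0 : 1;
}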

It turned out that the AFS file server has a throttling mechanism - if a client generates requests which result in errors, then after 10 errors the server by default delays its replies to that client. This can be disabled, or the threshold can be adjusted - see the -abortthreshold option to the file server.

This was also tested (by an Oracle Solaris engineer) over NFS and showed a 100x difference in response time. There is a negligible difference for local file systems.

A bug was opened against Solaris to get it fixed - see Bug 18115102 - mkdir(2) calls VOP_MKDIR() even though the directory or file already exists

Hopefully it will get fixed soon.

Tuesday, January 21, 2014

Systemtap

I had to measure what's going on in the kernel for a specific VFS call, both on Linux and on Solaris. On Solaris it was quick and easy - I got an answer in a couple of minutes. On Linux... Systemtap didn't like the version of kernel-debuginfo installed (and the error message wasn't that obvious either), so I had to uninstall it and install the proper one, as you can't have multiple versions installed - which would be useful when multiple kernels are installed...

While I like Systemtap in principle, in practice every time I use it I find it very frustrating, especially when I have to measure something similar on Solaris with DTrace, which is usually so much easier and quicker to do.

Then again, we do not use Systemtap on prod servers... within minutes I unintentionally crashed a Linux box by tracing with Systemtap :(

Now, Oracle Linux has DTrace... :)

Friday, January 17, 2014

AFS Client - better observability

Wouldn't it be nice if fsstat(1M) and the fsinfo::: DTrace provider worked with the AFS client? It turned out this was trivial to implement - Solaris itself will provide the necessary stats for fsstat and register the fsinfo::: probes if one extra flag is specified. Thanks to Andrew Deason for implementing it in AFS, and big thanks to Oracle's engineers for providing information on how to do it. See the actual patch to AFS.
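
For the curious, the "extra flag" lives in the file system's registration with the kernel. Roughly, based on the illumos-derived VFS interfaces (a sketch only - the init routine name is hypothetical and this is kernel module code, not a standalone program):

#include <sys/vfs.h>

/* Kernel-side sketch: a file system registers itself via a vfsdef_t, and
 * setting VSW_STATS in the flags field tells the VFS layer to maintain
 * per-filesystem vopstats - which is what fsstat(1M) and the fsinfo:::
 * provider report. afs_vfsinit is a hypothetical name for the module's
 * init routine. */
extern int afs_vfsinit(int fstype, char *name);

static vfsdef_t afs_vfsdef = {
        VFSDEF_VERSION,
        "afs",                  /* file system type name */
        afs_vfsinit,            /* init routine */
        VSW_STATS,              /* the one extra flag */
        NULL                    /* no mount option template in this sketch */
};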

Let's see fsstat in action against an AFS mountpoint:
$ fsstat /ms 1
new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0 10.5K     0   173K   939 20.7K 72.8M     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0    26     0    296     0     9 8.06K     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0   170     0  2.36K    10   106  265K     0     0 /ms
    0     0     0   159     0  1.97K     2    90  261K     0     0 /ms
    0     0     0    25     0    331     0 1.05K 4.15M     0     0 /ms
    0     0     0   138     0  1.80K     0 1.48K 5.37M     0     0 /ms
    0     0     0   360     0  4.72K    12 1.42K 5.35M     0     0 /ms
    0     0     0   122     0  1.63K     0 1.30K 4.68M     0     0 /ms
Now the DTrace fsinfo::: provider showing AFS-related operations:
$ dtrace -q -n fsinfo:::'/args[0]->fi_fs == "afs"/ \
               {printf("%Y %s[%d] %s %s\n", walltimestamp, execname, pid, probename, args[0]->fi_pathname);}'

2014 Jan 16 16:49:07 ifstat[1485] getpage /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/aurora
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/aurora/bin
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/aurora
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/aurora/bin
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/fsf
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/fsf/bin
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/fsf
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/fsf/bin
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 ifstat[1485] close /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[964] open /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[964] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] open /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 tcpstat[1484] getpage /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl

Tuesday, November 12, 2013

ZFS Appliance

When building a storage appliance based on ZFS, one of the important features is the ability to identify physical disk locations, which is hard to do for SAS disks and easier for SATA. Solaris 11 has a topology framework which makes this much easier, and it is nicely integrated with various subsystems like FMA and ZFS. Recently I came across this blog entry which highlights this specific issue of how to identify physical disk locations.

The other important factor is ease of use - in order to replace a failed disk drive, one should be able to pull out the bad one, put in a replacement, and that's it - all the rest should happen automatically. There shouldn't be any need to log in to the OS and issue commands to assist with the replacement. Again, this is how things are in Solaris 11.
 
Let's see how it works in practice. Recently one disk reported two read errors and multiple checksum errors during a zpool scrub. Because the affected pool is redundant (RAID-10), ZFS was able to detect the corruption, serve good data from the other disk, and fix the corrupted blocks on the affected disk.
The story doesn't end here, though - FMA decided that too many checksum errors had been reported for a single disk, so it activated a hot spare as a precaution, to proactively protect against the bad disk misbehaving again. This is how the pool looked after the hot spare was fully attached:

# zpool status -v pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        pool-0                       DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c0t5000CCA0165FC0F8d0    ONLINE       0     0     0
            c0t5000CCA016217040d0    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c0t5000CCA01666AB64d0    ONLINE       0     0     0
            c0t5000CCA0166F3BB8d0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c0t5000CCA0166F36C8d0    ONLINE       0     0     0
            c0t5000CCA01661894Cd0    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c0t5000CCA0166BE338d0    ONLINE       0     0     0
            c0t5000CCA016626340d0    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c0t5000CCA0166DC81Cd0    ONLINE       0     0     0
            c0t5000CCA016685238d0    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c0t5000CCA016636CA4d0    ONLINE       0     0     0
            c0t5000CCA016687528d0    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c0t5000CCA0166DC944d0    ONLINE       0     0     0
            c0t5000CCA0166DC0CCd0    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c0t5000CCA0166F4178d0    ONLINE       0     0     0
            c0t5000CCA01668DCC0d0    ONLINE       0     0     0
          mirror-8                   DEGRADED     0     0     0
            c0t5000CCA0166DD7BCd0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c0t5000CCA016671600d0  DEGRADED     2     0    49
              c0t5000CCA0166876F8d0  ONLINE       0     0     0
          mirror-9                   ONLINE       0     0     0
            c0t5000CCA0166DC20Cd0    ONLINE       0     0     0
            c0t5000CCA0166877BCd0    ONLINE       0     0     0
          mirror-10                  ONLINE       0     0     0
            c0t5000CCA0166F3334d0    ONLINE       0     0     0
            c0t5000CCA0166BDD2Cd0    ONLINE       0     0     0
        spares
          c0t5000CCA0166876F8d0      INUSE
          c0t5000CCA0166DCAACd0      AVAIL

device details:

        c0t5000CCA016671600d0      DEGRADED       too many errors
        status: FMA has degraded this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/DISK-8000-D5 for recovery


errors: No known data errors

Note that the mirror-8 vdev is now a 3-way mirror - since the affected disk is still functional it wasn't detached, but in case it goes really bad the hot spare is already attached. Below is what FMA reported:

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 02 05:29:39 d10c88f7-8e31-ce12-ab5c-8a759cf875c3  DISK-8000-D5   Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle-Corporation
    Name          : SUN-FIRE-X4270-M3
    Part_Number   : 31792382+1+1
    Serial_Number : XXXXXXXX
    Host_ID       : 004858f6

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.io.scsi.disk.csum-zfs.transient
   Certainty   : 100%
   Affects     : dev:///:devid=id1,sd@n5000cca016671600//scsi_vhci/disk@g5000cca016671600
   Status      : faulted but still providing degraded service

   FRU
     Location         : "HDD17"
     Manufacturer     : HITACHI
     Name             : H109090SESUN900G
     Part_Number      : HITACHI-H109090SESUN900G
     Revision         : A31A
     Serial_Number    : XXXXXXXX
     Chassis
        Manufacturer  : Oracle-Corporation
        Name          : SUN-FIRE-X4270-M3
        Part_Number   : 31792382+1+1
        Serial_Number : XXXXXXXX
        Status        : faulty

Description : There have been excessive transient ZFS checksum errors on this
              disk.

Response    : A hot-spare disk may have been activated.

Impact      : If a hot spare is available it will be brought online and during
              this time I/O could be impacted. If a hot spare isn't available
              then I/O could be lost and data corruption is possible.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/DISK-8000-D5 for the latest service
              procedures and policies regarding this diagnosis.

Notice that FMA reports the affected disk's location as HDD17 - this corresponds to the HDD17 slot on the X3-2L server, so we know exactly which disk to replace. We can also get physical disk locations from the zpool status command:

# zpool status -l pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
        Run 'zpool status -v' to see device specific details.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                               STATE     READ WRITE CKSUM
        pool-0                             DEGRADED     0     0     0
          mirror-0                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk    ONLINE       0     0     0
          mirror-1                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk    ONLINE       0     0     0
          mirror-2                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk    ONLINE       0     0     0
          mirror-3                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk    ONLINE       0     0     0
          mirror-4                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk    ONLINE       0     0     0
          mirror-5                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk    ONLINE       0     0     0
          mirror-6                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk    ONLINE       0     0     0
          mirror-7                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk    ONLINE       0     0     0
          mirror-8                         DEGRADED     0     0     0
            /dev/chassis/SYS/HDD16/disk    ONLINE       0     0     0
            spare-1                        DEGRADED     0     0     0       
              /dev/chassis/SYS/HDD17/disk  DEGRADED     2     0    49
              /dev/chassis/SYS/HDD22/disk  ONLINE       0     0     0
          mirror-9                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk    ONLINE       0     0     0
          mirror-10                        ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk    ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk      INUSE
          /dev/chassis/SYS/HDD23/disk      AVAIL

errors: No known data errors

Now it is up to us whether we run a scrub, wait a few days and, if there are no new errors, clear the pool status and deactivate the hot spare - or, if we don't want to take any chances, replace the affected disk drive. We decided to replace it. The disk in bay 17 of the X3-2L was physically pulled out and a replacement was put in its place. Since we have the autoreplace property enabled on the pool, FMA/ZFS automatically put an EFI label on the new disk and attached it to the pool; once it had fully synchronized, the hot spare was detached and made available again. We didn't have to log in to the OS or coordinate in any way with the physical disk replacement. Here is what zpool history looked like:

# zpool history -i pool-0
…
2013-11-06.20:38:03 [internal pool scrub txg:1709498] func=2 mintxg=3 maxtxg=1709499 logs=0
2013-11-06.20:38:17 [internal vdev attach txg:1709501] replace vdev=/dev/dsk/c0t5000CCA016217834d0s0 \
                                                           for vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:33 [internal pool scrub done txg:1710852] complete=1 logs=0
2013-11-06.23:01:34 [internal vdev detach txg:1710854] vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:39 [internal vdev detach txg:1710855] vdev=/dev/dsk/c0t5000CCA0166876F8d0s0

Let's see the pool status after the replacement disk fully synchronized:

# zpool status -l pool-0
  pool: pool-0
state: ONLINE
  scan: resilvered 459G in 2h23m with 0 errors on Wed Nov  6 23:01:34 2013
config:

        NAME                             STATE     READ WRITE CKSUM
        pool-0                           ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk  ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk  ONLINE       0     0     0
          mirror-2                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk  ONLINE       0     0     0
          mirror-3                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk  ONLINE       0     0     0
          mirror-4                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk  ONLINE       0     0     0
          mirror-5                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk  ONLINE       0     0     0
          mirror-6                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk  ONLINE       0     0     0
          mirror-7                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk  ONLINE       0     0     0
          mirror-8                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD16/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD17/disk  ONLINE       0     0     0
          mirror-9                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk  ONLINE       0     0     0
          mirror-10                      ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk  ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk    AVAIL
          /dev/chassis/SYS/HDD23/disk    AVAIL

errors: No known data errors

All is back to normal. We can also check whether all disks, including the new one, have the same part number, firmware level, etc.

# diskinfo -t disk -o Rcmenf1
R:receptacle-name c:occupant-compdev    m:occupant-mfg e:occupant-model n:occupant-part          f:occupant-firm 1:occupant-capacity
----------------- --------------------- -------------- ---------------- ------------------------ --------------- -------------------
SYS/HDD00         c0t5000CCA0165FC0F8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD01         c0t5000CCA016217040d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD02         c0t5000CCA01666AB64d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD03         c0t5000CCA0166F3BB8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD04         c0t5000CCA0166F36C8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD05         c0t5000CCA01661894Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD06         c0t5000CCA0166BE338d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD07         c0t5000CCA016626340d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD08         c0t5000CCA0166DC81Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD09         c0t5000CCA016685238d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD10         c0t5000CCA016636CA4d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD11         c0t5000CCA016687528d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD12         c0t5000CCA0166DC944d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD13         c0t5000CCA0166DC0CCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD14         c0t5000CCA0166F4178d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD15         c0t5000CCA01668DCC0d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD16         c0t5000CCA0166DD7BCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD17         c0t5000CCA016217834d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD18         c0t5000CCA0166DC20Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD19         c0t5000CCA0166877BCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD20         c0t5000CCA0166F3334d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD21         c0t5000CCA0166BDD2Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD22         c0t5000CCA0166876F8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD23         c0t5000CCA0166DCAACd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216

This is a very cool integration of different features in Solaris 11, which makes a ZFS-based solution much more reliable and easier to support.

The topology framework should work out of the box on Oracle servers and Solaris 11. The above example is from Solaris 11.1 + SRU10 running on an X3-2L server with 24 disks in the front (and another two in the rear for the OS, mirrored by the controller itself). It has a simple HBA which presents all of the front disks as JBOD, which is perfect for ZFS.

The topology framework also works on 3rd-party hardware, but depending on the particular set-up some additional configuration steps might be required, like defining bay labels. If disks are behind a SAS expander then it is more complicated to get it working, and I'm not sure whether there is a documented procedure describing how to do it.

Friday, November 01, 2013

SYNCHRONIZE_CACHE on close()

Recently, while testing iSCSI, I noticed that when you close a raw device which is an iSCSI target, Solaris sends a SCSI SYNCHRONIZE CACHE command on close().
$ dd if=/dev/zero of=/dev/rdsk/c0t6537643965643539d0s0 bs=1b count=1
1+0 records in
1+0 records out


$ dtrace -n fbt::*SYNCHRONIZE_CACHE:entry'{printf("%s %d\n", execname, pid);stack();}'
dtrace: description 'fbt::*SYNCHRONIZE_CACHE:entry' matched 1 probe

dd 2562

              sd`sdclose+0x1c0
              genunix`dev_close+0x55
              specfs`device_close+0xb3
              specfs`spec_close+0x171
              genunix`fop_close+0x9f
              genunix`closef+0x68
              genunix`closeandsetf+0x5be
              genunix`close+0x18
              unix`_sys_sysenter_post_swapgs+0x149
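
The same thing can be reproduced without dd in a few lines of C (a sketch; the device path is just the one from the example above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Same device as in the dd example above. */
        const char *dev = "/dev/rdsk/c0t6537643965643539d0s0";
        char block[512];
        int fd;

        memset(block, 0, sizeof(block));
        fd = open(dev, O_WRONLY);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        if (write(fd, block, sizeof(block)) != sizeof(block))
                perror("write");

        /* Closing the raw device descriptor is what reaches sdclose() and,
         * in turn, triggers the SCSI SYNCHRONIZE CACHE command seen in the
         * stack trace above. */
        if (close(fd) == -1)
                perror("close");
        return 0;
}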

Thursday, October 24, 2013

ZFS Internals

In early December there will be a ZFS Internals course in London - there are still places available. If you are interested in learning DTrace, there will be a DTrace course a week earlier as well.

Friday, October 04, 2013

Solaris 11: IDRs

One of the lacking features in Solaris 11 was the way it dealt with IDRs when performing an OS update. Essentially, one had to back out an IDR manually as a separate step before an update, or add --reject IDRxxx (which wouldn't necessarily be 100% correct). However, Oracle recently started publishing IDRs in their support repo as obsoleting packages once their fixes are integrated - long story short, it means that one can now just update the OS without worrying about how to back out an IDR whose fix is in a later release. For more details see here.

Sunday, September 15, 2013

The Importance of Hiring

Adam wrote about his experience at Delphix - the main point I agree with is the importance of hiring the right people. It takes time but it pays off in the long term.

Friday, August 30, 2013

Coming Back of Big Iron?

In the past decade servers have become boring - almost everything runs on cheap x86 servers which mainly differ by color. Now, 96 CPU sockets, 1,152 cores, 9,216 threads and up to 96TB of RAM in a single server? I wouldn't mind playing with such a monster... read more