Friday, February 27, 2015

Specifying Physical Disk Locations for AI Installations

One of the very useful features in Solaris is the ability to identify physical disk locations on supported hardware (mainly Oracle x86 and SPARC servers). This not only makes it easier to identify a faulty disk to be replaced, but also makes OS installation more robust, as you can specify the physical disk locations in a given server model where the OS should be installed. Here is example output from the diskinfo tool on an X5-2L server:

$ diskinfo
D:devchassis-path            c:occupant-compdev
---------------------------  ---------------------
/dev/chassis/SYS/HDD00/disk  c0t5000CCA01D3A1A24d0
/dev/chassis/SYS/HDD01/disk  c0t5000CCA01D2EB40Cd0
/dev/chassis/SYS/HDD02/disk  c0t5000CCA01D30FD90d0
/dev/chassis/SYS/HDD03/disk  c0t5000CCA032018CB4d0
...
/dev/chassis/SYS/RHDD0/disk  c0t5000CCA01D34EB38d0
/dev/chassis/SYS/RHDD1/disk  c0t5000CCA01D315288d0

The server supports 24 disks in the front and another two disks in the back. We use the front disks for data and the two disks in the back for the OS. In the past we used a RAID controller to mirror the two OS disks, while all the disks in the front were presented in pass-thru mode (JBOD) and managed by ZFS.

Recently I started looking into using ZFS to mirror the OS disks as well. Notice in the above output that the two disks in the back of the X5-2L server are identified as SYS/RHDD0 and SYS/RHDD1.

This is very useful because with SAS the CTD would be different for each disk and would also change if a disk were replaced, while the SYS/[R]HDDn location always stays the same.

See also my older blog entry on how this information is presented in other subsystems (FMA or ZFS).

Below is the part of an AI manifest which specifies that the OS should be installed on the two rear disks, mirrored by ZFS:
    <target>
      <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
        <disk_name name="SYS/RHDD0" name_type="receptacle"/>
      </disk>
      <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
        <disk_name name="SYS/RHDD1" name_type="receptacle"/>
      </disk>
      <logical>
        <zpool is_root="true" name="rpool">
          <vdev name="mirror" redundancy="mirror"/>
        </zpool>
      </logical>
    </target>

In our environment the AI manifest is generated per server by a configuration management system, based on a host profile. This means that for X5-2L servers we generate the AI manifest as shown above, while on some other servers we want the OS installed on a RAID volume, and on a generic server which doesn't fall into any specific category we install the OS on boot_disk. So depending on the server we generate different target sections in the AI manifest. This is similar to derived manifests in AI, but instead of being separate from the configuration management system, in our case it is part of it.
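For comparison, a minimal target section for the generic boot_disk case could look roughly like this (a sketch based on the documented AI target syntax, not our exact generated manifest):

    <target>
      <disk in_zpool="rpool" whole_disk="true">
        <disk_keyword key="boot_disk"/>
      </disk>
      <logical>
        <zpool is_root="true" name="rpool"/>
      </logical>
    </target>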

Wednesday, February 04, 2015

Native IPS Manifests

We used to use the pkgbuild tool to generate IPS packages. Recently, however, I started working on an internal Solaris SPARC build and we decided to use IPS fat packages covering both x86 and SPARC, similarly to how Oracle delivers Solaris itself. We could keep using pkgbuild, but since it always sets the variant of the host it is executed on, we would have to run it once on an x86 server and once on a SPARC server, each time publishing to a separate repository, and then use pkgmerge to create a fat package and publish it into a third repository.
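Had we gone that route, the final merge step would have looked roughly like this (a sketch - the repository URLs are made up):

$ pkgmerge -d /path/to/merged-repo \
    -s arch=i386,http://x86-repo.example.com/ \
    -s arch=sparc,http://sparc-repo.example.com/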

Since we already have all our binaries compiled for all platforms, when we build a package (RPM, IPS, etc.) all we have to do is pick up the proper files, add metadata and publish the package. There is no point in having three repositories and at least two hosts involved in publishing a single package.

In our case a native IPS manifest is a better (simpler) way to do it - we can publish a fat package from a single server to its final repository in a single step.

What is also useful is that pkgmogrify transformations can be listed in the same manifest file. The entire file is loaded first, then any transformations are run in the specified order and the new manifest is printed to stdout. This means that in most cases we can have a single file for each package we want to generate, similarly to pkgbuild. There are cases with lots of files where we do use pkgsend generate to produce all the file and directory actions, and then keep a separate file with the metadata and transformations. In that case pkgbuild is a little easier to follow than what the native IPS tooling offers, but it really isn't that bad.

Let's see an example IPS manifest, with some basic transformations and with both x86 and SPARC binaries.

set name=pkg.fmri value=pkg://ms/ms/pam/access@$(PKG_VERSION).$(PKG_RELEASE),5.11-0
set name=pkg.summary value="PAM pam_access library"
set name=pkg.description value="PAM pam_access module. Compiled from Linux-PAM-1.1.6."
set name=info.classification value="com.ms.category.2015:MS/Applications"
set name=info.maintainer value="Robert Milkowski "

set name=variant.arch value=i386 value=sparc

depend type=require fmri=ms/pam/libpam@$(PKG_VERSION).$(PKG_RELEASE)

dir group=sys mode=0755 owner=root path=usr
dir group=bin mode=0755 owner=root path=usr/lib
dir group=bin mode=0755 owner=root path=usr/lib/security
dir group=bin mode=0755 owner=root path=usr/lib/security/amd64      variant.arch=i386
dir group=bin mode=0755 owner=root path=usr/lib/security/sparcv9    variant.arch=sparc

<transform file -> default mode 0555>
<transform file -> default group bin>
<transform file -> default owner root>

# i386
file SOURCES/Linux-PAM/libs/intel/32/pam_access.so    path=usr/lib/security/pam_access.so          variant.arch=i386
file SOURCES/Linux-PAM/libs/intel/64/pam_access.so    path=usr/lib/security/amd64/pam_access.so    variant.arch=i386

# sparc
file SOURCES/Linux-PAM/libs/sparc/32/pam_access.so    path=usr/lib/security/pam_access.so          variant.arch=sparc
file SOURCES/Linux-PAM/libs/sparc/64/pam_access.so    path=usr/lib/security/sparcv9/pam_access.so  variant.arch=sparc

We can then publish the manifest by running:
$ pkgmogrify -D PKG_VERSION=1.1.6 -D PKG_RELEASE=1 SPECS/ms-pam-access.manifest | \
    pkgsend publish -s /path/to/IPS/repo
In practice this goes into a Makefile, so to publish a package one runs something like:
$ PUBLISH_REPO=file:///xxxxx/ gmake publish-ms-pam-access
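For illustration, the corresponding Makefile target could be as simple as this (a sketch - the target and variable names are assumptions, not our actual build files; recipe lines must be indented with a tab):

publish-ms-pam-access:
	pkgmogrify -D PKG_VERSION=1.1.6 -D PKG_RELEASE=1 SPECS/ms-pam-access.manifest | \
	    pkgsend publish -s $(PUBLISH_REPO)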
In cases where there are too many files to list them manually in the manifest, you can use pkgsend generate to produce the full list of file and directory actions. You then create a manifest with only the package metadata and the transformations (which put files in their proper locations, set the desired owner, group, etc.). To publish such a package one puts something like this into a Makefile:
$ pkgsend generate SOURCES/LWP/5.805 >BUILD/ms-perl-LWP.files
$ pkgmogrify -D PKG_VERSION=5 -D PKG_RELEASE=805 SPECS/ms-perl-LWP.p5m BUILD/ms-perl-LWP.files | \
    pkgsend publish -s /path/to/IPS/repo
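In this setup SPECS/ms-perl-LWP.p5m holds only the metadata and the transforms; the transforms then relocate the generated paths and normalize ownership, along these lines (a rough sketch - the install path and default attributes are assumptions):

<transform dir file link hardlink -> edit path ^lib usr/perl5/vendor_perl/5.12/lib>
<transform file -> default owner root>
<transform file -> default group bin>
<transform file -> default mode 0444>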

Friday, January 16, 2015

ZFS: Persistent L2ARC

Recently Oracle integrated persistent L2ARC into ZFS, and it is currently available in the ZFS-SA. It is not yet in Solaris 11, but it should be coming soon. Finally!

To make the very good news even better - it stores blocks in their raw format, so if for example you have compression enabled in your pool, then L2ARC stores the blocks compressed as well (and similarly for encryption). If your data compresses well, your L2ARC effectively becomes much bigger - for example, with a 2x compression ratio a 1 TB L2ARC device can hold roughly 2 TB worth of logical data.

Friday, January 09, 2015

Docker on SmartOS


Bryan blogged about running Linux Docker containers on SmartOS. Really cool. Now I would love to see something similar in Solaris 11...

Saturday, December 13, 2014

ZFS: RAID-Z Resilvering

Solaris 11.2 introduced a new ZFS pool version: 35, Sequential Resilver.

The new feature is supposed to make disk resilvering (disk replacement, hot-spare synchronization, etc.) much faster. It achieves this by reading some metadata ahead of time and then reading the data to be resilvered in a largely sequential manner. And it does work!
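If you want to check whether a system supports it, zpool upgrade -v lists the pool versions it knows about; trimmed output looks roughly like this (wording approximate, from memory):

# zpool upgrade -v
This system is currently running ZFS pool version 35.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
...
 29  RAID-Z/mirror hybrid allocator
...
 35  Sequential resilver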

Here is a real world case, with real data - over 150 million files of different sizes, most of them relatively small. Many of them had been deleted and new ones written over time, so I expect the data to be already fragmented in the pool. The server is a Sun/Oracle X4-2L with 26x 1.2TB 2.5" 10k SAS disks. The 24 disks in the front are presented in pass-thru mode and managed by ZFS, configured as 3 RAID-Z pools; the other 2 disks in the rear are configured as RAID-1 in the RAID controller and used for the OS. A disk in one of the pools failed, and a hot spare automatically attached:

# zpool status -x
  pool: XXXXXXXXXXXXXXXXXXXXXXX-0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri Dec 12 21:02:58 2014
    3.60T scanned
    45.9G resilvered at 342M/s, 9.96% done, 2h45m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXXXXX-0    DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            spare-0                  DEGRADED     0     0     0
              c0t5000CCA01D5EAE50d0  UNAVAIL      0     0     0
              c0t5000CCA01D5EED34d0  DEGRADED     0     0     0  (resilvering)
            c0t5000CCA01D5BF56Cd0    ONLINE       0     0     0
            c0t5000CCA01D5E91B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F9B00d0    ONLINE       0     0     0
            c0t5000CCA01D5E87E4d0    ONLINE       0     0     0
            c0t5000CCA01D5E95B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F8244d0    ONLINE       0     0     0
            c0t5000CCA01D58B3A4d0    ONLINE       0     0     0
        spares
          c0t5000CCA01D5EED34d0      INUSE
          c0t5000CCA01D5E1E3Cd0      AVAIL

errors: No known data errors

Let's look at the I/O statistics for the disks involved:

# iostat -xnC 1 | egrep "device| c0$|c0t5000CCA01D5EAE50d0|c0t5000CCA01D5EED34d0..."
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 16651.6  503.9 478461.6 69423.4  0.2 26.3    0.0    1.5   1  19 c0
 2608.5    0.0 70280.3    0.0  0.0  1.6    0.0    0.6   3  36 c0t5000CCA01D5E95B0d0
 2582.5    0.0 66708.5    0.0  0.0  1.9    0.0    0.7   3  39 c0t5000CCA01D5F9B00d0
 2272.6    0.0 68571.0    0.0  0.0  2.9    0.0    1.3   2  50 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  503.9    0.0 69423.8  0.0  9.7    0.0   19.3   2 100 c0t5000CCA01D5EED34d0
 2503.5    0.0 66508.4    0.0  0.0  2.0    0.0    0.8   3  41 c0t5000CCA01D58B3A4d0
 2324.5    0.0 67093.8    0.0  0.0  2.1    0.0    0.9   3  44 c0t5000CCA01D5F8244d0
 2285.5    0.0 69192.3    0.0  0.0  2.3    0.0    1.0   2  45 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 1997.6    0.0 70006.0    0.0  0.0  3.3    0.0    1.6   2  54 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 25150.8  624.9 499295.4 73559.8  0.2 33.7    0.0    1.3   1  22 c0
 3436.4    0.0 68455.3    0.0  0.0  3.3    0.0    0.9   2  51 c0t5000CCA01D5E95B0d0
 3477.4    0.0 71893.7    0.0  0.0  3.0    0.0    0.9   3  48 c0t5000CCA01D5F9B00d0
 3784.4    0.0 72370.6    0.0  0.0  3.6    0.0    0.9   3  56 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  624.9    0.0 73559.8  0.0  9.4    0.0   15.1   2 100 c0t5000CCA01D5EED34d0
 3170.5    0.0 72167.9    0.0  0.0  3.5    0.0    1.1   2  55 c0t5000CCA01D58B3A4d0
 3881.4    0.0 72870.8    0.0  0.0  3.3    0.0    0.8   3  55 c0t5000CCA01D5F8244d0
 4252.3    0.0 70709.1    0.0  0.0  3.2    0.0    0.8   3  53 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 3063.5    0.0 70380.1    0.0  0.0  4.0    0.0    1.3   2  60 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 17190.2  523.6 502346.2 56121.6  0.2 31.0    0.0    1.8   1  18 c0
 2342.7    0.0 71913.8    0.0  0.0  2.9    0.0    1.2   3  43 c0t5000CCA01D5E95B0d0
 2306.7    0.0 72312.9    0.0  0.0  3.0    0.0    1.3   3  43 c0t5000CCA01D5F9B00d0
 2642.1    0.0 68822.9    0.0  0.0  2.9    0.0    1.1   3  45 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  523.6    0.0 56121.2  0.0  9.3    0.0   17.8   1 100 c0t5000CCA01D5EED34d0
 2257.7    0.0 71946.9    0.0  0.0  3.2    0.0    1.4   2  44 c0t5000CCA01D58B3A4d0
 2668.2    0.0 72685.4    0.0  0.0  2.9    0.0    1.1   3  43 c0t5000CCA01D5F8244d0
 2236.6    0.0 71829.5    0.0  0.0  3.3    0.0    1.5   3  47 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 2695.2    0.0 72395.4    0.0  0.0  3.2    0.0    1.2   3  45 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 31265.3  578.9 342935.3 53825.1  0.2 18.3    0.0    0.6   1  15 c0
 3748.0    0.0 48255.8    0.0  0.0  1.5    0.0    0.4   2  42 c0t5000CCA01D5E95B0d0
 4367.0    0.0 47278.2    0.0  0.0  1.1    0.0    0.3   2  35 c0t5000CCA01D5F9B00d0
 4706.1    0.0 50982.6    0.0  0.0  1.3    0.0    0.3   3  37 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  578.9    0.0 53824.8  0.0  9.7    0.0   16.8   1 100 c0t5000CCA01D5EED34d0
 4094.1    0.0 48077.3    0.0  0.0  1.2    0.0    0.3   2  35 c0t5000CCA01D58B3A4d0
 5030.1    0.0 47700.1    0.0  0.0  0.9    0.0    0.2   3  33 c0t5000CCA01D5F8244d0
 4939.9    0.0 52671.2    0.0  0.0  1.1    0.0    0.2   3  33 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 4380.1    0.0 47969.9    0.0  0.0  1.4    0.0    0.3   3  36 c0t5000CCA01D5BF56Cd0
^C

These are pretty amazing numbers for RAID-Z - the only reason a single disk drive can do so many thousands of reads per second is that most of them must be almost perfectly sequential. From time to time I see even more amazing numbers:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 73503.1 3874.0 53807.0 19166.6  0.3  9.8    0.0    0.1   1  16 c0
 9534.8    0.0 6859.5    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5E95B0d0
 9475.7    0.0 6969.1    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5F9B00d0
 9646.9    0.0 7176.4    0.0  0.0  0.4    0.0    0.0   3  31 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0 3478.6    0.0 18040.0  0.0  5.1    0.0    1.5   2  98 c0t5000CCA01D5EED34d0
 8213.4    0.0 6908.0    0.0  0.0  0.8    0.0    0.1   3  38 c0t5000CCA01D58B3A4d0
 9671.9    0.0 6860.5    0.0  0.0  0.4    0.0    0.0   3  30 c0t5000CCA01D5F8244d0
 8572.7    0.0 6830.0    0.0  0.0  0.7    0.0    0.1   3  35 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 18387.8    0.0 12203.5    0.0  0.1  0.7    0.0    0.0   7  57 c0t5000CCA01D5BF56Cd0

It is really good to see the new feature work so well in practice. This feature is what makes RAID-Z much more usable in many production environments. The other feature which complements this one, and also makes RAID-Z much more practical to use, is the RAID-Z/mirror hybrid allocator introduced in Solaris 11 (pool version 29). It makes accessing metadata in RAID-Z much faster.

Both features are only available in Oracle Solaris 11 and not in OpenZFS derivatives, although OpenZFS has its own interesting new features as well.

Friday, December 05, 2014

ZFS Performance Improvements

One of the blogs I value is Roch Bourbonnais's. He hasn't blogged much in quite some time, but he is back to blogging! He has listed the main ZFS performance improvements since Oracle took over, and also provided more details on ReARC. There is more to come from him, hopefully soon.

PS. To make it clear: he is describing improvements to Oracle's ZFS, as found in Solaris 11 and the ZFS-SA, not OpenZFS.

Friday, November 14, 2014

Kernel Zones Internals

There is a very good presentation describing how Kernel Zones are implemented, from an architectural point of view.

Kernel Zones: root disk recovery

I've been playing with Solaris Kernel Zones recently. I installed a zone named kz1 on an iSCSI device so I could also test live migration. At some point I wanted to modify the contents of the kz1 root disk outside of the zone, so I imported the pool in the global zone - while kz1 was shut down, of course. This worked fine (zpool import -t ...). However, the zone then crashed on boot as it couldn't import its pool. The reason is that when a pool is imported on another system, the phys_path in its ZFS label is updated, and when a kernel zone boots it tries to import its root pool based on that phys_path, which may now be invalid - which was exactly the case for me. The root disk of the kz1 zone was configured as:

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set bootpri=0
set id=0
end

This results in a phys_path of /zvnex/zvblk@0:b, as the disk driver is virtualized in Kernel Zones. After the pool was imported in the global zone, the phys_path was updated to /scsi_vhci/disk@g600144f0a613c900000054521d550001:b, which won't work in the kernel zone. One way to work around the issue is to create another "rescue" zone with its root disk's id set to 1, and then add the root disk of kz1 to it as an additional, non-bootable disk with id=0:
 
add device
set storage=dev:zvol/dsk/rpool/kz2
set bootpri=0
set id=1
end

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set id=0
end
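Putting it together, the rescue zone (kz2 here) can be created with zonecfg along these lines - a sketch only, as the handling of the template's default boot device (which I assume sits at id=0) may differ on your setup:

# zonecfg -z kz2
zonecfg:kz2> create -t SYSsolaris-kz
zonecfg:kz2> remove device id=0
zonecfg:kz2> add device
zonecfg:kz2:device> set storage=dev:zvol/dsk/rpool/kz2
zonecfg:kz2:device> set bootpri=0
zonecfg:kz2:device> set id=1
zonecfg:kz2:device> end
zonecfg:kz2> add device
zonecfg:kz2:device> set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
zonecfg:kz2:device> set id=0
zonecfg:kz2:device> end
zonecfg:kz2> commit
zonecfg:kz2> exit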

Now in order to update the phys_path to the correct one, the pool needs to be imported and exported in the rescue zone:
 
# zpool import -R /mnt/1 -Nf -t kz1 rpool
# zpool export kz1

Notice that the rescue zone doesn't even need to have a network configured - in fact, all you need is a minimal OS installation with almost everything disabled, and you log in to it via zlogin.

The kz1 zone will now boot just fine. In most cases you shouldn't need the above procedure - you should be able to do all the customizations via AI manifests, system profiles and SMF. But in some cases it is useful to be able to manipulate the root disk contents of a kernel zone without actually booting it, or perhaps you need to recover its rpool after it has become unbootable.

Monday, November 03, 2014

Rollback ZFS volume exported over iSCSI

While playing with Kernel Zones on Solaris 11.2 I noticed that once a ZFS volume is shared over iSCSI, I can create a snapshot of it but I can't roll it back - I get a "volume is busy" error message. I found a way to do it:

# stmfadm delete-lu 600144F0A613C900000054521D550001
# zfs rollback pool/idisk0@snap1
# stmfadm import-lu /dev/zvol/rdsk/pool/idisk0
Logical unit imported: 600144F0A613C900000054521D550001
# stmfadm add-view 600144F0A613C900000054521D550001

Although this should be easier...
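If you need to do this regularly, the sequence is easy to wrap in a tiny script - a minimal sketch reusing exactly the commands above (the GUID, volume and snapshot names are just the ones from this example):

#!/bin/sh
# Roll back a ZFS volume that is exported over iSCSI.
GUID=600144F0A613C900000054521D550001
VOL=pool/idisk0
SNAP=snap1

# Temporarily remove the logical unit so the volume is no longer busy.
stmfadm delete-lu $GUID
zfs rollback $VOL@$SNAP
# Re-import the LU from the zvol (it keeps its GUID) and re-add the view.
stmfadm import-lu /dev/zvol/rdsk/$VOL
stmfadm add-view $GUID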

The iSCSI LUN contains a kernel zone image taken just after it was installed. However, now I couldn't boot the zone:

# zoneadm -z kz1 attach
zone 'kz1': error: Encryption key not available. See solaris-kz(5) for configuration migration
zone 'kz1': procedure or restore /etc/zones/keys/kz1.

Right, the man page explains it all - there is host metadata needed to boot a kernel zone, and it is encrypted. Since I rolled back the ZFS volume to a previous installation, the encryption key stored in the zone's configuration is no longer valid. I had to re-create the host data:

# zoneadm -z kz1 attach -x initialize-hostdata
# zoneadm -z kz1 boot

And now it booted just fine.