Tuesday, May 21, 2013

Setting RPATH

Today I was made aware that the elfedit tool in Solaris 11 allows setting RPATH (among other things). The only caveat is that the binary must have been linked on Solaris 11. It is very easy to use:

 # elfedit -e 'dyn:runpath $ORIGIN/../lib' /opt/bin/myprog 
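
To make the $ORIGIN token a bit more concrete: at load time it stands for the directory containing the object being loaded. The sketch below only illustrates that substitution as plain string handling; expand_origin() is a hypothetical helper written for this post, not something elfedit or the runtime linker exposes.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative only: replace every "$ORIGIN" token in a runpath entry with
 * the directory part of exe_path. Hypothetical helper, not a system API.
 * Returns 0 on success, -1 if exe_path contains no '/' or out is too small.
 */
int expand_origin(const char *runpath, const char *exe_path,
                  char *out, size_t outlen)
{
    static const char token[] = "$ORIGIN";
    const size_t toklen = sizeof(token) - 1;
    const char *slash;
    size_t dirlen, used = 0;

    assert(runpath != NULL && exe_path != NULL && out != NULL);

    slash = strrchr(exe_path, '/');
    if (slash == NULL)
        return -1;
    /* Directory part of the path; "/" when the file sits in the root. */
    dirlen = (slash == exe_path) ? 1 : (size_t)(slash - exe_path);

    while (*runpath != '\0') {
        if (strncmp(runpath, token, toklen) == 0) {
            /* Keep one byte free for the terminating NUL. */
            if (used + dirlen >= outlen)
                return -1;
            memcpy(out + used, exe_path, dirlen);
            used += dirlen;
            runpath += toklen;
        } else {
            if (used + 1 >= outlen)
                return -1;
            out[used++] = *runpath++;
        }
    }
    if (used >= outlen)
        return -1;
    out[used] = '\0';
    return 0;
}
```

With the runpath from the command above and /opt/bin/myprog as the executable, the helper produces /opt/bin/../lib, which is where the libraries would be searched for.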

There is a nice blog entry about it from Ali Bahrami.

Friday, March 22, 2013

OpenAFS on Solaris 11 x86

Two days ago I presented at the UK Solaris SIG meeting on running OpenAFS on Solaris 11 x86. This is essentially the same talk I gave last year in Edinburgh; I just added a few slides explaining what OpenAFS is.

Tuesday, March 05, 2013

ZFS: no-op overwrites

There is an interesting new feature in ZFS in Illumos.

https://www.illumos.org/issues/3236
When overwriting a block which is check summed with a cryptographically secure hash function we can compare the old and new checksums for the block to determine if they differ (at almost no cost since we were going to do the checksums anyway). If they do not differ we don't actually need to do the write. This:
1) Reduces I/O
2) Reduces space usage, because if the old block is referenced by a snapshot we will need to keep both copies of the block around even though they contain the same data.
This functionality is only enabled if:
1) The old and new blocks are checksummed using the same algorithm.
2) That algorithm is cryptographically secure (e.g. sha256)
3) Compression is enabled on that block.
Philosophical question - should we just trust sha256?
(It seems this can't be disabled, nor is there an option similar to verify=on in dedup.)
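
The three conditions quoted above boil down to a small decision function. What follows is a minimal sketch of that logic only: the enum, the struct and can_skip_write() are simplified stand-ins invented for illustration, not the actual ZFS types or code, and treating "compression is enabled" as a flag on the new block is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are not the ZFS kernel types. */
enum cksum_alg { CKSUM_FLETCHER4, CKSUM_SHA256 };

struct block_info {
    enum cksum_alg alg;   /* checksum algorithm used for the block */
    bool compressed;      /* compression enabled for the block */
    uint64_t cksum[4];    /* 256-bit checksum value */
};

/* Only sha256 is treated as cryptographically secure in this sketch. */
static bool alg_is_secure(enum cksum_alg alg)
{
    return alg == CKSUM_SHA256;
}

/* Return true when the overwrite can be skipped as a no-op. */
bool can_skip_write(const struct block_info *old_blk,
                    const struct block_info *new_blk)
{
    assert(old_blk != NULL && new_blk != NULL);

    if (old_blk->alg != new_blk->alg)   /* 1) same algorithm */
        return false;
    if (!alg_is_secure(new_blk->alg))   /* 2) cryptographically secure */
        return false;
    if (!new_blk->compressed)           /* 3) compression enabled */
        return false;

    /* Identical checksums are taken to mean identical data. */
    return memcmp(old_blk->cksum, new_blk->cksum,
                  sizeof(old_blk->cksum)) == 0;
}
```

Note that the final memcmp() is exactly where the "should we just trust sha256?" question lives: nothing in this path compares the data itself.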

There are more interesting new ZFS features in Illumos (for example this one which does a similar thing to what Solaris 11 does). The only regret is that, unless one wants to play with one of the appliances based on Illumos, the only way to use these features is to use FreeBSD or Linux, which is rather ironic. But on the other hand - why not? At least at home.
 

Friday, November 09, 2012

vmtasks explained

Solaris 11 introduced a new kernel process called "vmtasks" which accelerates some operations when working with shared memory. For more details see here.

20 Years of Solaris

Nice video from Oracle celebrating 20 years of Solaris.

Tuesday, October 23, 2012

Running OpenAFS on Solaris 11 x86 / ZFS

Recently I gave a talk on running OpenAFS services on top of Solaris 11 x86 / ZFS. The talk was split in two parts. The first part is about the $$ benefits of transparent ZFS compression when running on 3rd party x86 hardware (but it also makes sense when running on Sun/Oracle kit - in some cases even more so). This part also discusses some ideas about running AFS on internal disks instead of directly attached disk arrays, which again, thanks to ZFS built-in compression, is worthwhile and delivers even more $$ savings.

The main message of this part is that if your data compresses well (above 2x), running OpenAFS on ZFS can deliver similar or even better performance, but most importantly it can save you lots of $$, both in acquisition costs and in the cost of running an AFS plant. In most cases you should even be able to re-use the x86 hardware you currently have. The beauty of AFS is that we were able to migrate data from Linux to Solaris/ZFS in-place, re-using the same x86 HW, and all of this was completely transparent to all clients (keep in mind we are talking about PBs of data) - this is truly the cloud file system. I think OpenAFS is one of the under-appreciated technologies in the market.

The second part is about using DTrace, both in dev and in production systems, to find scalability and performance bottlenecks, and other bugs as well. Two easy, real-life examples are discussed, which resulted in considerable improvements in the scalability and performance of some operations in OpenAFS, along with some other examples of D scripts which provide top-like output with some statistics (slide #32 is an example from a Solaris NFS server, serving VMware clients and displaying different stats per VM from a single file system...). DTrace has proven to be a very powerful and helpful tool for us, although it is hard to put a specific $ value on what it brings.

The slides should be available here.

Wednesday, August 29, 2012

Open Indiana is dead

With the main guy behind the project resigning, OI is essentially dead. It's been dead for some time though, and with no commercial backing it never really had much of a chance. This is sad news indeed (although I haven't really used OI). It marks the end of the OpenSolaris era.

Can Illumos survive in the long term? Can it become relevant outside of a couple of niche use cases?

Ironically, it is Oracle's Solaris which will probably outlive all of them.

Friday, July 27, 2012

Locking a running process in RAM

Recently I was looking at the possibility of locking some or all memory mappings of an already running process in RAM. Why would you want to do that? There might be many reasons. One of them is to prevent a critical process from being swapped out.

Now, how can we do it without restarting the process or changing its code? There is memcntl() and similar calls like plock(), mlock(), etc. - but they only work for the process which is calling them. However, libproc on Solaris allows you to run memcntl() in the context of another process, among many other cool things. By the way - libproc is used by tools like truss, ppgsz, etc.
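
For contrast, this is roughly what the "normal" case looks like, where a process locks its own memory. It is a minimal sketch using the POSIX mlock() and munlock() calls; lock_pages() is a hypothetical helper name, and the call is expected to fail with an errno value such as EPERM when the caller is not allowed to lock memory.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate npages page-aligned pages, lock them in RAM, then unlock and
 * free them again. Hypothetical helper for illustration only.
 * Returns 0 on success or an errno value on failure.
 */
int lock_pages(size_t npages)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t len;
    int err;

    assert(npages > 0);
    if (pagesize <= 0)
        return EINVAL;
    len = npages * (size_t)pagesize;

    /* mlock() works on whole pages, so start from an aligned buffer. */
    err = posix_memalign(&buf, (size_t)pagesize, len);
    if (err != 0)
        return err;

    /* This only ever affects the calling process. */
    if (mlock(buf, len) != 0) {
        err = errno;
        free(buf);
        return err;
    }

    /* ... the pages would be used here while they stay resident ... */

    (void) munlock(buf, len);
    free(buf);
    return 0;
}
```

There is no PID argument anywhere in these calls, which is why something like libproc is needed to do the same for a process that is already running.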

First, let's see that it actually works. Here is a bash process running as a non-root user; notice that no mappings are locked in RAM.
cwafseng3 $ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -       - r-x--  bash
08151000      76      76       8       - rwx--  bash
08164000     140     140      56       - rwx--    [ heap ]
F05D0000     440     440       -       - r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -       - r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      16       -       - rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      12       4       - rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -       - r-x--  libc_hwcap1.so.1
FE702000      44      44      16       - rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       - rwx--  libc_hwcap1.so.1
FE710000       4       4       -       - r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -       - r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -       - r-x--  ld.so.1
FE7FB000       8       8       4       - rwx--  ld.so.1
FE7FD000       4       4       -       - rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3620     116       -
Now I will use a small tool I wrote to lock all mappings with RX or RWX protections on them. The tool requires a PID to be specified.
$ ./pr_memcntl 27709
pr_memcntl() failed: Not owner
Although I ran it as root, it failed. Remember that libproc calls memcntl() from the context of the target process. The target process here is bash with PID 27709, which is running as a standard user, and by default a standard user cannot lock pages in memory. We can add the required privilege for locking pages in RAM, but first let's see what is missing. I enabled privilege debugging for the bash process and ran the pr_memcntl tool again:
$ ppriv -D 27709
bash[27709]: missing privilege "proc_lock_memory"
            (euid = 145104, syscall = 131) needed at memcntl+0x140
Now let's add the missing privilege to the bash process and then try locking the mappings again:
$ ppriv -s EP+proc_lock_memory 27709
$ ./pr_memcntl 27709
$
This time there is no error. Let's see the pmap output again:
$ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -     968 r-x--  bash
08151000      76      76       8      76 rwx--  bash
08164000     140     140      48     140 rwx--    [ heap ]
F05D0000     440     440       -     440 r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -      56 r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      64       -      64 rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      24       4      24 rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -    1352 r-x--  libc_hwcap1.so.1
FE702000      44      44      16      44 rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       4 rwx--  libc_hwcap1.so.1
FE710000       4       4       -       4 r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -     184 r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -     220 r-x--  ld.so.1
FE7FB000       8       8       4       8 rwx--  ld.so.1
FE7FD000       4       4       -       4 rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3680     108    3588
It works! :)

Notice that only mappings with RX or RWX protections are locked (as hard-coded in the tool, but any mappings or the entire process can be locked if desired). The tool can easily be extended so that it automatically adds the privilege if needed.

The C code below is a prototype - it is not idiot proof, it does not handle all errors, and it could be more user friendly - but it works and it is trivial to extend. It should work on Solaris 10 and Solaris 11 and also on all Illumos based distributions.

Notice that it sets the MCL_CURRENT|MCL_FUTURE flags, meaning that not only the current rx|rwx mappings are locked, but all future mappings with rx|rwx protections will be locked as well. In order to compile the program you need libproc.h, which is currently not distributed with Solaris (hopefully that will change soon). You can get a copy from here.

Some ideas on how the tool could be easily extended:
  • add an option to add the proc_lock_memory automatically (and perhaps update resource limit as well)
  • add an option to remove a lock on specific mapping or from all mappings
  • add an option to specify what should be locked
  • add options to specify whether only current mappings, or future mappings as well, should be locked

ps. Putting a process in the RT class would achieve a similar result, although it wouldn't give control over which mappings should be locked, and running in RT might not be desirable for other reasons as well.
 

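// gcc -m64 -I. -o pr_memcntl pr_memcntl.c -lproc

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#include <libproc.h>


int main(int argc, char **argv) {
  pid_t pid;
  int perr;
  struct ps_prochandle *Pr;

  /* The PID of the target process is the only, mandatory, argument. */
  if(argc != 2) {
    fprintf(stderr, "usage: %s pid\n", argv[0]);
    exit(1);
  }
  pid = atoi(argv[1]);

  /* Grab the target process without stopping it. */
  if((Pr = Pgrab(pid, PGRAB_NOSTOP, &perr)) == NULL) {
    fprintf(stderr, "Pgrab() failed: %s\n", Pgrab_error(perr));
    exit(1);
  }

  /* Lock current and future r-x mappings; memcntl() runs in the context of the target. */
  if(pr_memcntl(Pr, 0, 0, MC_LOCKAS, (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT, (int)0)) {
    perror("pr_memcntl() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  /* Do the same for rwx mappings. */
  if(pr_memcntl(Pr, 0, 0, MC_LOCKAS, (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT|PROT_WRITE, (int)0)) {
    perror("pr_memcntl() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  /* Release the target process and leave it running. */
  Prelease(Pr, 0);
  Pr = NULL;

  exit(0);
}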