Grid site with dynamic worker nodes (on Scientific Linux 5)

Because we had frequent worker node upgrades and new additions, we needed a system that bypasses the individual installation of each worker node and also lets us upgrade all of the existing worker nodes in one go.

Basically, we want to achieve a “plug-and-play” worker node infrastructure: the root filesystem comes over NFS, while each node uses its own HDD for job storage and processing.

Requirements:

  • TFTP and DHCP server (with static IP-MAC binding)
  • NFS server
  • An up-to-date worker node system from which we’ll build the NFS root filesystem export for the worker nodes

The overview of the whole process is the following:

  1. Add a new physical worker node (WN)
  2. Add its MAC address to the DHCP server configuration file, assigning a fixed IP address
  3. Power on the WN
  4. If the HDD is blank, partition and format it, then populate it with the needed directories (/var, /tmp and /home)
  5. Configure the WN (NFS root filesystem, HDD mount points, hostname etc.) based on its IP address
  6. The node is ready to run as if it had been installed normally

The actual configuration of the infrastructure

We won’t go into details about setting up a TFTP and DHCP server, but it should be clearly stated that each WN must have its MAC address bound to a fixed IP in the DHCP server’s configuration file:

host my_new_wn {
    ....
    hardware ethernet ae:d6:90:bb:aa:1d;
    fixed-address 172.24.10.100;
    ....
}
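
For network booting, the DHCP server must also point the WNs at the TFTP server (with ISC dhcpd, the next-server and filename "pxelinux.0" options), and the TFTP server needs a PXE boot entry for the kernel and the initrd we’ll build below. A minimal pxelinux.cfg/default sketch (the file names are placeholders, adjust to your setup):

DEFAULT wn
LABEL wn
    KERNEL vmlinuz-wn
    APPEND initrd=initrd.img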

Also, you need to modify the IPs and paths in the presented examples to match your environment.

The worker node must have its root filesystem on the NFS share. In order to achieve this, we must deal with two steps:

  • NFS server configuration
  • WN boot process modification with initrd customization

As stated in the requirements, we need an image of the root filesystem that each WN will use as its own. To build it, we use a normal, up-to-date worker node from which we copy the whole filesystem and then export it through NFS.

In our case we’ve used /opt/nfs_export and exported it with this entry in /etc/exports:

/opt/nfs_export         172.24.10.0/24(sync,no_subtree_check,ro,no_root_squash)
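
After editing /etc/exports, apply and verify the export with the standard nfs-utils commands:

exportfs -ra                # re-export everything in /etc/exports
showmount -e localhost      # confirm /opt/nfs_export is listed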

The synchronization of the reference WN to the export directory was done over ssh with rsync:

rsync -av --one-file-system -e ssh root@REFERENCE_WN:/ /opt/nfs_export/
cd /opt/nfs_export
mv var var.ro
mkdir var
cd etc/sysconfig
rm network
ln -s ../../var/network network

We’ve moved the var directory because each WN will have its own /var, /tmp and /home mounted locally. Later on, we’ll populate /var on each WN with the contents of var.ro.
Because we need each worker node to have its own hostname, it’s necessary to store this information locally on each WN (we’ve chosen /var/network as the path) and to make the actual configuration file /etc/sysconfig/network a symlink to it.
Otherwise, each WN would use the same /etc/sysconfig/network and we would end up with all the WNs having the same hostname.

The second step is to modify the boot process of the worker nodes so that they get an IP address, mount the filesystem exported above and switch root into it.

Extract the initrd image (from /boot) into a temporary directory:

mkdir /tmp/1; cd /tmp/1
cp /boot/THE_INITRD_IMAGE .
gunzip < THE_INITRD_IMAGE | cpio -i --make-directories

Now we have the contents of the initrd and we can examine and modify the boot sequence via the “init” script in the current directory. As you may notice, there aren’t many libraries or executables available inside it, so we need to add some utilities and modules to the initrd image (a sketch of copying them in follows the list). In our case we had to add:

  • busybox – a lot of useful commands
  • network card module
  • nfs modules
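
How these end up in the initrd tree depends on your kernel and NIC. A sketch, assuming an SL5 2.6.18 kernel, the pcnet32 NIC used in this example and a static busybox binary at /sbin/busybox (the module paths are assumptions, check them on your reference WN):

# run from /tmp/1, the unpacked initrd tree
KVER=$(uname -r)
cp /lib/modules/$KVER/kernel/drivers/net/mii.ko       lib/
cp /lib/modules/$KVER/kernel/drivers/net/pcnet32.ko   lib/
cp /lib/modules/$KVER/kernel/net/sunrpc/sunrpc.ko     lib/
cp /lib/modules/$KVER/kernel/fs/lockd/lockd.ko        lib/
cp /lib/modules/$KVER/kernel/fs/nfs_common/nfs_acl.ko lib/
cp /lib/modules/$KVER/kernel/fs/nfs/nfs.ko            lib/
cp /sbin/busybox bin/
# busybox provides sh, ifconfig, udhcpc, sleep etc. as applet symlinks
for app in sh ifconfig udhcpc sleep; do ln -sf busybox bin/$app; done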

Then, we’ve modified the standard mounting procedure from:

echo Scanning and configuring dmraid supported devices
echo Scanning logical volumes
lvm vgscan --ignorelockingfailure
echo Activating logical volumes
lvm vgchange -ay --ignorelockingfailure  VolGroup00
resume /dev/VolGroup00/LogVol01
echo Creating root device.
mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
echo Mounting root filesystem.
mount /sysroot

To:

/bin/insmod /lib/mii.ko
/bin/insmod /lib/pcnet32.ko
/bin/insmod /lib/sunrpc.ko
/bin/insmod /lib/lockd.ko
/bin/insmod /lib/nfs_acl.ko
/bin/insmod /lib/nfs.ko
/bin/ifconfig eth0 up
/bin/udhcpc -q -s /udhcpc.script
sleep 1
mount -o ro,proto=tcp,nolock 172.24.10.1:/opt/nfs_export /sysroot

So, we’ve deleted the part that uses the HDD for the root filesystem and instead we mount the root filesystem from the NFS export. Before that, we load the necessary modules and request an IP address via DHCP.
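
The /udhcpc.script referenced above is the hook that busybox’s udhcpc runs to apply the lease; the original isn’t shown, so here is a minimal sketch (udhcpc passes the event name in $1 and exports variables such as $interface, $ip and $subnet). It must be executable and placed at /udhcpc.script in the initrd tree:

#!/bin/sh
# minimal udhcpc event handler (hypothetical reconstruction)
case "$1" in
    deconfig)
        # called before requesting a lease: bring the interface up unconfigured
        /bin/ifconfig "$interface" 0.0.0.0
        ;;
    bound|renew)
        # lease obtained or renewed: configure the interface
        /bin/ifconfig "$interface" "$ip" netmask "$subnet"
        ;;
esac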

Next, we need to repack the initrd image and copy it to the TFTP server directory (tweak the paths according to your setup):

find . | cpio --create --format=newc | gzip > ../initrd.img
cp initrd.img /tftpboot

At this point, every WN that is set to boot from the network should boot, mount the root filesystem from the NFS server and continue with normal booting.

Of course, there are some more issues that need addressing:

  • preformatting the HDD if necessary and copying over some directories (in our specific case)
  • setting the hostname correctly

In order to achieve this, we need to add some extra steps to the boot sequence of the OS. For that, we call our own helper script from /opt/nfs_export/etc/rc.sysinit, right after udev has started:

/sbin/start_udev
#----FROM HERE--------
/etc/setup.sh
if [ -f /etc/sysconfig/network ]; then
    . /etc/sysconfig/network
fi
#------TO HERE-------
# Load other user-defined modules
for file in /etc/sysconfig/modules/*.modules ; do
  [ -x $file ] && $file
done

The /etc/setup.sh script contains the following:

#!/bin/bash
# fdisk answers: three primary partitions of ~6 GB, ~5 GB and ~51 GB
PARTDEF="n\np\n1\n \n+6000M\nn\np\n2\n \n+5000M\nn\np\n3\n \n+51000M\nw\n"

function initial_setup()
{
    echo "Performing initial setup on $1"
    echo -ne "$PARTDEF" | fdisk "$1"
    sleep 10
    partprobe
    sleep 10
    mke2fs -j "$1"1
    mke2fs -j "$1"2
    mke2fs -j "$1"3
    sleep 1
    mount "$1"1 /var
    cp -aR /var.ro/* /var/
    # extract eth0's IP from ifconfig ("inet addr:x.x.x.x") and look up its hostname in /etc/hosts
    IP=$(ifconfig eth0 | grep 'inet addr' | tr ':' ' ' | awk '{ print $3 }')
    echo "HOSTNAME=$(grep -w "$IP" /etc/hosts | awk '{ print $2 }')" >> /var/network
    umount /var
}

# pick the first disk present; run initial_setup only if it has no partitions yet
if [ -e /dev/sda ]; then
    DEVICE="/dev/sda"
    [ -e /dev/sda1 ] || initial_setup "$DEVICE"
elif [ -e /dev/hda ]; then
    DEVICE="/dev/hda"
    [ -e /dev/hda1 ] || initial_setup "$DEVICE"
fi

echo "Storage on $DEVICE"

mount "$DEVICE"1 /var
mount "$DEVICE"2 /tmp
mount "$DEVICE"3 /home

if [ -e /var/wipe ]; then
    echo "Wipe HDD requested :)"
    sleep 10
    #echo b > /proc/sysrq-trigger
    #to be done
fi

# refresh the HOSTNAME line in /var/network from /etc/hosts (in case the node's IP changed)
grep -v "HOSTNAME" /var/network > /var/network1
IP=$(ifconfig eth0 | grep 'inet addr' | tr ':' ' ' | awk '{ print $3 }')
echo "HOSTNAME=$(grep -w "$IP" /etc/hosts | awk '{ print $2 }')" >> /var/network1
mv -f /var/network1 /var/network
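
Note that the helper has to live inside the exported tree and be executable for the rc.sysinit hook to find it; on the NFS server (using the paths from our setup):

cp setup.sh /opt/nfs_export/etc/setup.sh
chmod +x /opt/nfs_export/etc/setup.sh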

So, what does this script do? It obviously needs adaptation for each environment, but in our case:

  • It checks whether we have /dev/sda or /dev/hda as the HDD
  • It checks whether there are any partitions on it
  • If not, it creates /dev/sda1-3 (or /dev/hda1-3) of roughly 6 GB, 5 GB and 51 GB, formats them with ext3 and copies the original /var contents from /var.ro
  • It mounts those partitions on /var, /tmp and /home
  • It sets the hostname according to what it finds in /etc/hosts for the current IP address (so the hosts file must be kept up to date; see the example after this list). Remember that /etc/sysconfig/network is a symlink to /var/network because each WN must have a different configuration file, and that file can only live in a node-local read-write directory.
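
To make the hostname lookup concrete, a hypothetical example: if the NFS root’s /etc/hosts contains the entry below and a WN leases 172.24.10.100, the script ends up writing the HOSTNAME line shown (the names are made up):

# /etc/hosts entry (on the NFS root)
172.24.10.100   wn100.mygrid.example   wn100

# resulting line in /var/network on that node
HOSTNAME=wn100.mygrid.example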

After this whole process, installing and updating the WNs at our grid site is a lot simpler. Now we only have to add the new worker node’s MAC address to the DHCP server’s config file and its IP address to the hosts file, then power up the WN. It formats itself and comes up just as a normally installed WN would.

There are plenty of tweaks that could still be done, but we’ve simplified the process and got it to a point where it simply does what it’s supposed to do.
