SysAdmin to SysAdmin: Using RAID with PVFS under ROCKS


Author: Preston St. Pierre

I administer a newly deployed ROCKS
compute cluster, and I use the Parallel Virtual File System (PVFS), which comes
with the ROCKS Linux distribution, to provide a parallel IO system. If you're
not familiar with either, check out my earlier ROCKS article, as well as my
earlier article about PVFS.

My cluster is slightly older hardware: dual Pentium IIIs, each machine with two
hard drives. Initially, I thought having two drives was great news, because I
could combine all of the capacity of the second drive with the unused capacity
of the first to provide a large amount of scratch space to the cluster users,
some of whom would be more than happy to have it. What I didn't realize,
however, was that PVFS can only use a single mount point for its data storage
needs; I couldn't tell PVFS to use both /dev/hda3 and /dev/hdb1. Then someone on
the ROCKS list suggested I consider using RAID, though he hadn't tried it
himself. I was game, and it works wonderfully. So here's how I did it.
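
For concreteness, the reason a second disk can't just be handed to PVFS is that
each IO daemon reads a single data directory from its iod.conf file. The sketch
below shows roughly what that file looks like; it's illustrative rather than a
copy of my config, so double-check the key names against the iod.conf that
ROCKS actually installed on your nodes.

# /etc/iod.conf (illustrative sketch; verify key names against your install)
port 7000
user nobody
group nobody
rootdir /
datadir /pvfs-data   # only one data directory per IO daemon
logdir /tmp
debug 0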

Moving the metadata server off the head node

On a ROCKS cluster, the head node is called frontend-0-0. In testing, this
is where my PVFS metadata server lived. However, the frontend also serves up
home directories to the rest of the cluster, and handles intercommunications
between the scheduler and workload management daemons across the cluster. It
also gathers statistics on the cluster, pushes out administrative changes to
the nodes, and runs a web server. That’s more than enough load without
coordinating all of the PVFS clients and matching their requests with the 16
PVFS IO nodes. On frontend-0-0, I just turned off the "mgr" and "pvfsd"
services. I'll turn the "pvfsd" service back on after I configure the new
metaserver, which will allow frontend-0-0 to act as a PVFS client. It needs to
be a client so that users can collect all of their data from there without
logging into the other nodes.
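
For the record, that service shuffle on the frontend is just a couple of
standard Red Hat-style init script calls. Here's a minimal sketch, assuming the
init scripts are named "mgr" and "pvfsd" on your install as they are on mine:

# On frontend-0-0: stop the metadata manager and the PVFS client daemon,
# and keep them from starting at boot until the new metaserver is in place
service mgr stop
service pvfsd stop
chkconfig mgr off
chkconfig pvfsd off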

The new metadata server is pvfs-0-0, which (as the name implies) is also a
PVFS IO node. It has two physical disks installed, as do all of the PVFS IO
nodes, so I'll combine them into a single large volume (a RAID 0-style setup,
built with the Linux LVM tools) to maximize capacity, and then configure the
metadata server.

Setting up RAID

First, I took a look at the state of things in diskland on what will
become the new PVFS metaserver. It will also be an IO node.

 
[root@pvfs-0-0 init.d]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdg1             5.8G  2.2G  3.4G  40% /
/dev/hde1              57G  8.8G   45G  17% /state/partition2
/dev/hdg3              50G   33M   48G   1% /state/partition1
none                  501M     0  501M   0% /dev/shm

I'm going to use /dev/hde1 and /dev/hdg3 in my RAID 0 volume. Having never
used the LVM tools on these systems, I first needed to run vgscan,
which creates the /etc/lvmtab files that the later commands rely on. Then I
immediately ran pvcreate, which initializes the physical disk
partitions you specify as LVM physical volumes.

[root@pvfs-0-0 init.d]# vgscan
vgscan -- reading all physical volumes (this may take a while...)
vgscan -- "/etc/lvmtab" and "/etc/lvmtab.d" successfully created
vgscan -- WARNING: This program does not do a VGDA backup of your volume
group

[root@pvfs-0-0 init.d]# pvcreate /dev/hde1 /dev/hdg3
pvcreate -- physical volume "/dev/hde1" successfully created
pvcreate -- physical volume "/dev/hdg3" successfully created

Next I have to create a "volume group", which takes the physical volumes and
essentially puts them under a unified label. In this case, I'm naming the
volume group "pvmeta".

[root@pvfs-0-0 init.d]# vgcreate pvmeta /dev/hde1 /dev/hdg3
vgcreate -- INFO: using default physical extent size 32 MB
vgcreate -- INFO: maximum logical volume size is 2 Terabyte
vgcreate -- doing automatic backup of volume group "pvmeta"
vgcreate -- volume group "pvmeta" successfully created and activated

Now it’s time to actually create a logical volume — something that can be
treated like a regular old disk partition, so I can go about the business of
creating a filesystem, assigning a mount point, etc.

[root@pvfs-0-0 init.d]# lvcreate -L 100G pvmeta
lvcreate -- doing automatic backup of "pvmeta"
lvcreate -- logical volume "/dev/pvmeta/lvol1" successfully created

With that out of the way, it’s time to put a filesystem on the new volume so
that PVFS can use it for data storage. This is done using the standard
mkfs command, just as you would do with any other partition.

[root@pvfs-0-0 init.d]# mkfs -t ext3 /dev/pvmeta/lvol1
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13107200 inodes, 26214400 blocks
1310720 blocks (5.00%) reserved for the super user
First data block=0
800 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

All that's left now is to mount our new volume under the
/pvfs-data directory, and check to make sure everything worked.
One thing you'll notice is that you don't get the full 100GB of space you
asked for when you created the volume. That's filesystem overhead: ext3 sets
aside room for its inode tables and journal, and reserves 5% of the blocks for
the superuser, which accounts for the gap between the 99GB reported size and
the 94GB available. The rest of the volume is free for PVFS to fill with the
data chunks it manages on this machine on behalf of clients throughout the
cluster.

[root@pvfs-0-0 /]# mount /dev/pvmeta/lvol1 /pvfs-data
[root@pvfs-0-0 /]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdg1             5.8G  2.2G  3.4G  40% /
/dev/hde1              16T   16T     0 100% /state/partition2
/dev/hdg3              16T   16T     0 100% /state/partition1
none                  501M     0  501M   0% /dev/shm
/dev/pvmeta/lvol1      99G   33M   94G   1% /pvfs-data
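
Since I'll be rebooting these nodes before they go anywhere near production, I
also want this mount to come back on its own. A minimal sketch of the
/etc/fstab entry for that (standard fstab syntax; tune the options to your
liking):

# /etc/fstab: remount the PVFS data volume at boot
/dev/pvmeta/lvol1   /pvfs-data   ext3   defaults   0 0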

Using the ROCKS tools to propagate changes

This is easy enough to set up on a single box, but now what? I have 16 IO
nodes! Certainly there's some way to automate this to some extent, no? Yes.
ROCKS comes with a plethora of cluster management tools that make things
easy to manage across a number of nodes, whether they're compute nodes, PVFS
nodes, or anything else. The most powerful tool in the ROCKS administrator's
arsenal is the cluster-fork command. Here are a few of the same
commands I ran on pvfs-0-0, only this time they're being run on pvfs-0-1
through pvfs-0-15:


cluster-fork --nodes "pvfs-0-%d:1-15" mkfs -t ext3 /dev/pvmeta/lvol1
cluster-fork --nodes "pvfs-0-%d:1-15" mkdir /pvfs-data
cluster-fork --nodes "pvfs-0-%d:1-15" mount /dev/pvmeta/lvol1 /pvfs-data

The same approach sets up the compute nodes as proper clients of the newly
configured PVFS metaserver. These clients have a pvfstab file left over from
testing, so you'll notice that I use the overwrite redirection operator,
>, to make sure that what I'm echoing is the only thing in the file.


cluster-fork --nodes "compute-0-%d:0-17" 'echo "pvfs-0-0.local:/pvfs-meta
/mnt/pvfs pvfs port=3000 0 0" > /etc/pvfstab'
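
To make sure the file landed the way I expected on every node, a quick
read-back with the same tool does the trick:

# Sanity check: print the new pvfstab from every compute node
cluster-fork --nodes "compute-0-%d:0-17" cat /etc/pvfstab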

Two quick things to remember about cluster-fork. One is that you can specify
ranges by using %d as a placeholder and then giving the range (i.e.,
%d:range); you can also supply a comma-separated list of numbers
instead of a strict range. The second thing to remember is to be very
careful about passing multiple commands to cluster-fork, especially
destructive ones. If you don't use quotes and you run something like this:


cluster-fork --nodes "pvfs-0-%d:5,6,9,10" mount /dev/hdg3 /state/partition1;
mount /dev/hde1 /state/partition2

you're going to wind up mounting /dev/hdg3 on the pvfs nodes you specify,
and then mounting /dev/hde1 on the local machine you're typing on. As an
administrator working in waters this unfamiliar, you sometimes forget these
things, but it's still bash, so stay alert!
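
The fix is simply to quote the whole thing, so that the shell on each remote
node sees the semicolon instead of your local shell. A sketch of the safe
version of the command above:

# Quoting keeps both mounts running on the remote nodes
cluster-fork --nodes "pvfs-0-%d:5,6,9,10" 'mount /dev/hdg3 /state/partition1; mount /dev/hde1 /state/partition2'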

In Conclusion

Using this method, I was able to grow the scratch space available to my
cluster from around 500GB to about 1.5TB, and so far everything is humming
along nicely on the production cluster. There are details I haven't covered
here, like the fact that you have to restart all of the IOD daemons on the
IO nodes and the pvfsd daemons on the clients, but the reason for leaving
them out is twofold: one, the last thing I do before launching anything into
production is make sure I haven't done anything that will keep the cluster
from rebooting, by rebooting the cluster; and two, rebooting the cluster
restarts everything that needs to be restarted in order for things to work.
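
If you'd rather restart the daemons by hand than bounce the whole cluster,
cluster-fork handles that too. This is only a sketch, and it assumes the init
scripts are named "iod", "mgr", and "pvfsd"; check /etc/init.d on your nodes
before trusting it:

# Restart the manager on the new metaserver (run this on pvfs-0-0)
/etc/init.d/mgr restart
# Then the IO daemons on the IO nodes and the client daemons on the compute nodes
cluster-fork --nodes "pvfs-0-%d:0-15" /etc/init.d/iod restart
cluster-fork --nodes "compute-0-%d:0-17" /etc/init.d/pvfsd restart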

I hope this article has piqued an interest in clustering with ROCKS, and
using PVFS in your cluster.