Author: Preston St. Pierre
I administer a compute cluster, and I use the Parallel Virtual File System (PVFS), which comes with
the ROCKS Linux distribution, to provide a parallel IO system. For those who
are not familiar, check out my earlier ROCKS article, as well as my earlier
article about PVFS.
My cluster is built on slightly older hardware: dual PIII nodes, each with two
hard drives. Initially, I thought having two drives was great news, because I could contribute
the entire capacity of the second drive, along with the unused capacity of the
first, as a large pool of scratch space for the cluster users,
some of whom would be more than happy to have it. However, what I didn't
realize was that PVFS can only use a single mount point for its data storage.
I couldn't tell PVFS to use both /dev/hda3 and /dev/hdb1. Then someone on
the ROCKS list suggested I consider RAID, though he hadn't tried it
himself. I was game, and it works wonderfully. So here's how I did it.
Moving the metadata server off the head node
On a ROCKS cluster, the head node is called frontend-0-0. In testing, this
is where my PVFS metadata server lived. However, the frontend also serves up
home directories to the rest of the cluster, and handles intercommunications
between the scheduler and workload management daemons across the cluster. It
also gathers statistics on the cluster, pushes out administrative changes to
the nodes, and runs a web server. That’s more than enough load without
coordinating all of the PVFS clients and matching their requests with the 16
PVFS IO nodes. On frontend-0-0, I just turned off the “mgr” and “pvfsd”
services. I'll turn “pvfsd” back on after I configure the new
metaserver, which will let frontend-0-0 act as a PVFS client. It needs to be
a client so that users can collect all of their data from there without
logging into other nodes.
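For completeness, that service shuffle on frontend-0-0 is just the usual Red Hat-style dance. Here's a rough sketch, assuming the “mgr” and “pvfsd” init scripts that the PVFS packages install:
service mgr stop        # stop the PVFS metadata manager
service pvfsd stop      # stop the PVFS client daemon
chkconfig mgr off       # keep the manager from coming back at boot
chkconfig pvfsd off
Later, once the new metaserver is configured, chkconfig pvfsd on and service pvfsd start bring the frontend back as a client only.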
The new metadata server is pvfs-0-0, which (as the name implies) is also a
PVFS IO node. It has two physical disks installed, as do all of the PVFS IO
nodes, so I’ll configure them in a RAID0 setup to maximize capacity, and then
configure the metadata server.
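I won't walk through the metadata-server configuration itself in this article. With the standard PVFS1 tools it looks roughly like the sketch below; treat it as an assumption about the stock scripts rather than a transcript of what I ran, since mkmgrconf prompts interactively for the metadata directory and the list of IO nodes:
mkdir /pvfs-meta        # metadata lives here, on pvfs-0-0
cd /pvfs-meta
mkmgrconf               # writes .pvfsdir and .iodtab after prompting
chkconfig mgr on
service mgr start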
Setting up RAID
First, I took a look at the state of things in diskland on what will
become the new PVFS metaserver. It will also be an IO node.
[root@pvfs-0-0 init.d]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdg1             5.8G  2.2G  3.4G  40% /
/dev/hde1              57G  8.8G   45G  17% /state/partition2
/dev/hdg3              50G   33M   48G   1% /state/partition1
none                  501M     0  501M   0% /dev/shm
I'm going to use /dev/hde1 and /dev/hdg3 in my RAID 0 volume. Having
never done this on these systems, I first needed to run vgscan, which creates
/etc/lvmtab and /etc/lvmtab.d, the files the later LVM commands rely on.
Then I immediately ran pvcreate, which initializes physical volumes on the
disk partitions you specify.
[root@pvfs-0-0 init.d]# vgscan
vgscan -- reading all physical volumes (this may take a while...)
vgscan -- "/etc/lvmtab" and "/etc/lvmtab.d" successfully created
vgscan -- WARNING: This program does not do a VGDA backup of your volume group
[root@pvfs-0-0 init.d]# pvcreate /dev/hde1 /dev/hdg3
pvcreate -- physical volume "/dev/hde1" successfully created
pvcreate -- physical volume "/dev/hdg3" successfully created
Next I have to create a “volume group”, which takes the physical volumes and
essentially puts them under a unified label. In this case, I'm naming the
volume group “pvmeta”.
[root@pvfs-0-0 init.d]# vgcreate pvmeta /dev/hde1 /dev/hdg3
vgcreate -- INFO: using default physical extent size 32 MB
vgcreate -- INFO: maximum logical volume size is 2 Terabyte
vgcreate -- doing automatic backup of volume group "pvmeta"
vgcreate -- volume group "pvmeta" successfully created and activated
Now it’s time to actually create a logical volume — something that can be
treated like a regular old disk partition, so I can go about the business of
creating a filesystem, assigning a mount point, etc.
[root@pvfs-0-0 init.d]# lvcreate -L 100G pvmeta
lvcreate -- doing automatic backup of "pvmeta"
lvcreate -- logical volume "/dev/pvmeta/lvol1" successfully created
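One aside: a plain lvcreate like the one above builds a linear (concatenated) volume. If you want true RAID 0-style striping across the two disks, LVM can do that as well; the stripe count and stripe size below are only illustrative, not what I used:
lvcreate -i 2 -I 64 -L 100G pvmeta   # 2 stripes, 64KB stripe size
Striping can help sequential IO throughput, but a linear volume is perfectly fine if all you're after is the combined capacity.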
With that out of the way, it's time to put a filesystem on the new volume so
that PVFS can use it for data storage. This is done using the standard mkfs
command, just as you would do with any other partition.
[root@pvfs-0-0 init.d]# mkfs -t ext3 /dev/pvmeta/lvol1
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13107200 inodes, 26214400 blocks
1310720 blocks (5.00%) reserved for the super user
First data block=0
800 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
All that's left now is to mount our new volume under the /pvfs-data
directory, and check to make sure everything worked.
One thing you'll notice is that you don't get the full 100GB of space you
asked for when you created the volume. That's mostly filesystem overhead:
ext3 reserves 5% of the blocks for the superuser (as the mkfs output above
shows), and the journal and inode tables take their own slice before PVFS
stores a single byte.
[root@pvfs-0-0 /]# mount /dev/pvmeta/lvol1 /pvfs-data
[root@pvfs-0-0 /]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdg1             5.8G  2.2G  3.4G  40% /
/dev/hde1              16T   16T     0 100% /state/partition2
/dev/hdg3              16T   16T     0 100% /state/partition1
none                  501M     0  501M   0% /dev/shm
/dev/pvmeta/lvol1      99G   33M   94G   1% /pvfs-data
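Since I reboot the whole cluster before putting it into production (see the conclusion), the new volume also has to come back on its own. I haven't shown that step here, but an /etc/fstab entry along these lines on each IO node would take care of it:
/dev/pvmeta/lvol1   /pvfs-data   ext3   defaults   0 0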
Using the ROCKS tools to propagate changes
This is easy enough to set up on a single box, but now what? I have 16 IO
nodes! Certainly there’s some way to automate this to some extent, no? Yes.
ROCKS comes with a plethora of cluster management tools that make it easy to
manage any number of nodes, whether they're compute nodes, PVFS nodes, or
something else. The most powerful tool in the ROCKS administrator's arsenal
is the cluster-fork command. Here are a few of the same commands I ran on
pvfs-0-0, only this time they're being run on pvfs-0-1 through pvfs-0-15:
cluster-fork --nodes "pvfs-0-%d:1-15" mkfs -t ext3 /dev/pvmeta/lvol1
cluster-fork --nodes "pvfs-0-%d:1-15" mkdir /pvfs-data
cluster-fork --nodes "pvfs-0-%d:1-15" mount /dev/pvmeta/lvol1 /pvfs-data
cluster-fork is also how I set up the compute nodes as proper clients of the
newly configured PVFS metaserver. These clients have a pvfstab file left over
from testing, so you'll notice that I use the overwrite redirection operator
(>) to make sure that what I'm echoing is the only thing in the file.
cluster-fork --nodes "compute-0-%d:0-17" 'echo "pvfs-0-0.local:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0" > /etc/pvfstab'
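A quick sanity check that the file landed identically everywhere doesn't hurt:
cluster-fork --nodes "compute-0-%d:0-17" cat /etc/pvfstab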
Two quick things to remember about cluster-fork. One is that you specify
ranges by using %d as a placeholder and then giving the range (i.e.
%d:range); you can also supply a comma-separated list of numbers
instead of a strict range. The second is to be very careful about passing
multiple commands to cluster-fork, especially destructive ones. If you don't
use quotes and you run something like this:
cluster-fork --nodes "pvfs-0-%d:5,6,9,10" mount /dev/hdg3 /state/partition1; mount /dev/hde1 /state/partition2
You're going to wind up mounting /dev/hdg3 on the PVFS nodes you specified,
and then mounting /dev/hde1 on the local machine you're typing on. As an
administrator working in unfamiliar waters, it's easy to forget these things,
but it's still bash, so stay alert!
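The safe version quotes the whole command string, so both mounts run on the remote nodes:
cluster-fork --nodes "pvfs-0-%d:5,6,9,10" 'mount /dev/hdg3 /state/partition1; mount /dev/hde1 /state/partition2'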
In Conclusion
Using this method, I was able to grow the scratch space available to my
cluster from around 500GB to about 1.5TB. So far, everything is humming
along nicely on the production cluster. There are details I haven't covered
here, like the fact that you have to restart all of the iod daemons on the
IO nodes and the pvfsd daemons on the clients, but the reason I skipped them
is twofold: one, the last thing I do before launching anything into production
is make sure I haven't done anything that will keep the cluster from
rebooting, by rebooting the cluster. Two, rebooting the cluster
restarts everything that needs to be restarted in order for things to work.
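If you'd rather not reboot, the restarts can also be pushed out with cluster-fork; here's a sketch, assuming the iod and pvfsd init scripts installed by the PVFS packages:
cluster-fork --nodes "pvfs-0-%d:0-15" service iod restart
cluster-fork --nodes "compute-0-%d:0-17" service pvfsd restart
service mgr restart     # on pvfs-0-0, the new metadata server
service pvfsd start     # on frontend-0-0, now a plain client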
I hope this article has piqued your interest in clustering with ROCKS, and
in using PVFS in your cluster.