One of my major interests is programming in R. I took a few days off for Christmas break, and in between talking with family, watching old movies, opening presents, and spreading Christmas cheer, I decided it would be fun to use Chef to create a parallel computing environment suitable for use with R. As part of this project, I wrote:
- A Chef cookbook to install and configure R.
- An R package provider to easily install packages from CRAN.
- A Chef cookbook to install and configure MPI.
This blog post outlines some of the highlights of this fun little project. It isn’t a step-by-step tutorial for creating such an environment. I assume that you already have:
- General familiarity with R and Chef. I talk about cookbooks, recipes, and roles and assume you know what to do with them.
- A means of provisioning computation nodes, each already configured with hostnames that are resolvable to network addresses by other computation nodes.
I used Vagrant to provision a few VMs for testing these recipes and a small set of Vagrant-specific cookbooks to ensure that each node had a resolvable hostname.
R Cookbook
The first item every node in the cluster will need is R. The R cookbook on the Opscode Community site is a bit out of date. I’ve created a newer version that can be found here:
http://github.com/stevendanna/cookbook-r
This cookbook does the following:
- Installs R from either the CRAN APT repository or from source depending on the platform.
- Defines a system-wide default CRAN mirror using an Rprofile.site template.
- Contains an R package provider that can install R packages available on CRAN.
Currently this cookbook is Linux-centric and best suited for use on Ubuntu or Debian.
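To make the Rprofile.site idea a bit more concrete, here is a minimal sketch in plain Ruby, outside of Chef, of how such a template could be rendered. The template text, the `render_rprofile` helper, and the mirror URL are my own illustrative assumptions and not the contents of the cookbook.

```ruby
require 'erb'

# Hypothetical template body; the cookbook's real Rprofile.site template may differ.
RPROFILE_TEMPLATE = <<~'TEMPLATE'
  ## Rendered by Chef; local changes will be overwritten.
  local({
    r <- getOption("repos")
    r["CRAN"] <- "<%= cran_mirror %>"
    options(repos = r)
  })
TEMPLATE

# Render the template for a given mirror URL (helper name is illustrative).
def render_rprofile(cran_mirror)
  ERB.new(RPROFILE_TEMPLATE).result_with_hash(cran_mirror: cran_mirror)
end

puts render_rprofile('https://cloud.r-project.org')
```

Because R reads Rprofile.site at startup, every user on the node would pick up the same default mirror, which in turn lets packages be installed non-interactively.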
R Package Provider
My R parallel computing environment relies on the ‘snow’ and ‘Rmpi’ packages to manage communication with computation nodes. Further, I often need additional packages from CRAN when working with my cluster. The R package provider in my R cookbook allows for easy, automated installation of R packages. It was written using Chef’s Lightweight Resources and Providers (LWRP) DSL. The current version provides only the minimal necessary functionality.
It supports three actions: :install, :upgrade, and :remove; however, it is not very feature-rich. A few features I would like to add include:
- The ability to specify the CRAN mirror or repository.
- The ability to install local packages.
- The ability to install packages on a per-user basis.
Despite these shortcomings, the current provider works well enough for my purposes and is a good example of Chef’s high level of ‘Whip-It-Up-itude.’ Within a few minutes, I was able to go from some quick-and-dirty R code to a Chef provider that I can iteratively improve as my needs become more complex. While I did not include it in the repository linked above, I was able to create a small R::snow recipe that does little more than use this provider.
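To give a feel for what a package provider like this does under the hood, here is a rough sketch in plain Ruby that maps the three actions onto R's standard `install.packages`, `update.packages`, and `remove.packages` functions and wraps them in an `Rscript -e` call. The `r_package_command` helper and the `R_EXPRESSIONS` table are hypothetical names of my own; this is not the provider's actual code.

```ruby
require 'shellwords'

# Map each provider action to the R expression that performs it.
# These are standard functions from R's utils package.
R_EXPRESSIONS = {
  install: 'install.packages("%<pkg>s")',
  upgrade: 'update.packages(oldPkgs = "%<pkg>s", ask = FALSE)',
  remove:  'remove.packages("%<pkg>s")'
}.freeze

# Simplified check on package names (letters, digits, and dots only),
# so that nothing unexpected ends up inside the R expression.
PACKAGE_NAME = /\A[A-Za-z][A-Za-z0-9.]*\z/

# Build the shell command for one action (hypothetical helper).
def r_package_command(action, package)
  template = R_EXPRESSIONS.fetch(action) do
    raise ArgumentError, "unknown action: #{action}"
  end
  raise ArgumentError, "invalid package name: #{package}" unless package.match?(PACKAGE_NAME)

  expression = format(template, pkg: package)
  "Rscript -e #{Shellwords.escape(expression)}"
end

puts r_package_command(:install, 'snow')
puts r_package_command(:upgrade, 'Rmpi')
```

A real provider would also want to check whether the package is already installed before acting, so that repeated chef-client runs stay idempotent.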
MPI Cookbook
The R package ‘snow’ can use various backends to communicate with computation nodes. The default method for most commands is “Message Passing Interface” (MPI). MPI is also generally useful outside of R as you can easily use MPI utilities such as mpirun to execute a command on multiple computation nodes and use MPI’s C libraries to write programs designed for parallel computation.
MPI has multiple implementations. I’ve written an MPI cookbook that can install openmpi on Ubuntu. It can be found here:
https://github.com/stevendanna/cookbook-mpi
This cookbook does the following:
- Installs openmpi on Ubuntu.
- Constructs a list of computation nodes within your environment using search and then renders a default hostfile that is used by MPI commands such as mpirun.
- Sets basic configuration options that I have found useful.
While MPI does not natively contain the concepts of master and slaves (every MPI node can talk to any other MPI node), I found it useful to include the concept of a node’s “MPI role.” I use this role to populate the default hostfile, since there are often cases where you might want MPI installed on a machine within your environment but would not want that machine being used for computation.
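As a rough illustration of the hostfile idea, the plain-Ruby sketch below filters a list of nodes by their MPI role and renders one `hostname slots=N` line per computation node, which is the format Open MPI hostfiles use. The `Node` struct, the hostnames, and the `render_hostfile` helper are assumptions standing in for the results of a Chef search; the cookbook's own template may be organized differently.

```ruby
# Hypothetical node data, standing in for the results of a Chef search.
Node = Struct.new(:hostname, :mpi_role, :cpus)

NODES = [
  Node.new('master.example.com', 'master', 2),
  Node.new('node2.example.com',  'slave',  4),
  Node.new('node1.example.com',  'slave',  4),
  Node.new('db.example.com',     nil,      8) # MPI installed, but not used for computation
].freeze

# Render an Open MPI style hostfile: one "hostname slots=N" line
# for each node whose MPI role matches the requested role.
def render_hostfile(nodes, role: 'slave')
  lines = nodes.select { |n| n.mpi_role == role }
               .sort_by(&:hostname)
               .map { |n| "#{n.hostname} slots=#{n.cpus}" }
  lines.join("\n") + "\n"
end

puts render_hostfile(NODES)
```

Sorting by hostname keeps the rendered file stable between chef-client runs, so the file only changes when the set of computation nodes does.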
Putting it all Together
I put this all together using two roles: snow_master and snow_slave.
By applying the snow_slave role to all of the nodes I want within my computation cluster and the snow_master role to the machine I will use to launch jobs, I can quickly bring up a new cluster of machines suitable for doing parallel computation in R using snow.
The ssh_known_hosts recipe is directly from the Opscode community site and helps make ssh connections between the nodes a bit smoother.
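For readers who want a starting point, here is a hypothetical sketch of a two-role layout like this, written as plain Ruby that emits Chef role JSON. The run list entries, the `mpi`/`role` attribute name, and the `snow_role` helper are all assumptions for illustration; they are not the actual contents of my roles, which differ between master and slave.

```ruby
require 'json'

# Recipes shared by both roles. All names here are assumptions for illustration.
COMMON_RUN_LIST = %w(
  recipe[ssh_known_hosts]
  recipe[R]
  recipe[R::snow]
  recipe[mpi]
).freeze

# Build a role hash in Chef's JSON role format (hypothetical helper).
# The "mpi" => "role" attribute is an assumed name for the node's MPI role.
def snow_role(name, mpi_role)
  {
    'name' => name,
    'chef_type' => 'role',
    'json_class' => 'Chef::Role',
    'description' => "snow cluster #{mpi_role} node",
    'default_attributes' => { 'mpi' => { 'role' => mpi_role } },
    'run_list' => COMMON_RUN_LIST
  }
end

puts JSON.pretty_generate(snow_role('snow_master', 'master'))
puts JSON.pretty_generate(snow_role('snow_slave', 'slave'))
```

The only difference between the two generated roles in this sketch is the MPI role attribute, which is what a hostfile template would key on when deciding which nodes to use for computation.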
Twelve Days of Christmas
As an example of how easy it is to do simple parallel computation using snow, consider the following Christmas-themed R function.
Assuming chef-client has been run on all of our nodes, we can simply log into our ‘snow master’, boot up R, and run the following code:
Conclusion
Overall this was a fun project to do over break. It took a handful of hours to complete, most of which was dedicated to a few problems not discussed in this post:
- Making hostnames resolvable and other network oddities within my Vagrant test environment.
- Deciding which implementation of MPI to use.
I personally would like to see further work done around using Chef to manage large R environments. Some work I hope to do in the not too distant future includes:
- Expand the R cookbook to better support OS X.
- Add support to detect the latest version when doing a source install.
- Expand the R package provider to include more of the features available when installing packages from within R.
- Write an R Ohai plugin that would collect information about:
  - R version and capabilities
  - R packages installed
Happy Holidays!