Getting Started with Whirr™

The Whirr CLI provides the most convenient way to launch clusters. For the programmatic interface, see the javadoc.

Also see Whirr in 5 Minutes for the condensed instructions for getting started (with ZooKeeper as the example).

Prerequisites

  • Java 6
  • An account with a cloud provider, such as Amazon EC2 or Rackspace Cloud Servers
  • An SSH client

Install Whirr

Download or build Whirr.

You can test that Whirr is working by running:

% bin/whirr version

This displays the version of Whirr that is installed.

To get usage instructions, type:

% bin/whirr

Set up your credentials

% mkdir -p ~/.whirr
% cp conf/credentials.sample ~/.whirr/credentials

Edit ~/.whirr/credentials using your editor of choice and add the API connection credentials as required.
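Because this file holds API secrets, you may want to make it readable by your user only. The following is a minimal sketch, not part of Whirr: secure_credentials is a hypothetical helper name, and it uses nothing beyond the standard chmod command.

```shell
# Hypothetical helper (not part of Whirr): make a credentials file
# readable and writable by its owner only.
secure_credentials() {
    cred_file="$1"
    if [ ! -f "$cred_file" ]; then
        echo "No credentials file at $cred_file" >&2
        return 1
    fi
    # 600 = read/write for the owner, no access for group or others.
    chmod 600 "$cred_file"
}

# Usage, assuming the file was copied as shown above:
#   secure_credentials ~/.whirr/credentials
```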

Configure a Hadoop cluster

First, create a properties file to define the cluster. The name doesn't matter, but here we will assume it is called hadoop.properties and located in your home directory. This file defines a cluster with a single machine for the namenode and jobtracker, and a further machine for a datanode and tasktracker. You can see how to launch other services by consulting the sample configurations in the recipes directory of the distribution.

whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
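If you prefer to create the file from the shell instead of an editor, a here-document is one way to do it. This is a sketch under assumptions: PROPS_FILE is a hypothetical variable introduced for illustration, and the script overwrites any existing file at that path. The quoted 'EOF' delimiter matters, because an unquoted one would make the shell try to expand ${sys:user.home} itself.

```shell
# PROPS_FILE is a hypothetical variable; by default the file is written to
# hadoop.properties in your home directory, overwriting any existing copy.
PROPS_FILE="${PROPS_FILE:-$HOME/hadoop.properties}"

# Quoting the delimiter ('EOF') disables shell expansion inside the
# here-document, so ${sys:user.home} is written to the file literally.
cat > "$PROPS_FILE" <<'EOF'
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
EOF

echo "Wrote $PROPS_FILE"
```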

Note that we haven't specified a particular cloud image, since Whirr provides a default for each provider which should work well enough. However, for larger clusters you will likely use larger hardware sizes or particular images. See the recipes files and the Configuration Guide for details.

In this configuration file the cloud identity and credential are read from environment variables; you can equally well put them in the configuration file if you wish.

The private-key-file and public-key-file properties specify an SSH keypair. You can generate a keypair with:

% ssh-keygen -t rsa -P ''

You should use only RSA SSH keys, since DSA keys are not accepted yet.
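To check that an existing public key is an RSA key before you launch, you can look at the key type at the start of the file. The sketch below assumes the usual OpenSSH public key layout, in which RSA keys begin with "ssh-rsa"; is_rsa_pubkey is a hypothetical helper name, not a Whirr or OpenSSH command.

```shell
# Hypothetical helper: succeed only if the file looks like an OpenSSH
# public key of type RSA (its first line starts with "ssh-rsa ").
is_rsa_pubkey() {
    pub_file="$1"
    [ -f "$pub_file" ] || return 1
    case "$(head -n 1 "$pub_file")" in
        "ssh-rsa "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   is_rsa_pubkey ~/.ssh/id_rsa.pub && echo "RSA key found"
```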

Note: the keypair specified by these properties is not the same as the AWS keypair generated with the ec2-add-keypair command or the AWS Management Console (since these don't place both of the keys on your local machine). The PEM-encoded X.509 Certificate and Private Key (e.g. pk-XXXXXX.pem) cannot be used as a keypair either.

Launch a Hadoop cluster

Run the following command to launch a cluster:

% bin/whirr launch-cluster --config hadoop.properties

Messages will be logged to the console as the cluster starts. You can see debug-level logging in a file named whirr.log in the directory you ran the whirr command from.

A message will be printed out when the cluster has started, with a URL that you can use to access the web UI.

Run a proxy

For security reasons, traffic from the network your client is running on is proxied through the master node of the cluster using an SSH tunnel (a SOCKS proxy on port 6666).

A script to launch the proxy is created when you launch the cluster, and may be found in ~/.whirr/<cluster-name>. Run it as follows (in a new terminal window):

% . ~/.whirr/myhadoopcluster/hadoop-proxy.sh

To stop the proxy, just kill the process with Ctrl-C.

Web browsers need to be configured to use this proxy too, so you can view pages served by worker nodes in the cluster. The most convenient way to do this is to use a proxy auto-config (PAC) file, such as this one for Hadoop EC2 clusters.

If you are using Firefox, then you may find FoxyProxy useful for managing PAC files.

Run a MapReduce job

After you launch a cluster, a hadoop-site.xml file is created in the directory ~/.whirr/<cluster-name>. You can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable. (It is also possible to set the configuration file to use by passing it as a -conf option to Hadoop Tools):

% export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster

You should now be able to browse HDFS:

% hadoop fs -ls /

Note that the version of Hadoop installed locally should match the version installed on the cluster. You should also make sure that the HADOOP_HOME environment variable is set.
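A quick preflight check can catch a missing variable before a job fails. The following is a sketch only: check_hadoop_env is a hypothetical helper, and it verifies just the two environment variables described above, not the Hadoop version itself.

```shell
# Hypothetical preflight check: confirm that HADOOP_HOME is set and that
# HADOOP_CONF_DIR points at a directory containing hadoop-site.xml.
check_hadoop_env() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    if [ ! -f "${HADOOP_CONF_DIR:-}/hadoop-site.xml" ]; then
        echo "No hadoop-site.xml found in HADOOP_CONF_DIR" >&2
        return 1
    fi
    echo "Using Hadoop in $HADOOP_HOME with configuration from $HADOOP_CONF_DIR"
}

# Usage:
#   check_hadoop_env && hadoop fs -ls /
```

To compare versions, you can run hadoop version locally and check the result against the version installed on the cluster.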

Here's how you can run a MapReduce job:

% hadoop fs -mkdir input
% hadoop fs -put $HADOOP_HOME/LICENSE.txt input
% hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output
% hadoop fs -cat output/part-* | head

Configuration

Whirr is configured using a properties file, and optionally using command line arguments when using the CLI. Command line arguments take precedence over properties specified in a properties file.

For example, instead of using the properties file above, you could launch a Hadoop cluster with the following command line (note that the whirr. prefix for properties is not reflected in the command line argument):

% bin/whirr launch-cluster \
    --cluster-name=myhadoopcluster \
    --instance-templates='1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker' \
    --provider=aws-ec2 \
    --identity=$AWS_ACCESS_KEY_ID \
    --credential=$AWS_SECRET_ACCESS_KEY \
    --private-key-file=~/.ssh/id_rsa \
    --public-key-file=~/.ssh/id_rsa.pub

Notice that here we took advantage of the fact that the AWS credentials have been defined in environment variables.
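The naming rule described above (drop the whirr. prefix and add two leading dashes) can be illustrated with a short sed expression. This is a sketch for illustration only: props_to_args is a hypothetical helper, not a Whirr feature, and it ignores any line that does not start with whirr.

```shell
# Hypothetical helper: print each whirr.* property in a file as the
# equivalent command line option, e.g. whirr.provider=aws-ec2
# becomes --provider=aws-ec2. Lines without the whirr. prefix are skipped.
props_to_args() {
    sed -n 's/^whirr\.\([^=]*\)=\(.*\)$/--\1=\2/p' "$1"
}

# Usage:
#   props_to_args hadoop.properties
```

Values are printed unchanged, so a placeholder such as ${sys:user.home} would still need to be replaced with a real path before being passed on the command line.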

See the configuration guide for a list of all the configuration properties you can set.

Destroy a cluster

When you've finished using a cluster, you can terminate the instances and clean up resources with the following command.

WARNING: All data will be deleted when you destroy the cluster.

% bin/whirr destroy-cluster --config hadoop.properties

At this point you can shut down the SSH proxy to the cluster if you started one earlier.