To follow this tutorial, it is recommended that you already be familiar with the basics of Hadoop; a very useful getting-started guide can be found at Hadoop's homepage: http://hadoop.apache.org/. You should also be familiar with Amazon EC2 internals and instance definitions.
When you register an account at Amazon AWS, you receive 750 free hours to run t1.micro instances; unfortunately, you can't successfully run Hadoop on such small machines.
In the following steps, a command starting with $ should be executed on your local machine, and one starting with # on the EC2 instance.
Create an X.509 Certificate
Since we are going to use the EC2 tools, our AWS account needs a valid X.509 certificate:
- Create the .ec2 folder:
$ mkdir ~/.ec2
- Log in to the AWS console, select "Security Credentials", and under "Access Credentials" click on "X.509 Certificates";
- You have two options:
- Create certificate using command line:
$ cd ~/.ec2; openssl genrsa -des3 -out my-pk.pem 2048
$ openssl rsa -in my-pk.pem -out my-pk-unencrypt.pem
$ openssl req -new -x509 -key my-pk.pem -out my-cert.pem -days 1095
- This only works if your machine's date and time are set correctly.
- Or create the certificate through the site and download the private key (remember to put it in ~/.ec2).
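- Either way, you can sanity-check the resulting certificate before going on (a quick check, assuming the file names used above):
$ openssl x509 -noout -subject -dates -in ~/.ec2/my-cert.pem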
Setting up Amazon EC2-Tools
- Download and unpack the EC2 API tools (ec2-api-tools);
- Edit your ~/.profile to export all variables needed by the EC2 tools, so you don't have to do it every time you open a prompt:
- Here is an example of what should be appended to the ~/.profile file:
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- export EC2_HOME=~/ec2-api-tools-x.y.z (replace with the actual unpacked directory name; the shell does not expand wildcards in variable assignments)
- export PATH=$PATH:$EC2_HOME/bin
- export EC2_PRIVATE_KEY=~/.ec2/my-pk-unencrypt.pem
- export EC2_CERT=~/.ec2/my-cert.pem
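- After editing, reload the profile and check that the tools can reach AWS (a simple sanity check; ec2-describe-regions just lists the available regions):
$ source ~/.profile
$ ec2-describe-regions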
- To access an instance you need to be authenticated (for obvious security reasons), so you have to create a key pair (public and private keys):
- At https://console.aws.amazon.com/ec2/home, click on "Key Pairs", or
- You can run the following commands:
$ ec2-add-keypair my-keypair | grep -v KEYPAIR > ~/.ec2/id_rsa-keypair
$ chmod 600 ~/.ec2/id_rsa-keypair
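- You can then confirm that the key pair was registered with AWS (the fingerprint printed should match the one shown at creation time):
$ ec2-describe-keypairs my-keypair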
Setting up Hadoop
After downloading and unpacking Hadoop, you have to edit the EC2 configuration script at src/contrib/ec2/bin/hadoop-ec2-env.sh (a sketch of the relevant settings is shown after the list below).
- AWS variables
- These variables are related to your AWS account (AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY); they can be found by logging into your account, under Security Credentials;
- The AWS_ACCOUNT_ID is your 12-digit account number.
- Security variables
- The security variables (EC2_KEYDIR, KEY_NAME, PRIVATE_KEY_PATH) are the ones related to launching and accessing an EC2 instance;
- You have to save the private key into your EC2_KEYDIR path.
- Select an AMI
- Depending on the Hadoop version you want to run (HADOOP_VERSION) and the instance type (INSTANCE_TYPE), you should use a proper image to deploy your instance:
- There are many public AMI images you can use (they should suit the needs of most users); to list them, type
$ ec2-describe-images -x all | grep hadoop
- S3_BUCKET: the bucket where the image you will use is located, e.g. hadoop-images;
- ARCH: the architecture of the AMI image you have chosen (i386 or x86_64); and
- BASE_AMI_IMAGE: the unique code that identifies an AMI image, e.g. ami-2b5fba42.
- You can also provide a link to where the Java binary is located (JAVA_BINARY_URL); for instance, if you have JAVA_VERSION=1.6.0_29, an option is to use JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin.
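Putting it all together, here is a minimal sketch of how the edited hadoop-ec2-env.sh might look. The account ID, access keys, Hadoop version, and instance type below are placeholders (the bucket, AMI, and Java values reuse the examples above); substitute your own:
# hadoop-ec2-env.sh (excerpt) -- all IDs and keys are placeholders
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
EC2_KEYDIR=~/.ec2
KEY_NAME=my-keypair
PRIVATE_KEY_PATH=~/.ec2/id_rsa-keypair
HADOOP_VERSION=0.20.2        # assumption: use the version you downloaded
S3_BUCKET=hadoop-images
ARCH=i386
BASE_AMI_IMAGE=ami-2b5fba42
INSTANCE_TYPE=m1.small       # assumption: any non-micro type should do
JAVA_VERSION=1.6.0_29
JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin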
Running!
- You can add the content of src/contrib/ec2/bin to your PATH variable so you will be able to run the commands independently of where the prompt is opened;
- To launch an EC2 cluster and start Hadoop, use the following commands. The arguments to launch-cluster are the cluster name (hadoop-test) and the number of slaves (2). When the cluster boots, the public DNS name will be printed to the console; the login command then logs you into the master node.
$ hadoop-ec2 launch-cluster hadoop-test 2
$ hadoop-ec2 login hadoop-test
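- Once logged in, you can check that the slave nodes have registered with the master (a quick sanity check; dfsadmin -report lists the live datanodes):
# /usr/local/hadoop-*/bin/hadoop dfsadmin -report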
- For example, to test your cluster, you can run the pi calculation that is already provided by hadoop-*-examples.jar:
# cd /usr/local/hadoop-*
# bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
- When you finish your tests, remember to terminate the cluster, since Amazon charges per instance-hour:
$ hadoop-ec2 terminate-cluster hadoop-test
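- You can then confirm that no instances were left running (ec2-describe-instances lists all your instances and their current states):
$ ec2-describe-instances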