Wednesday, March 14, 2012

Hadoop + Amazon EC2 - An updated tutorial

There is an old tutorial on Hadoop's wiki: http://wiki.apache.org/hadoop/AmazonEC2. I recently had to follow it and noticed that it doesn't cover some newer Amazon functionality.

To follow this tutorial, you should already be familiar with the basics of Hadoop; a very useful getting-started tutorial can be found on Hadoop's homepage: http://hadoop.apache.org/. You should also be familiar with at least how Amazon EC2 works and how its instance types are defined.

When you register an account with Amazon AWS, you receive 750 hours to run t1.micro instances; unfortunately, you can't successfully run Hadoop on such machines.

In the following steps, a command that starts with $ should be executed on the local machine, and one that starts with # on the EC2 instance.

Create an X.509 Certificate


Since we are going to use ec2-tools, our AWS account needs a valid X.509 certificate:
  • Create the .ec2 folder:
  • $ mkdir ~/.ec2
  • Log in to AWS
    • Select "Security Credentials" and, under "Access Credentials", click on "X.509 Certificates";
    • You have two options:
      • Create certificate using command line:
      • $ cd ~/.ec2; openssl genrsa -des3 -out my-pk.pem 2048
        $ openssl rsa -in my-pk.pem -out my-pk-unencrypt.pem
        $ openssl req -new -x509 -key my-pk.pem -out my-cert.pem -days 1095
        • This only works if your machine's date and time are set correctly.
      • Create the certificate using the site and download the private key (remember to put it in ~/.ec2).
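
As a rough, non-interactive variant of the command-line option, the sketch below generates a throwaway key and certificate in a temporary directory and prints the certificate's validity window, which you can compare with the real date if you suspect the machine clock is off. The -subj value, the temporary directory and the cert_status variable are assumptions made for this example, and the passphrase step is skipped on purpose.

```shell
#!/bin/sh
# Work in a throwaway directory so ~/.ec2 is left untouched (assumption for this sketch).
workdir=$(mktemp -d)

# Generate an unencrypted 2048-bit RSA key (no -des3, so no passphrase prompt).
openssl genrsa -out "$workdir/my-pk.pem" 2048 2>/dev/null

# Create a self-signed certificate valid for 1095 days.
# -subj answers the interactive questions; its value here is a placeholder.
openssl req -new -x509 -key "$workdir/my-pk.pem" \
    -out "$workdir/my-cert.pem" -days 1095 -subj "/CN=hadoop-ec2-example"

# Print the notBefore/notAfter dates, which are derived from the machine clock.
openssl x509 -in "$workdir/my-cert.pem" -noout -dates

# -checkend 0 exits with status 0 only if the certificate has not expired yet.
if openssl x509 -in "$workdir/my-cert.pem" -noout -checkend 0 >/dev/null; then
    cert_status=valid
else
    cert_status=expired
fi
echo "certificate is $cert_status"
```

Note that -checkend only looks at the expiry date; a notBefore date in the future (clock running ahead) has to be spotted by reading the printed dates.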

Setting up Amazon EC2-Tools

  • Download and unpack ec2-tools;
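  • Edit your ~/.profile to export all the variables needed by ec2-tools, so you don't have to do it every time you open a prompt:
    • Here is an example of what should be appended to the ~/.profile file (replace the * in EC2_HOME with the version of the directory you unpacked, since a wildcard is not reliably expanded in an assignment):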
      • export JAVA_HOME=/usr/lib/jvm/java-6-sun
      • export EC2_HOME=~/ec2-api-tools-*
      • export PATH=$PATH:$EC2_HOME/bin
      • export EC2_CERT=~/.ec2/my-cert.pem
    • To access an instance you need to be authenticated (for obvious security reasons), so you have to create a key pair (public and private keys):
      • At https://console.aws.amazon.com/ec2/home, click on "Key Pairs", or
      • You can run the following commands:
      • $ ec2-add-keypair my-keypair | grep -v KEYPAIR > ~/.ec2/id_rsa-keypair
        $ chmod 600 ~/.ec2/id_rsa-keypair
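
Before moving on, it can help to confirm that the setup above is complete. The following is a minimal sketch of such a check; check_ec2_setup is a hypothetical helper (it is not part of ec2-tools), and it only looks at the variables and the key pair file named in the steps above.

```shell
#!/bin/sh
# check_ec2_setup is a hypothetical helper, not part of ec2-tools.
# Usage: check_ec2_setup [key-pair-file]
# Returns 0 if JAVA_HOME, EC2_HOME and EC2_CERT each name an existing path
# and the key pair file is readable and writable by its owner only (mode 600).
check_ec2_setup() {
    keyfile=${1:-$HOME/.ec2/id_rsa-keypair}
    status=0

    for name in JAVA_HOME EC2_HOME EC2_CERT; do
        # Indirect lookup: read the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -e "$value" ]; then
            echo "$name points to '$value', which does not exist"
            status=1
        fi
    done

    if [ ! -f "$keyfile" ]; then
        echo "key pair file '$keyfile' not found"
        status=1
    else
        # The first 10 characters of ls -l are the file type and permission bits.
        mode=$(ls -l "$keyfile" | cut -c1-10)
        if [ "$mode" != "-rw-------" ]; then
            echo "key pair file has mode $mode, expected -rw------- (chmod 600)"
            status=1
        fi
    fi

    return $status
}
```

After sourcing ~/.profile, running check_ec2_setup with no arguments checks ~/.ec2/id_rsa-keypair. A literal * left in EC2_HOME is reported as a path that does not exist. ec2-tools also read the private key location from EC2_PRIVATE_KEY; if you export it, add it to the loop.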

Setting up Hadoop


After downloading and unpacking Hadoop, you have to edit the EC2 configuration script at src/contrib/ec2/bin/hadoop-ec2-env.sh.
  • AWS variables
    • These variables are related to your AWS account (AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY); you can find them by logging in to your account and opening Security Credentials;
      • The AWS_ACCOUNT_ID is your 12-digit account number.
  • Security variables
    • The security variables (EC2_KEYDIR, KEY_NAME, PRIVATE_KEY_PATH) are the ones related to launching and accessing an EC2 instance;
    • You have to save the private key in your EC2_KEYDIR path.
  • Select an AMI
    • Depending on the Hadoop version you want to run (HADOOP_VERSION) and the instance type (INSTANCE_TYPE), you should use a suitable image to deploy your instance:
    • There are many public AMIs that you can use (they should suit the needs of most users); to list them, type:
    • $ ec2-describe-images -x all | grep hadoop
    • Or you can build your own image and upload it to an Amazon S3 bucket;
    • After selecting the AMI you will use, there are basically three variables to edit in hadoop-ec2-env.sh:
      • S3_BUCKET: the bucket where the image you will use is stored, for example hadoop-images;
      • ARCH: the architecture of the AMI you have chosen (i386 or x86_64); and
      • BASE_AMI_IMAGE: the unique ID that identifies an AMI, for example ami-2b5fba42.
    • Another configurable variable is JAVA_VERSION, which defines the Java version that will be installed on the instance:
      • You can also provide a URL where the binary is located (JAVA_BINARY_URL); for instance, if you have JAVA_VERSION=1.6.0_29, one option is JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin.
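
Put together, the edited part of hadoop-ec2-env.sh might look roughly like the fragment below. This is only a sketch: the account number and keys are placeholders, the HADOOP_VERSION and INSTANCE_TYPE values are assumptions chosen for illustration, and the remaining values are the examples used in this tutorial.

```shell
# Sketch of the variables to edit in src/contrib/ec2/bin/hadoop-ec2-env.sh.

# AWS variables (from Security Credentials); all three values are placeholders.
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY

# Security variables: where the key pair created earlier was saved.
EC2_KEYDIR="$HOME/.ec2"
KEY_NAME=my-keypair
PRIVATE_KEY_PATH="$EC2_KEYDIR/id_rsa-keypair"

# Image selection; HADOOP_VERSION and INSTANCE_TYPE are assumed example values.
HADOOP_VERSION=0.20.2
INSTANCE_TYPE=m1.small
S3_BUCKET=hadoop-images
ARCH=i386
BASE_AMI_IMAGE=ami-2b5fba42

# Java to install on the instances.
JAVA_VERSION=1.6.0_29
JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin
```

ARCH has to match the AMI you picked, and the i586 Java binary above only makes sense together with an i386 image.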

Running!

  • You can add src/contrib/ec2/bin to your PATH variable so that you can run the commands regardless of where the prompt is open;
  • To launch an EC2 cluster and start Hadoop, use the following command. The arguments are the cluster name (hadoop-test) and the number of slaves (2). When the cluster boots, the public DNS name will be printed to the console.
  • $ hadoop-ec2 launch-cluster hadoop-test 2
  • To log in to the master node of your cluster, type:
  • $ hadoop-ec2 login hadoop-test
  • Once you are logged into the master node you will be able to start the job:
    • For example, to test your cluster, you can run a pi calculation that is already provided by the hadoop*-examples.jar:
    • # cd /usr/local/hadoop-*
      # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  • You can check your job's progress at http://MASTER_HOST:50030/, where MASTER_HOST is the host name returned after the cluster started.
  • After your job has finished, the cluster remains alive. To shut it down, use the following command:
  • $ hadoop-ec2 terminate-cluster hadoop-test
  • Remember that Amazon EC2 instances are charged per hour, so if you are only running tests, you can keep playing with the cluster for a few more minutes within the hour already billed.
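
To make the per-hour charging concrete, here is a small sketch that estimates how many instance-hours a cluster consumes. billed_instance_hours is a hypothetical helper; it assumes one master plus the slaves you asked for, and that every started hour is billed in full for each instance.

```shell
#!/bin/sh
# billed_instance_hours is a hypothetical helper for estimating usage.
# Usage: billed_instance_hours <minutes-running> <number-of-slaves>
# Assumes one master plus the slaves, and that every started hour is billed in full.
billed_instance_hours() {
    minutes=$1
    slaves=$2
    instances=$((slaves + 1))
    # Round the running time up to whole hours: ceil(minutes / 60).
    hours=$(( (minutes + 59) / 60 ))
    echo $((instances * hours))
}

billed_instance_hours 25 2   # hadoop-test (2 slaves) running for 25 minutes
billed_instance_hours 61 2   # the same cluster, one minute into the second hour
```

Under these assumptions the two calls print 3 and 6: terminating at minute 25 or at minute 59 costs the same, while minute 61 doubles the bill.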