Wednesday, March 14, 2012

Hadoop + Amazon EC2 - An updated tutorial

There is an old tutorial on Hadoop's wiki at http://wiki.apache.org/hadoop/AmazonEC2, but when I had to follow it recently I noticed that it doesn't cover some newer Amazon functionality.

To follow this tutorial, you should already be familiar with the basics of Hadoop; a very useful getting-started tutorial can be found on Hadoop's homepage: http://hadoop.apache.org/. You should also be familiar with at least how Amazon EC2 works and how its instance types are defined.

When you register an account at Amazon AWS you receive 750 hours to run t1.micro instances but, unfortunately, you can't successfully run Hadoop on such machines.

In the following steps, a command that starts with $ should be executed on the local machine, and one that starts with # on the EC2 instance.

Create an X.509 Certificate

Since we are going to use ec2-tools, our AWS account needs a valid X.509 certificate:
  • Create the .ec2 folder:
  • $ mkdir ~/.ec2
  • Log in to AWS
    • Select "Security Credentials" and, under "Access Credentials", click on "X.509 Certificates";
    • You have two options:
      • Create the certificate using the command line:
      • $ cd ~/.ec2; openssl genrsa -des3 -out my-pk.pem 2048
        $ openssl rsa -in my-pk.pem -out my-pk-unencrypt.pem
        $ openssl req -new -x509 -key my-pk.pem -out my-cert.pem -days 1095
        • This only works if your machine's date is set correctly.
      • Create the certificate using the site and download the private key (remember to put it in ~/.ec2).
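
If you created the certificate on the command line, a quick sanity check can save some debugging later. The sketch below is only an illustration: the helper names cert_dates and cert_matches_key are hypothetical, and the file names are the ones used in the steps above. It prints the validity window of the certificate (useful when the machine date was wrong) and compares the modulus of the certificate with the modulus of the private key.

```shell
# Sketch only: helper names are hypothetical, file names follow the steps above.

# Print the validity window (notBefore/notAfter) of a certificate.
cert_dates() {
    openssl x509 -noout -dates -in "$1"
}

# Succeed only if the certificate ($1) was generated from the private key ($2).
# Pass the unencrypted key so that no passphrase prompt appears.
cert_matches_key() {
    cert_mod=$(openssl x509 -noout -modulus -in "$1") || return 1
    key_mod=$(openssl rsa -noout -modulus -in "$2") || return 1
    [ "$cert_mod" = "$key_mod" ]
}

# Example (run on the local machine):
#   cert_dates ~/.ec2/my-cert.pem
#   cert_matches_key ~/.ec2/my-cert.pem ~/.ec2/my-pk-unencrypt.pem && echo "certificate and key match"
```

If the dates look wrong, fix the machine clock and generate the certificate again.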

Setting up Amazon EC2-Tools

  • Download and unpack ec2-tools;
  • Edit your ~/.profile to export all the variables needed by ec2-tools, so you don't have to do it every time you open a prompt:
    • Here is an example of what should be appended to the ~/.profile file (replace the * in EC2_HOME with the version you unpacked):
      • export JAVA_HOME=/usr/lib/jvm/java-6-sun
      • export EC2_HOME=~/ec2-api-tools-*
      • export PATH=$PATH:$EC2_HOME/bin
      • export EC2_CERT=~/.ec2/my-cert.pem
    • To access an instance you need to be authenticated (for obvious security reasons), so you have to create a key pair (public and private keys):
      • At https://console.aws.amazon.com/ec2/home, click on "Key Pairs", or
      • Run the following commands:
      • $ ec2-add-keypair my-keypair | grep -v KEYPAIR > ~/.ec2/id_rsa-keypair
        $ chmod 600 ~/.ec2/id_rsa-keypair
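
Before moving on, it can help to confirm that the environment is really in place. The following is a minimal sketch, assuming the variables from the ~/.profile example above; check_ec2_env is a hypothetical helper and not part of ec2-tools. It checks that the variables point to something that exists and that the key pair file is private to its owner.

```shell
# Sketch only: check_ec2_env is a hypothetical helper, not part of ec2-tools.
# Usage: check_ec2_env PATH_TO_KEYPAIR_FILE
check_ec2_env() {
    rc=0
    # Variables exported from ~/.profile in the example above.
    [ -n "$JAVA_HOME" ] || { echo "JAVA_HOME is not set" >&2; rc=1; }
    [ -d "$EC2_HOME/bin" ] || { echo "no bin directory under EC2_HOME: $EC2_HOME" >&2; rc=1; }
    [ -r "$EC2_CERT" ] || { echo "EC2_CERT is not readable: $EC2_CERT" >&2; rc=1; }
    # The key pair file must exist and grant nothing to group or others
    # (characters 5-10 of the ls mode string, "------" after chmod 600).
    if [ ! -f "$1" ]; then
        echo "key pair file not found: $1" >&2; rc=1
    elif [ "$(ls -ld "$1" | cut -c5-10)" != "------" ]; then
        echo "key pair file is accessible by group or others: $1" >&2; rc=1
    fi
    return $rc
}

# Example (run on the local machine):
#   check_ec2_env ~/.ec2/id_rsa-keypair && echo "environment looks fine"
```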

Setting up Hadoop

After downloading and unpacking Hadoop, you have to edit the EC2 configuration script at src/contrib/ec2/bin/hadoop-ec2-env.sh.
  • AWS variables
    • These variables are related to your AWS account (AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY); you can find them by logging in to your account and opening "Security Credentials";
      • The AWS_ACCOUNT_ID is your 12-digit account number.
  • Security variables
    • The security variables (EC2_KEYDIR, KEY_NAME, PRIVATE_KEY_PATH) are the ones related to launching and accessing an EC2 instance;
    • You have to save the private key in your EC2_KEYDIR path.
  • Select an AMI
    • Depending on the Hadoop version that you want to run (HADOOP_VERSION) and the instance type (INSTANCE_TYPE), you should use a suitable image to deploy your instance:
    • There are many public AMI images that you can use (they should suit the needs of most users); to list them, type
    • $ ec2-describe-images -x all | grep hadoop
    • Or you can build your own image and upload it to an Amazon S3 bucket;
    • After selecting the AMI you will use, there are basically three variables to edit in hadoop-ec2-env.sh:
      • S3_BUCKET: the bucket where the image that you will use is stored, for example hadoop-images,
      • ARCH: the architecture of the AMI image you have chosen (i386 or x86_64) and
      • BASE_AMI_IMAGE: the unique ID that identifies an AMI image, for example ami-2b5fba42.
    • Another configurable variable is JAVA_VERSION, which defines the Java version that will be installed on the instance:
      • You can also provide a link to the binary (JAVA_BINARY_URL); for instance, if you have JAVA_VERSION=1.6.0_29, one option is JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin.
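
Putting the variables above together, the edited part of hadoop-ec2-env.sh could look roughly like the sketch below. Treat it as an illustration only: the account number and keys are placeholders, HADOOP_VERSION and INSTANCE_TYPE are assumed example values, and the key names and paths assume the key pair created earlier in this tutorial. The bucket, AMI ID and Java values are the examples mentioned above.

```shell
# Sketch of the edited part of src/contrib/ec2/bin/hadoop-ec2-env.sh.
# Every value is a placeholder or an assumption; replace it with your own.

# AWS variables (from "Security Credentials")
AWS_ACCOUNT_ID=123456789012                    # placeholder 12-digit account number
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID           # placeholder
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY   # placeholder

# Security variables (assume the key pair created earlier)
EC2_KEYDIR="$HOME/.ec2"
KEY_NAME=my-keypair
PRIVATE_KEY_PATH="$EC2_KEYDIR/id_rsa-keypair"

# Hadoop version and instance type (assumed example values)
HADOOP_VERSION=0.20.2
INSTANCE_TYPE=m1.small

# AMI selection (example values from the list above)
S3_BUCKET=hadoop-images
ARCH=i386
BASE_AMI_IMAGE=ami-2b5fba42

# Java installed on the instance
JAVA_VERSION=1.6.0_29
JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u29-b11/jdk-6u29-linux-i586.bin
```

Whatever values you choose, ARCH has to agree with both the AMI and the Java binary: the example URL above is the i586 build, so it only fits an i386 image.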


  • You can add src/contrib/ec2/bin to your PATH variable so that you can run the commands regardless of the directory your prompt is in;
  • To launch an EC2 cluster and start Hadoop, use the following command. The arguments are the cluster name (hadoop-test) and the number of slaves (2). When the cluster boots, the public DNS name will be printed to the console.
  • $ hadoop-ec2 launch-cluster hadoop-test 2
  • To log in to the master node of your cluster, type:
  • $ hadoop-ec2 login hadoop-test
  • Once you are logged in to the master node you will be able to start a job:
    • For example, to test your cluster, you can run the pi calculation that is already provided by hadoop-*-examples.jar:
    • # cd /usr/local/hadoop-*
      # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  • You can check your job's progress at http://MASTER_HOST:50030/, where MASTER_HOST is the host name printed after the cluster started.
  • After your job has finished, the cluster remains alive. To shut it down, use the following command:
  • $ hadoop-ec2 terminate-cluster hadoop-test
  • Remember that Amazon EC2 instances are charged per hour, so if you only want to run tests, you can keep playing with the cluster for a few more minutes until the hour you have already paid for is used up.
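
To make the hourly billing concrete, here is a small sketch of the arithmetic. It assumes that every started hour is charged in full for each instance and that the cluster consists of one master plus the number of slaves passed to launch-cluster; billed_instance_hours is a hypothetical helper, and actual prices depend on the instance type.

```shell
# Sketch only: billed_instance_hours is a hypothetical helper.
# Usage: billed_instance_hours MINUTES_RUNNING NUMBER_OF_SLAVES
billed_instance_hours() {
    minutes=$1
    slaves=$2
    hours=$(( (minutes + 59) / 60 ))    # every started hour counts as a full hour
    echo $(( hours * (slaves + 1) ))    # the slaves plus one master
}

# Example: the hadoop-test cluster above (2 slaves) running for 25 minutes
#   billed_instance_hours 25 2
```

Under these assumptions, a 25-minute test and a 60-minute test on the hadoop-test cluster both come to 3 instance-hours, while 61 minutes comes to 6.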
