Saturday 22 November 2014

Introduction to Data Science Part 5: Map Reduce Setup

Parag Ray
29-Sep-2014

Introduction

Welcome, readers!

This is the fifth part of this series of articles. Here we actually set up MapReduce on a multi-node cluster and get some hands-on experience with it.

If you have not gone through it already, you should go back to the fourth part, which covers the multi-node setup, as this post depends on it.

Please note the related readings and target audience sections; they will help you follow the blog better.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate-level discussion of Hadoop and related tools.
  • Best suited for readers looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
You may also like to look at the Cloudera home page and the Hadoop home page for further details.

MR configuration

We shall configure the master first.
Move to the $HADOOP_HOME directory and then to the conf directory under it.
All MapReduce and HDFS configuration files are located there.
Add the following entries to mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>PRUBNode1:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/parag/work/mrhome/localdir</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/parag/work/mrhome/sysdir</value>
    </property>
</configuration>
 
Please note: PRUBNode1 is the hostname of my master node; change it to the hostname of your master node. The mapred.job.tracker property tells every node where the JobTracker runs. mapred.local.dir and mapred.system.dir should be given generic directory paths that are kept the same across all nodes (once we move to multi-node). The directories have to be created separately, as they will not be created by Hadoop.
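
For example, with the paths used above, the directories can be created on each node as follows (adjust the paths to match your own configuration):

$ mkdir -p /home/parag/work/mrhome/localdir
$ mkdir -p /home/parag/work/mrhome/sysdir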

The same configuration change is needed on the slave node(s) as well.
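
One way to propagate the change is to copy the file from the master; a sketch, assuming a slave named PRUBNode2 (change to your slave's hostname) and the same Hadoop installation path on both machines:

$ scp ~/work/hadoop-1.2.1/conf/mapred-site.xml parag@PRUBNode2:~/work/hadoop-1.2.1/conf/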

Start the Hadoop daemons from the master node by issuing the following command from the $HADOOP_HOME directory:
$ bin/start-all.sh


As discussed in the first few blogs in this series, the JobTracker is the master daemon for MapReduce, and it should be visible now if you issue the following command:

$ jps

Sample output from the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

The JobTracker also has a built-in UI for basic monitoring at:
http://<<hostname>>:50030/  for example, in my case it is http://PRUBNode1:50030
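
If you are working from a terminal only, you can still confirm that the UI is being served; a minimal check, assuming curl is installed:

$ curl -sL http://PRUBNode1:50030/ | head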


Test the MR installation

Test 1: issue the jps command; you should see output as in the previous section. The master will show all the daemons, while a slave will show only the DataNode and TaskTracker.

Test 2:
Go to the admin UI (http://<<hostname>>:50030) and check whether all nodes are visible.
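
The active TaskTrackers can also be listed from the command line; a sketch using the job client shipped with Hadoop 1.x, issued from $HADOOP_HOME:

$ bin/hadoop job -list-active-trackers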


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
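While the job runs, its progress shows up in the JobTracker UI; you can also list running jobs from the shell, for example:
$ bin/hadoop job -list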
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
 


It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making these blogs better.