Saturday 22 November 2014

Introduction to Data Science Part 5: Map Reduce Setup

Parag Ray
29-Sep-2014

Introduction

Welcome, readers!

This is the fifth part of this series of articles. Here we actually set up MapReduce on a multi-node cluster and get some hands-on experience with it.

If you have not gone through it already, you should go back to the fourth part, which covers the multi-node setup, as this post depends on it.

Please note the related readings and target audience sections; they will help you follow the blog better.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate-level discussion of Hadoop and related tools.
  • Best suited for readers looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
You may also like to look at the Cloudera home page and the Hadoop home page for further details.

MR configuration

We shall configure the master first.
Move to the $HADOOP_HOME directory and then to the conf directory under it.
All MapReduce and HDFS configuration files are located there.
Add the following entries to mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>PRUBNode1:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/parag/work/mrhome/localdir</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/parag/work/mrhome/sysdir</value>
    </property>
</configuration>
 
Please note: PRUBNode1 is the hostname of my master node; change it to the hostname of your master node. The mapred.job.tracker property tells every node where the JobTracker runs. mapred.local.dir and mapred.system.dir should be given generic directory paths that are kept the same across all nodes (once we move to multi-node). The directories have to be created separately, as they will not be created by Hadoop.
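
For example, with the paths used above, the directories can be created on each node as follows (adjust the paths to match your own configuration):

$ mkdir -p /home/parag/work/mrhome/localdir
$ mkdir -p /home/parag/work/mrhome/sysdir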

The same configuration change is needed on the slave node(s) as well.
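
One way to propagate the change is to copy the file from the master; a sketch, assuming a slave named PRUBNode2 (change to your slave's hostname) and the same Hadoop installation path on both machines:

$ scp ~/work/hadoop-1.2.1/conf/mapred-site.xml parag@PRUBNode2:~/work/hadoop-1.2.1/conf/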

Start the Hadoop daemons from the master node by issuing the following command from the $HADOOP_HOME directory:
$ bin/start-all.sh


As discussed in the first few blogs in this series, the JobTracker is the master daemon for MapReduce, and it should be visible now if you issue the following command:

$ jps

Sample output from the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

The JobTracker also has a built-in UI for basic monitoring at:
http://<<hostname>>:50030/  for example, in my case it is http://PRUBNode1:50030
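
If you are working from a terminal only, you can still confirm that the UI is being served; a minimal check, assuming curl is installed:

$ curl -sL http://PRUBNode1:50030/ | head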


Test the MR installation

Test 1: issue the jps command; you should see output as in the previous section. The master will show all the daemons, while a slave will show only the DataNode and TaskTracker.

Test 2:
Go to the admin UI (http://<<hostname>>:50030) and check whether all nodes are visible.
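
The active TaskTrackers can also be listed from the command line; a sketch using the job client shipped with Hadoop 1.x, issued from $HADOOP_HOME:

$ bin/hadoop job -list-active-trackers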


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
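While the job runs, its progress shows up in the JobTracker UI; you can also list running jobs from the shell, for example:
$ bin/hadoop job -list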
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
 


It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making these blogs better.