
Saturday, 22 November 2014

Introduction to Data Science Part 5: MapReduce Setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fifth part of the series. Here we actually set up MapReduce on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the fourth part (the multi-node HDFS setup), as this post depends on that setup.

Please see the related readings and target audience sections below for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate-level discussion of Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
You may also like to look at the Cloudera home page & the Hadoop home page for further details.

MR configuration

We shall configure the master node first.
Move to the $HADOOP_HOME folder and then to the conf folder under it.
All MapReduce and HDFS configuration files are located there.
Add the following entries to mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>PRUBNode1:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/parag/work/mrhome/localdir</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/parag/work/mrhome/sysdir</value>
    </property>
</configuration>
 
Please note: PRUBNode1 is the hostname of my master node; change it to the hostname of your master node.
mapred.local.dir and mapred.system.dir should be given generic directory paths that are the same (once we move to multi-node) on all nodes. The directories need to be created separately, as they will not be created by Hadoop.
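For example, the two directories used above can be created on each node with a single command (a minimal sketch; adjust the paths to your own layout):

$ mkdir -p /home/parag/work/mrhome/localdir /home/parag/work/mrhome/sysdir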

The same configuration change is also needed on the slave node.

Start the Hadoop daemons from the master node by issuing the following command from the $HADOOP_HOME folder:
$ bin/start-all.sh


As discussed in the first few blogs in this series, the JobTracker is the master daemon for MapReduce, and it should be visible now if you issue the following command:

jps

Sample output from the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

The JobTracker also has a built-in UI for basic monitoring at
http://<<hostname>>:50030/  For example, in my case it is http://PRUBNode1:50030
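If a browser is not handy on the node, a quick check that the UI is responding can be done from the command line (a rough sketch; curl is assumed to be installed):

$ curl -s -o /dev/null -w "%{http_code}\n" http://PRUBNode1:50030/

A response of 200 indicates that the JobTracker web UI is serving pages.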


Test the MR installation

Test 1: Issue the jps command; you should see output as in the previous section. The master will show all the daemons listed above, while the slave will show only DataNode and TaskTracker.
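Since passwordless SSH is already in place (see Part 4), the slave can be checked from the master without logging in interactively (a sketch; it assumes jps is on the hadoop user's default PATH on the slave):

$ ssh PRUBNode2 jps

The slave should list only DataNode and TaskTracker (plus Jps itself).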

Test 2:
Go to the admin UI (http://<<hostname>>:50030) and check whether all the nodes are visible.


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
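As an aside, while the example job is running its progress can also be checked from the command line with the standard Hadoop 1.x job client:

$ bin/hadoop job -list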
 

Thursday, 20 November 2014

Introduction to Data Science Part 4: HDFS Multi-node Setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fourth part of the series. Here we actually set up HDFS on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the third part, the single-node setup (see related readings), as this post depends on the single-node setup.

Please see the related readings and target audience sections below for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.


Agenda
  • Target audience
  • Related readings/other blogs
  • Multi-node cluster setup steps
Target audience

  • This is an intermediate-level discussion of Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please refer to Part 3 (the single-node setup) first. You may also like to look at the Cloudera home page & the Hadoop home page for further details.

Multi-node cluster setup steps

A multi-node setup requires that the single-node setup has been completed on each of the nodes. This can be done any time before step 3 (HDFS configuration) below.

  1. Network connect
     Connect the underlying network
  2. SSH connectivity
     Set up the SSH connections
  3. HDFS configuration for multi-node. 
Network connect

A home network is demonstrated here. The setup is academic and not production quality. A full-blown setup would comprise all the components, including segment A to B below, but we are covering only segment B here, with two nodes: one master and one slave. An IPv4 network is shown here.

IP address setup:
Navigate to System Settings > Network > Wired connection and turn the wired network on.

Set the IPv4 method to 'Manual', provide an IP 'address', and use the same 'netmask' for all PCs in the network. Further network configuration details may be under the 'Options' button.
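Alternatively, on Ubuntu 14.04 the same static address can be configured in /etc/network/interfaces (a minimal sketch; the interface name eth0 and the addresses shown are assumptions, so adjust them to your network and restart networking afterwards):

# /etc/network/interfaces (example for the master node)
auto eth0
iface eth0 inet static
    address 10.192.0.1
    netmask 255.255.255.0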


In case you are using another Linux distribution, the setup may be different; here is one example for Oracle Linux.

SSH connectivity

SSH is used for connectivity across the nodes.
Connectivity is needed between every pair of nodes where data communication is expected. Two-way connectivity between the master and the slave node is a must.
Self-connectivity (each node to itself) is also a must.

[Diagram: master and slave nodes connected by two-way SSH; NN indicates a NameNode and DN indicates a DataNode.]
  • The master and slaves talk via SSH connectivity.
  • We need SSH without having to provide credentials for each request.
  • Passwordless SSH is not ideal from a security standpoint, but providing credentials for every Hadoop transaction is not practical.
  • There are other ways, but for the purpose of this presentation we shall stick to this setup.
  • Both the master and slave nodes must open SSH connectivity to each other and to themselves.
Now for the detailed steps:

  • It is assumed that all of these steps are performed as the hadoop user. Using sudo to switch to the respective user is also an option.
  • First, from the master node, check that ssh to localhost works without a passphrase:
ssh localhost
  • If a passphrase is requested, execute the following command:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
(The part -f ~/.ssh/id_rsa is not mandatory, as that is the default location; the system will ask for the location if it is not provided.)
  • Copy the public key to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • On the master node, try ssh to localhost again. If a passphrase is still requested, do the following:
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chown <<hadoop user>> $HOME/.ssh/authorized_keys

It is a good idea to check that both of these now work:
ssh localhost and also ssh <<nodename>>
  • If executing ssh localhost returns an error message indicating that the SSH server is not installed, you can try the following:
    • Install the SSH server with:
sudo apt-get install openssh-server

  • Then repeat the steps, starting from ssh localhost.
  • Now the public key needs to be copied to every slave machine (only one here). SSH must be installed on all slave machines. We strongly recommend a working understanding of SSH.
scp ~/.ssh/id_rsa.pub <<slave host name>>:~/.ssh/<<master host name>>.pub   (/etc/hosts should be updated with slave and master host entries )
  • Now log in (ssh <<slave host name>>) to your slave machine. A password will still be needed.
  • While on the slave machine, issue the following command to append your master machine's hadoop user's public key to the slave machine's authorized key store:
cat ~/.ssh/<<master host name>>.pub >> ~/.ssh/authorized_keys
  • Issue exit to come out of the slave.
  • Now, from the master node, try to ssh to the slave:
ssh <<slave host name>>
  • If a passphrase is still requested, log in to the slave with the password and do the following:
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys 
chown  <<hadoop user>> $HOME/.ssh/authorized_keys
  • Exit and log in again from your master node:
ssh <<slave host name>>
  • If it still asks for a password, log in using the password, log out, and issue
ssh-add
  • Repeat these steps for all Hadoop cluster nodes, which will now point back to the master.
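Once this is done, a quick way to confirm passwordless SSH from the master to every node is a small loop like the one below (a sketch; the hostnames are the ones used in this series):

for host in PRUBNode1 PRUBNode2; do
    ssh "$host" hostname
done

Each hostname should print without any password prompt.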
HDFS configuration for multi-node

/etc/hosts
The /etc/hosts file should contain all the host names:
parag@PRUBNode1:/etc$ cat hosts
10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1    localhost
 

One problem can arise out of the hosts file.
To detect it, run sudo netstat -ntlp on the master. If it shows:
tcp6 0 0 127.0.0.1:9020 :::* LISTEN 32646/java
the 127.0.0.1 means that the NameNode is only listening for connections on port 9020 from localhost; connections on port 9020 from outside cannot be received.


This can be resolved by avoiding entries that map localhost to both the actual IP and 127.0.0.1.
The entry shown above will not have this problem, but the following will:


10.192.0.1    localhost
10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1     localhost


There are also likely to be problems if any hostnames are missed out.
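After correcting /etc/hosts and restarting the daemons, the same netstat check should show the NameNode bound to the machine's real address rather than the loopback (a rough illustration; the PID and exact column spacing will differ):

$ sudo netstat -ntlp | grep 9020
tcp        0      0 10.192.0.1:9020        0.0.0.0:*        LISTEN      32646/java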
slaves
The slaves file in the folder $HADOOP_HOME/conf should list all the node names on which a DataNode runs.
We have provided the listing below:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat slaves
PRUBNode1
PRUBNode2

masters
The masters file in the folder $HADOOP_HOME/conf should list the node names on which the SecondaryNameNode runs:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat masters
PRUBNode1

hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/hdf/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/hdf/data</value>
    </property>
</configuration>


All directories should use generic paths that are the same across all nodes.

core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/work/hdf/tmp</value>
    </property>
</configuration>

All directories should use generic paths that are the same across all nodes. We now specify the master's node name for fs.default.name instead of localhost, since we expect it to be accessed from other nodes.
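As with the MapReduce directories, the directories named above (dfs.name.dir, dfs.data.dir, and hadoop.tmp.dir) can be created up front on each node; a sketch using the paths from the listings (adjust to your own layout):

$ mkdir -p /home/parag/work/hdf/name /home/parag/work/hdf/data /home/parag/work/hdf/tmp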

Starting HDFS

On the node hosting the NameNode, issue the command
./start-dfs.sh
from the $HADOOP_HOME/bin folder.

It should start the NameNode, DataNode, and SecondaryNameNode on the master, and a DataNode on the slave.

If this is the first time, the NameNode has to be formatted before starting HDFS.
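The format is done with the standard Hadoop 1.x command below, issued from the $HADOOP_HOME folder on the master; run it only once, since it wipes any existing HDFS metadata:

$ bin/hadoop namenode -format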

Test set up

Test 1: Issue the jps command on the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9700 SecondaryNameNode
9357 NameNode
10252 Jps
 


Issuing the jps command on the slave will show only DataNode.

Test 2: Open the administrative console at http://<<hostname of master>>:50070
For me it is http://PRUBNode1:50070