
Saturday, 22 November 2014

Introduction to Data science Part 5: MapReduce set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fifth part of the series of articles. Here we set up MapReduce on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the fourth part (the multi-node HDFS set up), as this post depends on that set up.

Please see the related readings and target audience sections for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
You may also like to look at the Cloudera home page and the Hadoop home page for further details.

MR configuration

We shall configure the master first.
Move to the $HADOOP_HOME folder and then to the conf folder under it.
All MapReduce and HDFS configuration files are located here.
Add the following entries to mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>PRUBNode1:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/parag/work/mrhome/localdir</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/parag/work/mrhome/sysdir</value>
    </property>
</configuration>
 
Please note: PRUBNode1 is the hostname of my master node; change it to the hostname of your master node.
mapred.local.dir and mapred.system.dir should be given generic directory paths which have to be the same
(once we move to multi-node) on all nodes. The directories need to be created separately,
as they will not be created by Hadoop.

The same configuration change is needed on the slave as well.
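The directories from the configuration above can be created up front on both machines. This is a small sketch using the paths from my configuration; substitute your own paths if they differ:

mkdir -p /home/parag/work/mrhome/localdir
mkdir -p /home/parag/work/mrhome/sysdir

Run this on the master and on the slave, keeping the paths identical on every node.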

Start the Hadoop daemons from the master node by issuing the following command from the $HADOOP_HOME folder:
$ bin/start-all.sh 


As discussed in the first few posts of this series, the JobTracker is the master for MapReduce; it should be visible now if you issue the following command:

jps

Sample output from the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

The JobTracker also has a built-in UI for basic monitoring at
http://<<hostname>>:50030/  ; for example, in my case it is http://PRUBNode1:50030


TEST MR installation

Test 1: issue the jps command and you should see output as in the previous section. The master will show all the daemons; the slave will show only the DataNode and TaskTracker.

Test 2:
go to the admin UI (http://<<hostname>>:50030) and check whether all nodes are visible.
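As an additional check (not strictly necessary), the JobTracker can also be queried from the command line with the standard job subcommand; from the $HADOOP_HOME folder:

$ bin/hadoop job -list
$ bin/hadoop job -list-active-trackers

The first lists running jobs (it will be empty until Test 3 below); the second should list one tracker entry per node named in the slaves file.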


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
 

Thursday, 20 November 2014

Introduction to Data science Part 4: HDFS multi-node set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fourth part of the series of articles. Here we set up HDFS on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the third part, the single-node set up (see related readings), as this post depends on that set up.

Please see the related readings and target audience sections for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.


Agenda
  • Target audience
  • Related readings/other blogs
  • Multi-node cluster setup steps
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please go through the single-node setup in Part 3 first. You may also like to look at the Cloudera home page and the Hadoop home page for further details.

Multi-node cluster setup steps

Multi-node set up requires that the single-node set up is done on each of the nodes. This can be done any time before step 3 (HDFS configuration) below.

  1. Network connection
     Connect the underlying network.
  2. SSH connectivity
     Set up ssh connections.
  3. HDFS configuration for multi-node.
Network connect

A home network is demonstrated here. The set up is academic and not production quality. A full-blown setup would comprise all components, including segments A to B below, but we are covering only segment B here, with two nodes, one master and one slave. An IPv4 network is shown.

IP address set up:
Navigate to System Settings > Network > Wired connection and turn the wired network on.

Set the IPv4 method to 'Manual', provide the IP address, and use the same netmask for all PCs in the network. The network configuration details may be under the 'Options' button.


If you are using another Linux distribution, the set up may be different. Here is one example for Oracle Linux.
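For reference, on a server-style Ubuntu install the same static address can be set in /etc/network/interfaces instead of the GUI. This is only a sketch under the assumption that the wired interface is named eth0 and that a /24 netmask fits your network; adjust accordingly:

# /etc/network/interfaces entry on the master (use 10.192.0.2 on the slave)
auto eth0
iface eth0 inet static
    address 10.192.0.1
    netmask 255.255.255.0

# apply the change
sudo ifdown eth0 && sudo ifup eth0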

SSH connectivity

SSH is used for connectivity across the nodes.
Passwordless connectivity is needed between any nodes where data communication is expected. Two-way connectivity between the master and slave nodes is a must.
Self connectivity (each node to itself) is also a must.









In the above diagram, NN indicates the name node and DN indicates the data node.
  • Master and slaves talk via ssh connectivity.
     
  • We need ssh that does not ask for credentials on each request.
     
  • Passwordless ssh is not ideal from a security standpoint, but providing
    credentials for every Hadoop transaction is not practical.
     
  • There are other ways, but for the purpose of this presentation we shall stick to this set up.
     
  • Both master and slave nodes must open ssh connectivity to each other and
    to themselves.
Now for the detailed steps.

  • It is assumed that for all these steps we are using the hadoop user. Sudo-ing to the respective user is also an option.
  • First from master node check that  ssh to the localhost  happens without a passphrase:
ssh localhost
  • If passphrase is requested, execute the following commands:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
(The part -f ~/.ssh/id_rsa is not mandatory as that is the default location and system will ask for this location if not provided)
  • Copy public key to authorized_keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • On master node try to ssh again to localhost and if passphrase is still requested, do the  following
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chown  <<hadoop user>> $HOME/.ssh/authorized_keys

It is a good idea to check that you can do
ssh localhost and also ssh <<nodename>>
  • If, after executing ssh localhost, an error message indicates that the ssh server is not installed, you may like to try the following:
    • Use the following command to install the ssh server:
sudo apt-get install openssh-server

  • Then repeat the steps, starting from ssh localhost.
  • Now the public key needs to be copied to all slave machines (only one here). SSH should be installed on all slave machines. We strongly recommend a working understanding of ssh.
scp ~/.ssh/id_rsa.pub <<slave host name>>:~/.ssh/<<master host name>>.pub   (/etc/hosts should be updated with slave and master host entries )
  • Now log in (ssh <<slave host name>>) to your slave machine. A password will still be needed.
  • While on your slave machine, issue the following command to append your master machine’s hadoop user’s public key to the slave machine’s authorized key store:
cat ~/.ssh/<<master host name>>.pub >> ~/.ssh/authorized_keys
  • Issue exit to come out of slave
  • Now, from the master node try to ssh to slave.
ssh <<slave host name>>
  • if passphrase is still requested, do the  following after  logging in with password to slave
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys 
chown  <<hadoop user>> $HOME/.ssh/authorized_keys
  • Exit and log in again from your master node.
ssh <<slave host name>>
  • If it still asks for a password, log in using the password, log out and issue
ssh-add
  • Repeat for all Hadoop cluster nodes, which now will point back to the master.
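If the ssh-copy-id utility is present (it ships with the OpenSSH client package on Ubuntu), the key generation and copying steps above can be condensed roughly as follows. This is only a sketch of the same idea, run as the hadoop user on the master:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa     # skip if the key already exists
ssh-copy-id localhost                        # self connectivity
ssh-copy-id <<slave host name>>              # appends the key to the slave's authorized_keys
ssh localhost exit                           # verify: should not ask for a password
ssh <<slave host name>> exit                 # verify: should not ask for a password

The same commands are then repeated on the slave, pointing back to the master.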
 HDFS configuration for multi-node.

/etc/hosts
This file should have all the host names:
parag@PRUBNode1:/etc$ cat hosts
10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1    localhost
 

One problem can arise from the hosts file.
To detect it, run sudo netstat -ntlp on the master. If it shows:
tcp6 0 0 127.0.0.1:9020 :::* LISTEN 32646/java
then the 127.0.0.1 means the process is listening on port 9020 only for connections from localhost; connections on 9020 from outside cannot be received.


This can be resolved by avoiding entries that map localhost to both the actual IP and 127.0.0.1.
The entries shown above do not have this problem, but the following do:


10.192.0.1     localhost
 10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1      localhost


There are also likely to be problems if hostnames are missing from the file.
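After correcting /etc/hosts, restart the daemons and repeat the netstat check; a healthy entry binds port 9020 to the node's real IP (or to all interfaces) rather than to 127.0.0.1. For example:

sudo netstat -ntlp | grep 9020
# expected to look something like:
# tcp6  0  0 10.192.0.1:9020  :::*  LISTEN  <pid>/java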
slaves
The slaves file in the folder $HADOOP_HOME/conf should have the names of all nodes that run a DataNode.
We have provided the listing below:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat slaves
PRUBNode1
PRUBNode2

masters
The masters file in the folder $HADOOP_HOME/conf should have the names of all nodes that run a SecondaryNameNode:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat masters
PRUBNode1

hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/hdf/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/hdf/data</value>
    </property>
</configuration>


All folders should be generic and the same across all nodes.

core-site.xml
<configuration>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9020</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/work/hdf/tmp</value>
    </property>

</configuration>

All folders should be generic and the same across all nodes. We now specify the master's node name for fs.default.name instead of localhost, as we expect it to be accessed from different nodes.

 Starting HDFS

On the node that hosts the NameNode, issue the command
./start-dfs.sh
from the $HADOOP_HOME/bin folder.

It should start the NameNode, DataNode and SecondaryNameNode on the master, and a DataNode on the slave.

If this is the first time, the NameNode has to be formatted before starting (see below).
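The format is done only once, from the $HADOOP_HOME/bin folder on the master (the same command is shown in the single-node post):

./hadoop namenode -format
./start-dfs.sh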

Test set up

Test 1: Issue command jps at master
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9700 SecondaryNameNode
9357 NameNode
10252 Jps
 


Issuing the jps command on the slave will show only the DataNode (plus Jps).

Test 2: Open administrative console at http://<<hostname of master>>:50070
for me it is http://PRUBNode1:50070



 

Introduction to Data science Part 3: HDFS single node setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the third part of the series of articles. Here we actually start setting up Hadoop and get hands-on experience with HDFS.
Please see the related readings and target audience sections for help in following this blog.
 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • HDFS setup
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please see the links section. You may also like to look at the Cloudera home page and the Hadoop home page for further details.

Hadoop Setup
  • Preconditions:
    Assuming Java 7 and Ubuntu Linux are installed, with at least 4 GB RAM and 50 GB disk space.
    If Ubuntu is a new installation, it is a good idea to update it with
sudo apt-get update
  • Also, if needed, install Java with
sudo apt-get install openjdk-7-jdk

  • Install Eclipse with
sudo apt-get install eclipse
  • Install ssh with
sudo apt-get install openssh-server
Note: these will require an Internet connection, and the firewall should allow connections to the Internet repositories.
  • Following are the steps to set up Hadoop:
    • Download hadoop-1.2.1-bin.tar.gz (a console sketch of the download and extraction is shown just below).
    • Open a console and use the command cd ~ to move to the home directory.
    • Create a folder work under the home directory (~/work).
    • Change directory to the new folder and extract hadoop-1.2.1-bin.tar.gz with the command
    tar -xvf hadoop-1.2.1-bin.tar.gz
    • This will create a hadoop-1.2.1 folder under the work folder; we shall call this the 'hadoop folder'.
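For reference, the download and extraction can be done from the console roughly as follows; the URL points to the Apache archive and the mirror path may differ for you:

cd ~
mkdir -p work
cd work
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
tar -xvf hadoop-1.2.1-bin.tar.gz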
  • Before you proceed to the next steps, be aware of the Java home folder. The following command may help:
which java
  • Go to home folder by cd ~
  • Issue the following command to open the profile file:
  gedit .bashrc
 
  • Assuming that Java is installed using the command mentioned above, add the following lines to .bashrc:
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

 
Note: the first and second lines could be different depending on where Java and Hadoop are installed.
If Java is in some other folder, that path has to be provided; the directory should be the parent folder of the bin folder.
  • To refresh the configuration, issue the command
     . .bashrc
Please note there is a space between the dot and .bashrc in the above command. People with Unix knowledge will find this information redundant.
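A quick way to confirm that the environment variables have taken effect (assuming the archive has already been extracted as above):

echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version       # should report Hadoop 1.2.1 once PATH picks up $HADOOP_HOME/bin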

  • Close gedit.
  • There will be a ‘conf’ folder under the hadoop folder; change directory to the conf folder to do the subsequent configuration tasks.
  • In hadoop-env.sh (use gedit hadoop-env.sh) add the line
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
  • In case of a multi-node cluster set up, we need to add, in the masters file, the node where the SecondaryNameNode will run, and in the slaves file all the nodes where DataNodes will run (governed by the hadoop commands issued on the nodes). Each node goes on one line.
  • The following should be noted:
    • Having a user specific to Hadoop is good practice; it is not shown here.
    • The folder structure should be the same across the cluster if a multi-node set up is used. It is better to use a generic folder structure anyway, so that it does not become a difficulty later.
  • All node names should be in /etc/hosts (use sudo gedit /etc/hosts):
127.0.0.1    localhost
127.0.0.1    PRUBNode1
  •  Edit core-site.xml to add the following,-
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9000</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/tmp</value>
    </property>
</configuration>
 
The above configuration is for the name node, so fs.default.name points to the NameNode URI, with the host name PRUBNode1 (this can be verified with the hostname command at the console). hadoop.tmp.dir is the temporary work area.
 
  • Edit hdfs-site.xml to add the following,-
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/dfs/data</value>
    </property>
</configuration>
A few tips:
For failure prevention, dfs.name.dir should be given more than one directory (one on NFS as well), as a comma-separated list.
The directory structure should be the same across all nodes; for example, dfs.data.dir will be needed on all nodes in the cluster, and the folder path should be the same.
  • cd to the bin folder of hadoop, issue ./hadoop and see the command options. Note that the MapReduce and HDFS commands come together.
 parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  distcp2 <srcurl> <desturl> DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
  • ONLY FOR THE FIRST TIME, WE NEED TO FORMAT the file system (remember what happens if you format your D: drive?):
./hadoop namenode -format


  • Start the Hadoop file system:
./start-dfs.sh
Issue the jps command to see all processes running:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
 
 
Monitor the Hadoop file system from http://<<hostname of nn>>:50070
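Besides the web console, the file system can be checked from the command line; both of the following are standard HDFS client commands (run from the bin folder):

./hadoop dfsadmin -report     # shows configured capacity and the live data nodes
./hadoop fs -ls /             # lists the root of the distributed file system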


  • Stop dfs with ./stop-dfs.sh

Monday, 10 November 2014

Introduction to Data science Part 2

Parag Ray
04-Sep-2014

Introduction

Welcome to the readers! 
This blog is an introduction to MapReduce. Please see the related readings and target audience sections for help in following this blog.


Agenda
  • Target audience.
  • Related readings/other blogs .
  • Map reduce motivation.
  • Features.
  • What Hadoop storage looks like along with Map reduce.
  • Basic Algorithms.
  • Intuition of the process.
  • Physical architecture.
Target Audience
  • This is an introductory discussion on data science, big data technologies and Hadoop.
  • Best suited for an audience looking for an introduction to this technology.
  • No prior knowledge is required, except for a basic understanding of networks and computing and a high-level understanding of enterprise application environments.
Related readings/other blogs  
This is the second part of this series of articles, and related readings are provided in the links. It will be helpful to go through Part 1 first.
You would also like to look at Cloudera home page & Hadoop home page for further details.

Map Reduce intuition 
  • If there are huge amounts of data, it may be a big challenge for a single computational infrastructure to process them.
  • Map reduce allows the entire computational task to be divided into smaller tasks (Map), whose results are then combined (Reduce) for the final result.
  • Maps and reducers can run on separate machines, allowing horizontal scalability.
  • The number of maps and reducers can be very high, allowing scalability.
  • Integrated and optimized for Hadoop, which allows distributed data storage.
  • Data is stored in chunks in Hadoop; instead of transporting data from one node to another and doing the processing, it is more efficient to do the processing locally and transmit the result.
    Approximate comparison of the traditional computation strategy, with one vertically scaled database node and one processor, to the Hadoop Map reduce strategy. Columns approximate the time scale.






Map reduce features at a glance



  • Master-slave architecture, composed mainly of a JobTracker (master) and TaskTrackers (slaves).
  • Failover support is provided by a heartbeat-based system.
  • Pluggable components, with a given set of interfaces, to accomplish various required functions like detailed validation and combination of results.
  • Network-optimized data access from Hadoop for any given topology.
  • Supports optimization features like partitioning.
  • Has various modes of running, like local and pseudo-distributed.

What Hadoop storage looks like along with Map reduce

The following diagram overlays the MR components on top of the HDFS components, as seen in the previous post, to highlight the integration.


  • Map reduce is implemented in close integration with Hadoop.
  • The JobTracker is the master, and TaskTrackers provide the handle to the distributed map tasks.
  • The JobTracker maintains heartbeat contact and restarts jobs if a job is non-responsive.
  • Data access from the Hadoop data nodes is optimized based on policy.
  • Tasks under the various task trackers are capable of data exchange.
Basic algorithm

  • Data blocks reside in HDFS and are read as input splits.
  • Map programs get the splits in a specific structure, such as key-value pairs, and also receive a context handle.
  • Maps run distributed and process the distributed data by accessing it locally (this provides the speed enhancement).
  • Maps have access to various data types, as shown.
  • The framework allows collating data into partitions, keeping similar keys together.
  • Merge and sort are done if a reduce is necessary.
  • Reducers are not compulsory, but are provided for final processing of the data produced by the maps.
  • Map output is transmitted over HTTP to the reduce location.
  • If no reduce is running, then the unsorted map output is written back to the HDFS output location.
  • If a reduce is running, then the reduced data is saved back to HDFS.
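To make the flow above concrete, here is a minimal runnable sketch using Hadoop Streaming, which lets ordinary shell commands act as the mapper and reducer. It assumes a running setup as in the later posts of this series (which appear above on this page), that the streaming jar sits at its usual Hadoop 1.2.1 location under contrib/streaming, and that some text files have already been copied into an HDFS directory named input; run it from $HADOOP_HOME:

# identity mapper: every input line becomes a key;
# the framework sorts the keys before the reduce, so 'uniq -c' counts duplicate lines
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input \
    -output streaming-out \
    -mapper cat \
    -reducer 'uniq -c'

bin/hadoop fs -cat streaming-out/*

The map step runs wherever the input blocks live, the sorted intermediate data is shipped to the reducer over HTTP, and the reduced counts are written back to HDFS, which is exactly the sequence described in the bullets above.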
 Physical architecture
 JT: Job tracker
TT: Task Tracker

  • The client requests a job via RPC to the JT.
  • The JT maintains heartbeats from the TTs to make sure each TT is up and running.
  • The TT has child processes executing the map and reduce tasks. Data access is localized.
  • There is an umbilical protocol to maintain communication between the child task process and the TT.