
Saturday, 22 November 2014

Introduction to Data science Part 5: MapReduce set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fifth part of the series of articles. Here we set up MapReduce on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the fourth part (the multi-node HDFS set up), as this post depends on that set up.

Please see the related readings and target audience sections for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
You may also like to look at the Cloudera home page and the Hadoop home page for further details.

MR configuration

We shall configure the master first.
Move to the $HADOOP_HOME folder and then to the conf folder under it.
All MapReduce and HDFS configuration files are located here.
Add the following entries to mapred-site.xml:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>PRUBNode1:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/parag/work/mrhome/localdir</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/parag/work/mrhome/sysdir</value>
    </property>
</configuration>
 
Please note: PRUBNode1 is the hostname of my master node; change it to the hostname of your master node.
mapred.local.dir and mapred.system.dir should be given generic directory paths which have to be the same
(once we move to multi-node) on all nodes. The directories need to be created separately,
as they will not be created by Hadoop.

The same configuration change is needed on the slave as well.
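The directories from the configuration above can be created up front on both machines. This is a small sketch using the paths from my configuration; substitute your own paths if they differ:

mkdir -p /home/parag/work/mrhome/localdir
mkdir -p /home/parag/work/mrhome/sysdir

Run this on the master and on the slave, keeping the paths identical on every node.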

Start the Hadoop daemons from the master node by issuing the following command from the $HADOOP_HOME folder:
$ bin/start-all.sh 


As discussed in the first few posts of this series, the JobTracker is the master for MapReduce; it should be visible now if you issue the following command:

jps

Sample output from the master:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

The JobTracker also has a built-in UI for basic monitoring at
http://<<hostname>>:50030/  ; for example, in my case it is http://PRUBNode1:50030


TEST MR installation

Test 1: issue the jps command and you should see output as in the previous section. The master will show all the daemons; the slave will show only the DataNode and TaskTracker.

Test 2:
go to the admin UI (http://<<hostname>>:50030) and check whether all nodes are visible.
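As an additional check (not strictly necessary), the JobTracker can also be queried from the command line with the standard job subcommand; from the $HADOOP_HOME folder:

$ bin/hadoop job -list
$ bin/hadoop job -list-active-trackers

The first lists running jobs (it will be empty until Test 3 below); the second should list one tracker entry per node named in the slaves file.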


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
 

Thursday, 20 November 2014

Introduction to Data science Part 4: HDFS multi-node set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fourth part of the series of articles. Here we set up HDFS on a multi-node cluster and get hands-on experience with it.

If you have not already done so, please go back to the third part, the single-node set up (see related readings), as this post depends on that set up.

Please see the related readings and target audience sections for help in following this blog.

 
We are assuming the operating system to be Ubuntu 14.04.


Agenda
  • Target audience
  • Related readings/other blogs
  • Multi-node cluster setup steps
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please go through the single-node setup in Part 3 first. You may also like to look at the Cloudera home page and the Hadoop home page for further details.

Multi-node cluster setup steps

Multi-node set up requires that the single-node set up is done on each of the nodes. This can be done any time before step 3 (HDFS configuration) below.

  1. Network connection
     Connect the underlying network.
  2. SSH connectivity
     Set up ssh connections.
  3. HDFS configuration for multi-node.
Network connect

A home network is demonstrated here. The set up is academic and not production quality. A full-blown setup would comprise all components, including segments A to B below, but we are covering only segment B here, with two nodes, one master and one slave. An IPv4 network is shown.

IP address set up:
Navigate to System Settings > Network > Wired connection and turn the wired network on.

Set the IPv4 method to 'Manual', provide the IP address, and use the same netmask for all PCs in the network. The network configuration details may be under the 'Options' button.


If you are using another Linux distribution, the set up may be different. Here is one example for Oracle Linux.
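For reference, on a server-style Ubuntu install the same static address can be set in /etc/network/interfaces instead of the GUI. This is only a sketch under the assumption that the wired interface is named eth0 and that a /24 netmask fits your network; adjust accordingly:

# /etc/network/interfaces entry on the master (use 10.192.0.2 on the slave)
auto eth0
iface eth0 inet static
    address 10.192.0.1
    netmask 255.255.255.0

# apply the change
sudo ifdown eth0 && sudo ifup eth0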

SSH connectivity

SSH is used for connectivity across the nodes.
Passwordless connectivity is needed between any nodes where data communication is expected. Two-way connectivity between the master and slave nodes is a must.
Self connectivity (each node to itself) is also a must.









In the above diagram, NN indicates the name node and DN indicates the data node.
  • Master and slaves talk via ssh connectivity.
     
  • We need ssh that does not ask for credentials on each request.
     
  • Passwordless ssh is not ideal from a security standpoint, but providing
    credentials for every Hadoop transaction is not practical.
     
  • There are other ways, but for the purpose of this presentation we shall stick to this set up.
     
  • Both master and slave nodes must open ssh connectivity to each other and
    to themselves.
Now for the detailed steps.

  • It is assumed that for all these steps we are using the hadoop user. Sudo-ing to the respective user is also an option.
  • First from master node check that  ssh to the localhost  happens without a passphrase:
ssh localhost
  • If passphrase is requested, execute the following commands:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
(The part -f ~/.ssh/id_rsa is not mandatory as that is the default location and system will ask for this location if not provided)
  • Copy public key to authorized_keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • On master node try to ssh again to localhost and if passphrase is still requested, do the  following
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chown  <<hadoop user>> $HOME/.ssh/authorized_keys

It is a good idea to check that you can do
ssh localhost and also ssh <<nodename>>
  • If, after executing ssh localhost, an error message indicates that the ssh server is not installed, you may like to try the following:
    • Use the following command to install the ssh server:
sudo apt-get install openssh-server

  • Then repeat the steps, starting from ssh localhost.
  • Now the public key needs to be copied to all slave machines (only one here). SSH should be installed on all slave machines. We strongly recommend a working understanding of ssh.
scp ~/.ssh/id_rsa.pub <<slave host name>>:~/.ssh/<<master host name>>.pub   (/etc/hosts should be updated with slave and master host entries )
  • Now log in (ssh <<slave host name>>) to your slave machine. A password will still be needed.
  • While on your slave machine, issue the following command to append your master machine’s hadoop user’s public key to the slave machine’s authorized key store:
cat ~/.ssh/<<master host name>>.pub >> ~/.ssh/authorized_keys
  • Issue exit to come out of slave
  • Now, from the master node try to ssh to slave.
ssh <<slave host name>>
  • if passphrase is still requested, do the  following after  logging in with password to slave
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys 
chown  <<hadoop user>> $HOME/.ssh/authorized_keys
  • Exit and log in again from your master node.
ssh <<slave host name>>
  • If it still asks for a password, log in using the password, log out and issue
ssh-add
  • Repeat for all Hadoop cluster nodes, which now will point back to the master.
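If the ssh-copy-id utility is present (it ships with the OpenSSH client package on Ubuntu), the key generation and copying steps above can be condensed roughly as follows. This is only a sketch of the same idea, run as the hadoop user on the master:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa     # skip if the key already exists
ssh-copy-id localhost                        # self connectivity
ssh-copy-id <<slave host name>>              # appends the key to the slave's authorized_keys
ssh localhost exit                           # verify: should not ask for a password
ssh <<slave host name>> exit                 # verify: should not ask for a password

The same commands are then repeated on the slave, pointing back to the master.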
 HDFS configuration for multi-node.

/etc/hosts
This file should have all the host names:
parag@PRUBNode1:/etc$ cat hosts
10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1    localhost
 

One problem can arise from the hosts file.
To detect it, run sudo netstat -ntlp on the master. If it shows:
tcp6 0 0 127.0.0.1:9020 :::* LISTEN 32646/java
then the 127.0.0.1 means the process is listening on port 9020 only for connections from localhost; connections on 9020 from outside cannot be received.


This can be resolved by avoiding entries that map localhost to both the actual IP and 127.0.0.1.
The entries shown above do not have this problem, but the following do:


10.192.0.1     localhost
 10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1      localhost


There are also likely to be problems if hostnames are missing from the file.
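After correcting /etc/hosts, restart the daemons and repeat the netstat check; a healthy entry binds port 9020 to the node's real IP (or to all interfaces) rather than to 127.0.0.1. For example:

sudo netstat -ntlp | grep 9020
# expected to look something like:
# tcp6  0  0 10.192.0.1:9020  :::*  LISTEN  <pid>/java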
slaves
The slaves file in the folder $HADOOP_HOME/conf should have the names of all nodes that run a DataNode.
We have provided the listing below:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat slaves
PRUBNode1
PRUBNode2

masters
The masters file in the folder $HADOOP_HOME/conf should have the names of all nodes that run a SecondaryNameNode:
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat masters
PRUBNode1

hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/hdf/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/hdf/data</value>
    </property>
</configuration>


All folders should be generic and the same across all nodes.

core-site.xml
<configuration>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9020</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/work/hdf/tmp</value>
    </property>

</configuration>

All folders should be generic and the same across all nodes. We now specify the master's node name for fs.default.name instead of localhost, as we expect it to be accessed from different nodes.

 Starting HDFS

On the node that hosts the NameNode, issue the command
./start-dfs.sh
from the $HADOOP_HOME/bin folder.

It should start the NameNode, DataNode and SecondaryNameNode on the master, and a DataNode on the slave.

If this is the first time, the NameNode has to be formatted before starting (see below).
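The format is done only once, from the $HADOOP_HOME/bin folder on the master (the same command is shown in the single-node post):

./hadoop namenode -format
./start-dfs.sh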

Test set up

Test 1: Issue command jps at master
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9700 SecondaryNameNode
9357 NameNode
10252 Jps
 


Issuing the jps command on the slave will show only the DataNode (plus Jps).

Test 2: Open administrative console at http://<<hostname of master>>:50070
for me it is http://PRUBNode1:50070



 

Introduction to Data science Part 3: HDFS single node setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the third part of the series of articles. Here we actually start setting up Hadoop and get hands-on experience with HDFS.
Please see the related readings and target audience sections for help in following this blog.
 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • HDFS setup
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for an audience looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please see the links section. You may also like to look at the Cloudera home page and the Hadoop home page for further details.

Hadoop Setup
  • Preconditions:
    Assuming Java 7 and Ubuntu Linux are installed, with at least 4 GB RAM and 50 GB disk space.
    If Ubuntu is a new installation, it is a good idea to update it with
sudo apt-get update
  • Also, if needed, install Java with
sudo apt-get install openjdk-7-jdk

  • Install Eclipse with
sudo apt-get install eclipse
  • Install ssh with
sudo apt-get install openssh-server
Note: these will require an Internet connection, and the firewall should allow connections to the Internet repositories.
  • Following are the steps to set up Hadoop:
    • Download hadoop-1.2.1-bin.tar.gz (a console sketch of the download and extraction is shown just below).
    • Open a console and use the command cd ~ to move to the home directory.
    • Create a folder work under the home directory (~/work).
    • Change directory to the new folder and extract hadoop-1.2.1-bin.tar.gz with the command
    tar -xvf hadoop-1.2.1-bin.tar.gz
    • This will create a hadoop-1.2.1 folder under the work folder; we shall call this the 'hadoop folder'.
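For reference, the download and extraction can be done from the console roughly as follows; the URL points to the Apache archive and the mirror path may differ for you:

cd ~
mkdir -p work
cd work
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
tar -xvf hadoop-1.2.1-bin.tar.gz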
  • Before you proceed to the next steps, be aware of the Java home folder. The following command may help:
which java
  • Go to home folder by cd ~
  • Issue the following command to open the profile file:
  gedit .bashrc
 
  • Assuming that Java is installed using the command mentioned above, add the following lines to .bashrc:
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

 
Note: the first and second lines could be different depending on where Java and Hadoop are installed.
If Java is in some other folder, that path has to be provided; the directory should be the parent folder of the bin folder.
  • To refresh the configuration, issue the command
     . .bashrc
Please note there is a space between the dot and .bashrc in the above command. People with Unix knowledge will find this information redundant.
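A quick way to confirm that the environment variables have taken effect (assuming the archive has already been extracted as above):

echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version       # should report Hadoop 1.2.1 once PATH picks up $HADOOP_HOME/bin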

  • Close gedit.
  • There will be a ‘conf’ folder under the hadoop folder; change directory to the conf folder to do the subsequent configuration tasks.
  • In hadoop-env.sh (use gedit hadoop-env.sh) add the line
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
  • In case of a multi-node cluster set up, we need to add, in the masters file, the node where the SecondaryNameNode will run, and in the slaves file all the nodes where DataNodes will run (governed by the hadoop commands issued on the nodes). Each node goes on one line.
  • The following should be noted:
    • Having a user specific to Hadoop is good practice; it is not shown here.
    • The folder structure should be the same across the cluster if a multi-node set up is used. It is better to use a generic folder structure anyway, so that it does not become a difficulty later.
  • All node names should be in /etc/hosts (use sudo gedit /etc/hosts):
127.0.0.1    localhost
127.0.0.1    PRUBNode1
  •  Edit core-site.xml to add the following,-
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9000</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/tmp</value>
    </property>
</configuration>
 
The above configuration is for the name node, so fs.default.name points to the NameNode URI, with the host name PRUBNode1 (this can be verified with the hostname command at the console). hadoop.tmp.dir is the temporary work area.
 
  • Edit hdfs-site.xml to add the following,-
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/dfs/data</value>
    </property>
</configuration>
A few tips:
For failure prevention, dfs.name.dir should be given more than one directory (one on NFS as well), as a comma-separated list.
The directory structure should be the same across all nodes; for example, dfs.data.dir will be needed on all nodes in the cluster, and the folder path should be the same.
  • cd to the bin folder of hadoop, issue ./hadoop and see the command options. Note that the MapReduce and HDFS commands come together.
 parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  distcp2 <srcurl> <desturl> DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
  • ONLY FOR THE FIRST TIME, WE NEED TO FORMAT the file system (remember what happens if you format your D: drive?):
./hadoop namenode -format


  • Start the Hadoop file system:
./start-dfs.sh
Issue the jps command to see all processes running:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
 
 
Monitor the Hadoop file system from http://<<hostname of nn>>:50070
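Besides the web console, the file system can be checked from the command line; both of the following are standard HDFS client commands (run from the bin folder):

./hadoop dfsadmin -report     # shows configured capacity and the live data nodes
./hadoop fs -ls /             # lists the root of the distributed file system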


  • Stop dfs with ./stop-dfs.sh

Monday, 10 November 2014

Introduction to Data science Part 2

Parag Ray
04-Sep-2014

Introduction

Welcome to the readers! 
This blog is an introduction to MapReduce. Please see the related readings and target audience sections for help in following this blog.


Agenda
  • Target audience.
  • Related readings/other blogs .
  • Map reduce motivation.
  • Features.
  • What Hadoop storage looks like along with Map reduce.
  • Basic Algorithms.
  • Intuition of the process.
  • Physical architecture.
Target Audience
  • This is an introductory discussion on data science, big data technologies and Hadoop.
  • Best suited for an audience looking for an introduction to this technology.
  • No prior knowledge is required, except for a basic understanding of networks and computing and a high-level understanding of enterprise application environments.
Related readings/other blogs  
This is the second part of this series of articles, and related readings are provided in the links. It will be helpful to go through Part 1 first.
You would also like to look at Cloudera home page & Hadoop home page for further details.

Map Reduce intuition 
  • If there are huge amounts of data, it may be a big challenge for a single computational infrastructure to process them.
  • Map reduce allows the entire computational task to be divided into smaller tasks (Map), whose results are then combined (Reduce) for the final result.
  • Maps and reducers can run on separate machines, allowing horizontal scalability.
  • The number of maps and reducers can be very high, allowing scalability.
  • Integrated and optimized for Hadoop, which allows distributed data storage.
  • Data is stored in chunks in Hadoop; instead of transporting data from one node to another and doing the processing, it is more efficient to do the processing locally and transmit the result.
    Approximate comparison of the traditional computation strategy, with one vertically scaled database node and one processor, to the Hadoop Map reduce strategy. Columns approximate the time scale.






Map reduce features at a glance



  • Master-slave architecture, composed mainly of a JobTracker (master) and TaskTrackers (slaves).
  • Failover support is provided by a heartbeat-based system.
  • Pluggable components, with a given set of interfaces, to accomplish various required functions like detailed validation and combination of results.
  • Network-optimized data access from Hadoop for any given topology.
  • Supports optimization features like partitioning.
  • Has various modes of running, like local and pseudo-distributed.

What Hadoop storage looks like along with Map reduce

The following diagram overlays the MR components on top of the HDFS components, as seen in the previous post, to highlight the integration.


  • Map reduce is implemented in close integration with Hadoop.
  • The JobTracker is the master, and TaskTrackers provide the handle to the distributed map tasks.
  • The JobTracker maintains heartbeat contact and restarts jobs if a job is non-responsive.
  • Data access from the Hadoop data nodes is optimized based on policy.
  • Tasks under the various task trackers are capable of data exchange.
Basic algorithm

  • Data blocks reside in HDFS and are read as input splits.
  • Map programs get the splits in a specific structure, such as key-value pairs, and also receive a context handle.
  • Maps run distributed and process the distributed data by accessing it locally (this provides the speed enhancement).
  • Maps have access to various data types, as shown.
  • The framework allows collating data into partitions, keeping similar keys together.
  • Merge and sort are done if a reduce is necessary.
  • Reducers are not compulsory, but are provided for final processing of the data produced by the maps.
  • Map output is transmitted over HTTP to the reduce location.
  • If no reduce is running, then the unsorted map output is written back to the HDFS output location.
  • If a reduce is running, then the reduced data is saved back to HDFS.
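To make the flow above concrete, here is a minimal runnable sketch using Hadoop Streaming, which lets ordinary shell commands act as the mapper and reducer. It assumes a running setup as in the later posts of this series (which appear above on this page), that the streaming jar sits at its usual Hadoop 1.2.1 location under contrib/streaming, and that some text files have already been copied into an HDFS directory named input; run it from $HADOOP_HOME:

# identity mapper: every input line becomes a key;
# the framework sorts the keys before the reduce, so 'uniq -c' counts duplicate lines
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input \
    -output streaming-out \
    -mapper cat \
    -reducer 'uniq -c'

bin/hadoop fs -cat streaming-out/*

The map step runs wherever the input blocks live, the sorted intermediate data is shipped to the reducer over HTTP, and the reduced counts are written back to HDFS, which is exactly the sequence described in the bullets above.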
 Physical architecture
 JT: Job tracker
TT: Task Tracker

  • The client requests a job via RPC to the JT.
  • The JT maintains heartbeats from the TTs to make sure each TT is up and running.
  • The TT has child processes executing the map and reduce tasks. Data access is localized.
  • There is an umbilical protocol to maintain communication between the child task process and the TT.