Parag Ray
29-Sep-2014
Introduction
Welcome to the readers!
This is the third part of the series of articles. Here we actually start setting up Hadoop and get hands-on experience with HDFS.
Please see the related readings and target audience sections for help in following the blog.
We are assuming the operating system to be Ubuntu 14.04.
Agenda
- Related readings/other blogs
- Target audience
- HDFS setup
Target audience
- This is an intermediate-level discussion on Hadoop and related tools.
- It is best suited for readers who are looking for an introduction to this technology.
- Prior knowledge of Java and Linux is required.
- An intermediate-level understanding of networking is necessary.
Related readings/other blogs
Please see the links section. You may also want to look at the Cloudera home page and the Hadoop home page for further details.
Hadoop Setup
- Preconditions:
Assuming Java 7 and Ubuntu Linux are installed, with at least 4 GB RAM and 50 GB disk space.
If Ubuntu is a new installation, it is a good idea to update it first with
sudo apt-get update
- Also, if needed, install Java with
sudo apt-get install openjdk-7-jdk
- Install Eclipse with
sudo apt-get install eclipse
- Install ssh with
sudo apt-get install ssh
Note: these installations require an Internet connection, and the firewall must allow connections to the Internet package repositories.
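A quick sanity check of the preconditions may help before proceeding; this is a minimal sketch, and the exact version strings will differ on your machine:
java -version          # should report a 1.7.x JVM
ssh -V                 # confirms the ssh client is installed
free -m                # check available RAM (at least 4 GB recommended)
df -h ~                # check free disk space in the home directory (at least 50 GB)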
- The following are the steps to set up Hadoop:
- Download hadoop-1.2.1-bin.tar.gz.
- Open a console and use the command cd ~ to move to the home directory.
- Create a folder named work under the home directory (~/work).
- Change directory to the new folder and extract hadoop-1.2.1-bin.tar.gz with the command:
tar -xzf hadoop-1.2.1-bin.tar.gz
- This will create a hadoop-1.2.1 folder under the work folder; we shall call this the 'hadoop folder'.
- Before you proceed to the next steps, be aware of the Java home folder. One command that may help (it prints the full path of the java binary, from which the JDK folder can be read) is:
readlink -f $(which java)
- Go to the home folder with cd ~.
- Issue the following command to open the profile file:
gedit .bashrc
- Assuming that Java was installed using the command mentioned above, add the following lines to .bashrc:
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
Note: the first and second lines could be different depending on where Java and Hadoop are installed.
If Java is in some other folder, provide that path instead; the JAVA_HOME directory should be the parent folder of the bin folder.
- To refresh the configuration, issue the command
. .bashrc
Please note there is a space between the two dots in the above command; people with Unix knowledge will find this information redundant.
- Close gedit.
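To confirm that the new environment variables have taken effect, a quick check such as the following should work; the paths printed by echo will be the ones you configured above:
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version         # should report Hadoop 1.2.1 once the PATH is updated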
- There is a 'conf' folder under the hadoop folder; change directory to the conf folder to do the subsequent configuration tasks.
- In hadoop-env.sh (use gedit hadoop-env.sh) add the line that sets JAVA_HOME, for example:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
- In case of a multi-node cluster setup, we need to add to the masters file the host where the secondary namenode will run (which daemon runs where is governed by the hadoop commands issued on the nodes), and to the slaves file the hosts where the datanodes will run, one node per line (see the sketch after this list of notes).
- The following should be noted:
- Having a user specific to Hadoop is a good idea; it is not shown here.
- The folder structure should be the same across the cluster if a multi-node setup is used. It is better to use a generic folder structure anyway, so that it does not cause difficulties later.
- All node names should be in /etc/hosts (use sudo gedit /etc/hosts)
127.0.0.1 PRUBNode1
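Purely as an illustration, for a hypothetical two-node cluster with hosts PRUBNode1 (namenode and secondary namenode) and PRUBNode2 (datanode), the conf files and /etc/hosts might look like the sketch below; the second hostname and the IP addresses are made up for the example, and in a multi-node setup the names must resolve to the nodes' real addresses rather than 127.0.0.1:
# conf/masters -- host where the secondary namenode will run
PRUBNode1
# conf/slaves -- hosts where datanodes will run, one per line
PRUBNode1
PRUBNode2
# /etc/hosts on every node -- example addresses only
192.168.1.11 PRUBNode1
192.168.1.12 PRUBNode2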
- Edit core-site.xml to add the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://PRUBNode1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/parag/tmp</value>
</property>
</configuration>
The above configuration is for the name node: fs.default.name points to the HDFS namenode (the default file system URI), with the host name PRUBNode1 (this can be verified with the hostname command at the console). hadoop.tmp.dir is the temporary work area.
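A quick way to check both settings (the tmp path is the one configured above; adjust it to your own):
hostname                  # should print PRUBNode1, the host used in fs.default.name
mkdir -p /home/parag/tmp  # create the hadoop.tmp.dir location if it does not exist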
- Edit hdfs-site.xml to add the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/parag/work/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/parag/work/dfs/data</value>
</property>
</configuration>
A few tips:
For failure prevention, dfs.name.dir should list more than one directory, including one on an NFS mount, as a comma-separated list (see the sketch below).
The directory structure should be the same across all nodes; for example, dfs.data.dir is needed on every node in the cluster, and the folder should be the same.
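A minimal sketch of such a dfs.name.dir entry, assuming a hypothetical NFS mount at /mnt/nfs/dfs/name (replace both paths with your own); the namenode then writes its metadata to every directory in the list:
<property>
<name>dfs.name.dir</name>
<value>/home/parag/work/dfs/name,/mnt/nfs/dfs/name</value>
</property>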
- cd to the bin folder of hadoop, issue ./hadoop, and see the command options. Note that the MapReduce and HDFS commands come together.
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility
fs run a generic filesystem user client
balancer run a cluster balancing utility
oiv apply the offline fsimage viewer to an fsimage
fetchdt fetch a delegation token from the NameNode
jobtracker run the MapReduce job Tracker node
pipes run a Pipes job
tasktracker run a MapReduce task Tracker node
historyserver run job history servers as a standalone daemon
job manipulate MapReduce jobs
queue get information regarding JobQueues
version print the version
jar <jar> run a jar file
distcp <srcurl> <desturl> copy file or directories recursively
distcp2 <srcurl> <desturl> DistCp version 2
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
- ONLY FOR THE FIRST TIME we need to FORMAT the file system (remember what happens when you format your D: drive?):
./hadoop namenode -format
- Start the hadoop file system with ./start-dfs.sh (from the bin folder).
Issue the jps command to see all the processes running:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
Monitor the hadoop file system from http://<<hostname of nn>>:50070 (the namenode web UI).
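With the daemons running, a few basic HDFS commands give the hands-on feel promised in the introduction. This is a minimal sketch run from the hadoop bin folder; the directory and file names are made up for the example:
./hadoop fs -mkdir /user/parag/input                 # create a directory in HDFS
./hadoop fs -put ~/somefile.txt /user/parag/input    # copy a local file into HDFS
./hadoop fs -ls /user/parag/input                    # list the directory
./hadoop fs -cat /user/parag/input/somefile.txt      # print the file contents
./hadoop dfsadmin -report                            # summary of live datanodes and capacity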
- Stop dfs with ./stop-dfs.sh
It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making the blogs better.