Parag Ray
29-Sep-2014
Introduction
Welcome to the readers!
This is the third part of the series of articles. Here we actually start setting up Hadoop and get hands-on experience with HDFS.
Please see the related readings and target audience sections for help in following the blog.
We are assuming the operating system to be Ubuntu 14.04.
Agenda
- Related readings/other blogs
- Target audience
- HDFS setup
Target audience
- This is an intermediate-level discussion on Hadoop and related tools.
- It is best suited for readers who are looking for an introduction to this technology.
- Prior knowledge of Java and Linux is required.
- An intermediate-level understanding of networking is necessary.
Related readings/other blogs
Please see the links section. You may also want to look at the Cloudera home page and the Hadoop home page for further details.
Hadoop Setup
- Preconditions:
Assuming Java 7 and Ubuntu Linux are installed, with at least 4 GB RAM and 50 GB disk space.
If Ubuntu is a new installation, it is a good idea to update it first with
sudo apt-get update
- Also, if needed, install Java with
sudo apt-get install openjdk-7-jdk
- Install Eclipse with
sudo apt-get install eclipse
- Install ssh with
sudo apt-get install ssh
Note: these installations require an Internet connection, and the firewall must allow connections to the Internet package repositories.
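A quick sanity check of the preconditions may help before proceeding; this is a minimal sketch, and the exact version strings will differ on your machine:
java -version          # should report a 1.7.x JVM
ssh -V                 # confirms the ssh client is installed
free -m                # check available RAM (at least 4 GB recommended)
df -h ~                # check free disk space in the home directory (at least 50 GB)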
- The following are the steps to set up Hadoop:
- Download hadoop-1.2.1-bin.tar.gz.
- Open a console and use the command cd ~ to move to the home directory.
- Create a folder named work under the home directory (~/work).
- Change directory to the new folder and extract hadoop-1.2.1-bin.tar.gz with the command:
tar -xzf hadoop-1.2.1-bin.tar.gz
- This will create a hadoop-1.2.1 folder under the work folder; we shall call this the 'hadoop folder'.
- Before you proceed to the next steps, be aware of the Java home folder. One command that may help (it prints the full path of the java binary, from which the JDK folder can be read) is:
readlink -f $(which java)
- Go to the home folder with cd ~.
- Issue the following command to open the profile file:
gedit .bashrc
- Assuming that Java was installed using the command mentioned above, add the following lines to .bashrc:
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
Note: the first and second lines could be different depending on where Java and Hadoop are installed.
If Java is in some other folder, provide that path instead; the JAVA_HOME directory should be the parent folder of the bin folder.
- To refresh the configuration, issue the command
. .bashrc
Please note there is a space between the two dots in the above command; people with Unix knowledge will find this information redundant.
- Close gedit.
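To confirm that the new environment variables have taken effect, a quick check such as the following should work; the paths printed by echo will be the ones you configured above:
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version         # should report Hadoop 1.2.1 once the PATH is updated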
- There is a 'conf' folder under the hadoop folder; change directory to the conf folder to do the subsequent configuration tasks.
- In hadoop-env.sh (use gedit hadoop-env.sh) add the line that sets JAVA_HOME, for example:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
- In case of a multi-node cluster setup, we need to add to the masters file the host where the secondary namenode will run (which daemon runs where is governed by the hadoop commands issued on the nodes), and to the slaves file the hosts where the datanodes will run, one node per line (see the sketch after this list of notes).
- The following should be noted:
- Having a user specific to Hadoop is a good idea; it is not shown here.
- The folder structure should be the same across the cluster if a multi-node setup is used. It is better to use a generic folder structure anyway, so that it does not cause difficulties later.
- All node names should be in /etc/hosts (use sudo gedit /etc/hosts)
127.0.0.1 PRUBNode1
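Purely as an illustration, for a hypothetical two-node cluster with hosts PRUBNode1 (namenode and secondary namenode) and PRUBNode2 (datanode), the conf files and /etc/hosts might look like the sketch below; the second hostname and the IP addresses are made up for the example, and in a multi-node setup the names must resolve to the nodes' real addresses rather than 127.0.0.1:
# conf/masters -- host where the secondary namenode will run
PRUBNode1
# conf/slaves -- hosts where datanodes will run, one per line
PRUBNode1
PRUBNode2
# /etc/hosts on every node -- example addresses only
192.168.1.11 PRUBNode1
192.168.1.12 PRUBNode2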
- Edit core-site.xml to add the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://PRUBNode1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/parag/tmp</value>
</property>
</configuration>
The above configuration is for the name node: fs.default.name points to the HDFS namenode (the default file system URI), with the host name PRUBNode1 (this can be verified with the hostname command at the console). hadoop.tmp.dir is the temporary work area.
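A quick way to check both settings (the tmp path is the one configured above; adjust it to your own):
hostname                  # should print PRUBNode1, the host used in fs.default.name
mkdir -p /home/parag/tmp  # create the hadoop.tmp.dir location if it does not exist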
- Edit hdfs-site.xml to add the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/parag/work/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/parag/work/dfs/data</value>
</property>
</configuration>
A few tips:
For failure prevention, dfs.name.dir should list more than one directory, including one on an NFS mount, as a comma-separated list (see the sketch below).
The directory structure should be the same across all nodes; for example, dfs.data.dir is needed on every node in the cluster, and the folder should be the same.
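A minimal sketch of such a dfs.name.dir entry, assuming a hypothetical NFS mount at /mnt/nfs/dfs/name (replace both paths with your own); the namenode then writes its metadata to every directory in the list:
<property>
<name>dfs.name.dir</name>
<value>/home/parag/work/dfs/name,/mnt/nfs/dfs/name</value>
</property>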
- cd to the bin folder of hadoop, issue ./hadoop, and see the command options. Note that the MapReduce and HDFS commands come together.
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility
fs run a generic filesystem user client
balancer run a cluster balancing utility
oiv apply the offline fsimage viewer to an fsimage
fetchdt fetch a delegation token from the NameNode
jobtracker run the MapReduce job Tracker node
pipes run a Pipes job
tasktracker run a MapReduce task Tracker node
historyserver run job history servers as a standalone daemon
job manipulate MapReduce jobs
queue get information regarding JobQueues
version print the version
jar <jar> run a jar file
distcp <srcurl> <desturl> copy file or directories recursively
distcp2 <srcurl> <desturl> DistCp version 2
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
- ONLY FOR THE FIRST TIME we need to FORMAT the file system (remember what happens when you format your D: drive?):
./hadoop namenode -format
- Start the hadoop file system with ./start-dfs.sh (from the bin folder).
Issue the jps command to see all the processes running:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
Monitor the hadoop file system from http://<<hostname of nn>>:50070 (the namenode web UI).
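With the daemons running, a few basic HDFS commands give the hands-on feel promised in the introduction. This is a minimal sketch run from the hadoop bin folder; the directory and file names are made up for the example:
./hadoop fs -mkdir /user/parag/input                 # create a directory in HDFS
./hadoop fs -put ~/somefile.txt /user/parag/input    # copy a local file into HDFS
./hadoop fs -ls /user/parag/input                    # list the directory
./hadoop fs -cat /user/parag/input/somefile.txt      # print the file contents
./hadoop dfsadmin -report                            # summary of live datanodes and capacity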
- Stop dfs with ./stop-dfs.sh
It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making the blogs better.