Total Pageviews

Saturday, 29 November 2014

Introduction to Data science Part 8:Uploading data with flume

Introduction to Data science Part 8:Uploading data with flume

Parag Ray
29-Nov-2014

Introduction

Welcome to the readers!

This is the eighth part of the series of articles. Here we are looking into the needs of uploading data into HDFS or data acquisition tools. We shall be exploring flume as part of the strategy.

If you have not gone through already, you would need to go back to fourth part where multi node set up(see related readings) is discussed, as this post will depend on the multi node set up.
Please note the related readings and target audience section to get help to better follow the blog. 

We are using the operating system Ubuntu 14.04. Hadoop version 1.2.1 and Flume version 1.5.2. java version 1.7.

Disclaimer
I do not represent any particular organization in providing this details here.Implementation of tools and techniques discussed in this presentation should be done by responsible Implementer, capable of undertaking such tasks.

Agenda
  • Target audience
  • Related readings/other blogs
  • Requirements of data acquisition strategy
  • Flume introduction 
  • Flume set up
  • Test
Target audience

  • This is an intermediate level discussion on Hadoop and related tools.
  • Best suited for audience who are looking for introduction to this technology.
  • Prior knowledge in java and Linux required.
  • Intermediate level understanding of networking necessary.
Related readings/other blogs  
This is the eighth article of this series or blogs, other article shortcuts are available in the pages tab.
You would also like to look at Flume, Cloudera  & Hadoop home page for further details.

Requirements of data acquisition strategy

While HDFS works as a good storage infrastructure for storing massive data, providing optimized & fault tolerant data serving capability, we do need separate strategy for acquiring data.

We need high portability or interoperability , scalability and fault tolerance also so that we can reliably gather data from wide range of sources.

Flume is one such platform which can be used to capture data into HDFS.
  
Flume introduction 

Flume is a tool for capturing data from various sources and offers a very flexible way to capture large volume data into HDFS.


It provides features to capture data from multiple sources like logs of web applications or command output.

As a strategy it segregates accumulation of data and consumption of the same.
By providing a storage facility between the two , it can handle a large amount of difference in the accumulation rate and consumption rate.

With persistence between storage and consumption, it guarantees delivery of the captured data.

By having chaining capability and networking between various flume units(agents) it provides a complex and conditional data delivery capability which is also scalable horizontally.

This is not centrally controlled so it is not having any single point of failure.


Flume is designed to run stand alone as JVM process, each one is called an AGENT.
Every capture of data payload as byte array into the Agent is called an EVENT.

Typically , there is a client process , which will be the source of events. Flume can handle various types of data input as shown,-

Avro, log4j, syslog, Http POST(with JSON Body) ,twitter input and Exec source,last one being the out put of a local process call.

Source is a component which is responsible for capturing input. There are multiple type of sources by the nature of input that we are capturing.

Sink is the component which is persisting or delivering the captured data. Depending upon where the data is being delivered, there are various types of sinks like file system, hdfs etc.

Between source and the sink, there is CHANNEL. Channel holds data after it is written to by source and till it is consumed and delivered by sink.

Delivery guarantee is achieved by selecting persistent channel and it holds the data till it is delivered by sink. In case the Flume agent goes down before all of the data is delivered, it can redo delivery once it comes back up as it is retained by channel. However, in case the channel is none persistent,  such guarantee can not work.

All configuration changes are reflected without restart and the flume agents can be chained to create complex and conditional topology.

This done by chaining the sink of one agent to the source of another.
 
Flume set up
 
Flume gzip file needs to be downloaded and extracted to a folder. Below picture shows the flume version 1.5.2 was downloaded and extracted





We shall be setting up the components as shown below,-

In the below configuration ,The agent is called 'agent1' which will have source 's1' , channel 'c1' and sink 'k1'
s1, c1 and k1 will have relevant properties and at the end source s1 and the sink k1 is linked to channel c1.

Here telnet is taken as source for simple testing, and hdfs path is provided to the sink.

Open flume-conf.properties.template in conf folder under the folder created by extraction of the flume set up, and provide the following entries. All other example entries can be commented.

agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1
# For each one of the sources, the type is defined'
#telnet source indicated by type netcat and port as provided on localhost
agent1.sources.s1.type = netcat
agent1.sources.s1.bind = localhost
agent1.sources.s1.port = 44444

# Each sink's type must be defined, hdfs sink at specified url is given

#if needed, the host name (PRUBNode1) and port should be changed.
#host and port should be in line with hdfs core-site.xml config.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path =hdfs://PRUBNode1:9020/flume/webdata/

# Each channel's type is defined.

agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

#link source and sink to channel
agent1.sources.s1.channels = c1
agent1.sinks.k1.channel = c1


Special attention should be given to hdfs.path the property should be as per fs.default.name in core site.The matching entry in my case is as under in core-site.xml.

<property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9020</value>
    </property>


The part /flume/webdata is arbitrary and will be created by flume if not already present.
care should be taken for authentication. We are assuming that the user running flume is the same or used sudo to hadoop user.
Test
Now to test, first we have to start HDFS. so we need to move to HADOOP_HOME/bin and issue ./start-dfs.sh or ./start-all.sh

After HDFS is started and we an see that datanode , name node , secondary name node are active , then we can fire the following command to start flume from the bin folder under the folder created in the extraction of tar file as part of set up,-

  ./flume-ng agent -n agent1 -c conf -f ../conf/flume-conf.properties.template

 flume-ng is the executable command to start the flume following are the parameters explained,-

agent -n : choice of the agent name should come after this , in this case 'agent1'
conf -f: choice of the config file in this case flume-conf.properties.template under conf folder. we updated this properties file few steps before. 

Once this command is fired,there will be a lot of out put on screen, however following would be something where the out put should stop and you shall be able to see as under,-
14/11/30 02:02:38 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]

Since the flume agent is started as foreground process,terminating the command will stop the agent . So we shall open another terminal and issues following command to connect to flume,-
telnet localhost 44444
Once the prompt comes back, we can use telnet prompt to input data by typing

hello world[return]
telnet returns>OK
first upload[return]
telnet returns>OK
second upload[return]
telnet returns>OK



Now we can go to admin console of hdfs to check file,-

Saturday, 22 November 2014

Introduction to Data science Part 5:map red set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fifth part of the series of articles. Here we actually start setting up map reduce in a multi node cluster and have a hands on experience of the map reduce.

If you have not gone through already,You would necessarily  go back to fourth part of the multi node set up as this post will depend on the multi node set up.

Please note the related readings and target audience section to get help to better follow the blog.

 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • Related readings/other blogs
  • MR configuration
  • Test
Target audience

  • This is an intermediate level discussion on Hadoop and related tools.
  • Best suited for audience who are looking for introduction to this technology.
  • Prior knowledge in java and Linux required.
  • Intermediate level understanding of networking necessary.
Related readings/other blogs  
You would also like to look at Cloudera home page & Hadoop home page for further details.

MR configuration

We shall do the configuration for master first.
Please move to $HADOOP_HOME folder and then conf folder under this.
All Map reduce and HDFS configuration files are located here.
Add the following entries to mapred-site.xml:
<configuration>     
<property>
         <name>mapred.job.tracker</name>
         <value>PRUBNode1:9001</value>
     </property>
<property>
         <name>mapred.local.dir</name>
         <value>/home/parag/work/mrhome/localdir</value>
     </property>
<property>
         <name>mapred.system.dir</name>
         <value>/home/parag/work/mrhome/sysdir</value>
     </property>
</configuration> 
 
Please note: PRUBNode1 is my hostname for the master node, please change it 
for host name of your master node.
mapred.local.dir  & mapred.system.dir  should be given a generic directory name 
which has to be the same (once we move to multinode) for all nodes. Directory needs 
be separately created as they will not be created by Hadoop.

For slave also the same configuration change is needed.

Start the hadoop daemons from the master node, issue following command from  $HADOOP_HOME/bin folder:
$ bin/start-all.sh 


as discussed in the first few blogs in this series, the Jobtracker is the master for map reduce and it should be visible now if you issue the following command,-

jps

Sample out put from Master,
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9788 JobTracker
9700 SecondaryNameNode
9962 TaskTracker
9357 NameNode
10252 Jps
 

Job tracker also has a built in UI for basic monitoring,-
http://<<hostname>>:50030/  for example in my case it is http://PRUBNode1:50030


TEST MR installation

Test 1: issue jps command and you would expect to see out put as in previous section.  master will show all jobs . Slave will show only datanode and task tracker.

Test 2:
go to admin UI (http://<<hostname>>:50030) and check all nodes are visible or not.


Test 3: run a test job

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesytem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
 

Thursday, 20 November 2014

Introduction to Data science Part 4:HDFS multi-node set up

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the fourth part of the series of articles. Here we actually start setting up HDFS in a multi node cluster and have a hands on experience of the HDFS.

If you have not gone through already,You would necessarily  go back to third part of the single node set up(see related readings) as this post will depend on the single node set up.

Please note the related readings and target audience section to get help to better follow the blog.

 
We are assuming the operating system to be Ubuntu 14.04.


Agenda
  • Target audience
  • Related readings/other blogs
  • Multi-node cluster setup steps
Target audience

  • This is an intermediate level discussion on Hadoop and related tools.
  • Best suited for audience who are looking for introduction to this technology.
  • Prior knowledge in java and Linux required.
  • Intermediate level understanding of networking necessary.
Related readings/other blogs  
Please refer MR setup first part in section 3 first.You would also like to look at Cloudera home page & Hadoop home page for further details.

Multi-node cluster setup steps

Multi-node set up requires that single node set up is done in each of the nodes. This can be done any time before step 3 Hadoop configuration below.

  1. Network connect
     Connect the underlying network
  2. Ssh connectivity
     Setup ssh connection
  3. HDFS configuration for multi-node. 
Network connect

A home network is demonstrated here.The set up is academic and not production quality. A full blown setup would set up comprise of all components including segment A to B below, but we are covering only segment B here with two nodes, one master and one slave.IP 4 network is shown here.

 Ip address set up:
Navigate to Use system settings > network>wired connection and turn the Wired network on.


Set 'manual' 'ip4' config and provide ip 'address' and use same 'netmask' for all PCs in network. Network config details may come under 'Options' button


In case you are using other Linux , the set up may be different. Here is one example for oracle linux.

SSH connectivity

SSH connectivity is used for connectivity may be used across nodes.
This needs to be connectivity across nodes where data communication is expected. Two way connectivity between master and slave node is a must.
Self connectivity is also a must.









In the above diagram NN indicate name node and DN indicate datanode of course.
  • Master and slaves talk via ssh connectivity.
     
  • We need ssh without need to provide credential for each requests.
     
  • The set up for ssh without password to provided is not ideal for security but
    In this case for each transaction of hadoop, providing credential is not practical.
     
  • There are other ways ,but for the purpose of this presentation, we shall stick to this set up
     
  • Both master and slave nodes must open ssh connectivity to each other  and
    to themselves.
Now for details steps

  • It is assumed that for all these we are using  the hadoop user. Sudo to respective user is also an option.
  • First from master node check that  ssh to the localhost  happens without a passphrase:
ssh localhost
  • If passphrase is requested, execute the following commands:
ssh-keygen -t rsa -P “” -f ~/.ssh/id_rsa
(The part -f ~/.ssh/id_rsa is not mandatory as that is the default location and system will ask for this location if not provided)
  • Copy public key to authorized_keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • On master node try to ssh again to localhost and if passphrase is still requested, do the  following
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chown  <<hadoop user>> $HOME/.ssh/authorized_keys

it will be a good idea to check if you can do
ssh localhost and also ssh <<nodename>>
  • If after executing ssh localhost any error message is returned that indicates that ssh server is not installed you may like to try the following,-
    • Use the following command to install ssh server as under,-
sudo apt-get install openssh-server

  • Then can repeat the steps starting from trying ssh localhost
  • Now public key needs to be copied to all of  slave machine (only one here).  Ssh should be installed in all slave machines. We do strongly recommend ssh understanding.
scp ~/.ssh/id_rsa.pub <<slave host name>>:~/.ssh/<<master host name>>.pub   (/etc/hosts should be updated with slave and master host entries )
  • Now login (ssh <<slave host name>>)  to your slave machine. Password will be needed  still.
  • While on your slave machine ,issue the following command  to append your master machine’s hadoop user’s public key to the slave machine’s  authorized key store,
cat ~/.ssh/<<master host name>>.pub >> ~/.ssh/authorized_keys
  • Issue exit to come out of slave
  • Now, from the master node try to ssh to slave.
ssh <<slave host name>>
  • if passphrase is still requested, do the  following after  logging in with password to slave
chmod 700 $HOME $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys 
chown  <<hadoop user>> $HOME/.ssh/authorized_keys
  • Exit and log in again from your master node.
ssh <<slave host name>>
  • if it still asks for password , log in using password, log out and issue 
ssh-add
  • Repeat for all Hadoop Cluster Nodes which now will point back to master.
 HDFS configuration for multi-node.

 /etc/hosts
should have all host names
parag@PRUBNode1:/etc$ cat hosts
10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1    localhost
 

there could be one problem that might happen out of hosts file.
 to detect this problem run sudo netstat -ntlp on the master , it shows:
tcp6 0 0 127.0.0.1:9020 :::* LISTEN 32646/java
This 127.0.0.1 means that it is only listening to connection on 9020 which is from localhost , all the connection on 9020 from outside cannot be received.


this may be resolved by preventing entries that point local host for both actual ip and 127.0.0.1.
the entry shown above will not have this problem but the following will


10.192.0.1     localhost
 10.192.0.2    PRUBNode2
10.192.0.1    PRUBNode1
127.0.0.1      localhost


 there is likely to be problems if hostnames are missed out as well.
slaves
slaves file in the folder $HADOOP_HOME/conf ,should have all the node names that has data node running
 we have proved the listing as below,
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat slaves
PRUBNode1
PRUBNode2

master
master file n the folder $HADOOP_HOME/conf should have all the node names that has secondarynamenode running
parag@PRUBNode1:~/work/hadoop-1.2.1/conf$ cat masters
PRUBNode1

hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</n

ame>
<value>/home/parag/work/hdf/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/parag/work/hdf/data</value>
</property>
</configuration>


all folders should be generic and same across all nodes

core-site.xml
<configuration>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9020</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/work/hdf/tmp</value>
    </property>

</configuration>

all folders should be generic and same across all nodes. we are now mentioning the nodename of master for fs.defaultt.name instead of local host as we expect it to be accessed from different nodes.

 Starting HDFS

In the name node containing Node issue command 
./start-dfs.sh 
from $HADOOP_HOME/bin folder

it should start name node , data node and secondary name node  in master and data node in slave.

if it is first time, name node format to be done. 

Test set up

Test 1: Issue command jps at master
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
9530 DataNode
9700 SecondaryNameNode
9357 NameNode
10252 Jps
 


Issue command jps at slave will show only DataNode

Test 2: Open administrative console at http://<<hostname of master>>:50070
for me it is http://PRUBNode1:50070



 

Introduction to Data science Part 3: HDFS single node setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the third part of the series of articles. Here we actually start setting up Hadoop and have a hands on experience of the HDFS.
Please note the related readings and target audience section to get help to better follow the blog.
 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • HDFS setup
Target audience

  • This is an intermediate level discussion on Hadoop and related tools.
  • Best suited for audience who are looking for introduction to this technology.
  • Prior knowledge in java and Linux required.
  • Intermediate level understanding of networking necessary.
Related readings/other blogs  
Please see links section.You would also like to look at Cloudera home page & Hadoop home page for further details.

Hadoop Setup
  • Preconditions:
    Assuming java 7 and Ubuntu linux installed with at least 4 GB RAM and 50 GB disk space.
    If Ubuntu is new installation good idea to update with     
apt-get install update       
  • Also if needed install java with 
apt-get install openjdk-7-jdk

  • Install eclipse with
apt-get install eclipse
  • Install ssh with
 apt-get install openssh  
Note:These will require Internet connection and firewall should allow the connection to Internet repository.
  • Following are the steps to set up Hadoop
    • Download hadoop-1.2.1-bin.tar.gz.
    • Open console and use  command cd ~  to move to home directory
    • Create a folder /work.
    • Change directory to new folder and extract the hadoop-1.2.1-bin.tar.gz with command ,-
    tar –xvf hadoop-1.2.1-bin.tar.gz  
    • This will create a hadoop-1.2.1 folder under work folder. we shall cal this 'hadoop folder'.
  • Before you proceed to next steps , be aware of the java home folder. The following comamnd may help
which java
  • Go to home folder by cd ~
  • Issue following command to open profile file.
  gedit .bashrc
 
  • Assuming that java is installed using command as mentioned above, Add following lines in .bashrc,-
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

 
note first and second line could be different depending on where java and hadoop are installed
In case java is in some other folder, the same has to be provided, the directory should be the parent folder of the bin folder.
  • To refresh the configuration, issue command
     . . Bashrc
please note there is a space between the two dots in the above command. People with unix knowledge will find this information redundant.

  • Close gedit.
  • There will be ‘conf’ folder under hadoop folder created, change directory to the conf folder to do the subsequent config tasks.
  • In hadoop-env.sh(use gedit hadoop-env.sh)  add command 
  •  
export JAVA_HOME = /usr/lib/jvm/java-7-openjdk-i386
  • In case of multi-node cluster set up we need to add master for location where secondary namenode will be running(governed by hadoop command issued on nodes), and slave file where all datanode will be added. Each node in one line.  
  • Following should be noted,-
    • -Having a user specific to Hadoop will be good, it is not shown here
      -Folder structure should be same across cluster if multi-node set up is used. it is better to do generic folder structure any way so that later it is not a difficulty.
  •  All node names should be in /etc/hosts  (use sudo gedit /etc/hosts)
127.0.0.1    localhost
127.0.0.1    PRUBNode1
  •  Edit core-site.xml to add the following,-
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9000</value>
    </property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/tmp</value>
    </property>
</configuration>
 
The above configuration is for name node so in fs.default.name it points to hdfs admin console, host name as PRUBNode1. This can be verified with hostname command at console)  hadoop.tmp.dir is temporary work area.
 
  • Edit hdfs-site.xml to add the following,-
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/parag/work/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/parag/work/dfs/data</value>
</property>
</configuration>
few tips:
To do failure prevention, should  have more than one with NFS as well by comma separated list.
directory structure be same across all nodes. for example dfs.data.dir will be needed in all nodes in cluster, and the folder should be the same.
  • Cd to bin folder of hadoop and issue ./hadoop and see the command options. Note map reduce and hadoop commands are coming together.
 parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  distcp2 <srcurl> <desturl> DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
  • ONLY FOR THE FIRST  TIME WE NEED TO FORMAT the file system(remember what happens if we format you d drive???)  
./hadoop namenode –format


  • Start hadoop file system
. ./start_dfs.sh
issue jps command to see all processes running
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
 
 
Monitor hadoop file system from http://<<hostname of nn>>:50070


  • Stop dfs with ./stop-dfs.sh

Monday, 10 November 2014

Introduction to Data science Part 2

Parag Ray
04-Sep-2014

Introduction

Welcome to the readers! 
This blog is an introduction to the Map reduce. please note the related readings and target audience section to get help to better follow the blog.


Agenda
  • Target audience.
  • Related readings/other blogs .
  • Map reduce motivation.
  • Features.
  • How does Hadoop storage  look like along with Map reduce.
  • Basic Algorithms.
  • Intuition of the process.
  • Physical architecture.
Target Audience
  • This is an introductory discussion on data science, big data technologies and Hadoop.
  • Best suited for audience who are looking for introduction to this technology.
  • There is no prior knowledge required,except for basic understanding of network, computing and high level understanding of enterprise application environments.
Related readings/other blogs  
This is the second part of this series of articles, and related reading are provided in the links. It will be helpful to go through part 1 first.
You would also like to look at Cloudera home page & Hadoop home page for further details.

Map Reduce intuition 
  • If there are huge amounts of data it may be a big challenge for single computational infrastructure to process that.
  • Map reduce allows the entire computational task to be divided in to smaller tasks(Map) and then combining(Reduce) them for final result.
  • Maps and reducers can run on separate Machines allowing horizontal scalability.
  • Number of maps and reducers can be very high allowing scalability.
  • Integrated and optimized for Hadoop which allows distributed data storage. 
  • Data are stored in chunks in Hadoop, instead of transporting data from one node to another and doing the processing , it is more efficient to do the processing locally and transmit the result.
    Approximate comparison of traditional computation strategy with one vertically scaled Database node and one processor to Hadoop Map reduce strategy. Columns approximate time scale.






Map features reduce at a glance



  • Master slave architecture.Composed mainly of Job trackers(master) and Task trackers(slave) .
  • Fail over support is provided by a heart beat based system.
  • Plug-able components with given set of interfaces to accomplish various required functions like detailed validation, combination of results.
  • Network optimized for data access from Hadoop for any given topology.
  • Supports optimization features like partitions.
  • Has various modes of running like local, pseudo-distributed.

How does Hadoop storage  look like along with Map reduce.

The following diagram over lays the MR components on top of the HDFS components as we have seen in the previous post to high light the integration.























  • Map reduce is implemented in close integration with Hadoop.
  • Job tracker is the master and task tracker provide handle to the distributed map tasks.
  • Job tracker maintains heart beat contact and restarts jobs if a job is none-responsive.
  • Data access from Hadoop data nodes are optimized based on policy.
  • Tasks under various job trackers are capable of data exchange.
Basic algorithm

  • Data blocks reside in HDFS and they are read as input split.
  • Map programs get the splits with specific structure like key value and also receives the context handle.
  • Maps run distributed and process distributed data by accessing them locally ( provides speed enhancement by this)
  • Maps have access to various data types as shown
  • Framework allows collating data in partitions having similar values together.
  • Merge and sort is done if reduce is necessary.
  • Reducers are not compulsory, but are provided to for final processing of data produced by Maps
  • Map data is transmitted across http to reduce location.
  • reduce 
  • If Reduce is not running then unsorted data is submitted back to HDFS out put location.
  • If Reduce is running then reduced data is saved back to HDFS.
 Physical architecture
 JT: Job tracker
TT: Task Tracker

  • Client request job via RPC to JT
  • JT maintain Heart beat from TT to make sure the TT is up and running.
  • TT has child job nodes executing Map and reduce jobs. Data access is localized.
  • There is Umbilical protocol to maintain communication between Job node and TT.

Sunday, 2 November 2014

Introduction to Data science Part 1


Parag Ray
29-Aug-2014

Introduction

Welcome to the readers!

The purpose of writing this blog is put together all the finding and knowledge that I currently have have and will be gathering in the field of data science and big data technology and allied application area.

I hope this collection will help you.

I shall be covering various concepts, technologies starting from this basic overview, please look at the target audience and related readings section for suitability of your need. This blog is written more like a book and may be edited for correction, addition and expansion.


Agenda
  • Target audience
  • Related readings/other blogs
  • Data science and Big data definition
  • Use of Hadoop in big data
  • Hadoop at a glance.
  • Typical Use cases.
  • Types of Algorithm for  analytics and ML.
  • Concepts & Skill base.
  • How does Hadoop storage  look like.
  • The ecosystem.
Target Audience
  • This is an introductory discussion on data science, big data technologies and Hadoop.
  • Best suited for audience who are looking for introduction to this technology.
  • There is no prior knowledge required,except for basic understanding of computing and high level understanding of enterprise application environments.
Related readings/other blogs  
This is the first of this series or blogs, we shall add other bog titles as they are added.
Other article shortcuts are available in the pages tab.
You would also like to look at cloudera home page & Hadoop home page for further details.

Data Science and Big data definition
I am using wikipedia definition here, as I find them very appropriate,-
Data science is the study of the generalizable extraction of knowledge from data,[1] yet the key word is science.[2] It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. The subject is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. Another key ingredient that boosted the practice and applicability of data science is the development of machine learning - a branch of artificial intelligence - which is used to uncover patterns from data and develop practical and usable predictive models.
For more details please visit http://en.wikipedia.org/wiki/Data_science.


Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."[1]
Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4] The limitations also affect Internet search, finance and business informatics.For more details please visit http://en.wikipedia.org/wiki/Big_data

Use of Hadoop in big data 

Hadoop has become a very popular platform for big data analysis and also data science as it allows a reliable massive horizontally scalable  data store.

By default, the data is not fetched linearly from a vertically scaled  infrastructure( for example in case of an RDBMS), this makes it relatively faster in data retrieval.

Basically this is created for batch processing , there has been various related technologies which enable near real time access of data stored in Hadoop.

Although meant for commodity hardware, some of the vendors have come up with specialized and proprietory hardware, which improve the speed of data access even more.
But these are by nature prone to vendor lock in, however for very high -end usage that should not be a problem.

Hadoop at a glance.

Hadoop is comprised of two main components ,-
  • HADOOP DISTRIBUTED FILE SYSTEM(HDFS)
A distributed storage or a file system that can span across a cluster made up of even commodity ( I mean no special server hardware needed) hardware. It can scale up to any number of nodes as needed.
  • MAP REDUCE
A framework for map reduce algorithm where a large computation can be broken into small tasks running in a cluster(same cluster as that of HDFS typically) and then results are combined or 'reduced' to  arrive at information desired.

Hadoop and Map reduce are integrated.



Typical Use cases.
  • E-Commerce.
There are specialized systems called recommender systems for recommendations.
Site performance on  DNS look up time, form loading time. Frequency of request by page etc can be found out analyzing very large log data.

  • Banking
Money laundering analysis
Risk analysis
Analytics of various investment and instruments related data
Market analysis

  • Network
VPN data analysis
Anomalous access detection

  • Telecom
Finding signal breaks and classification or anomaly detection.

Types of Algorithm for  analytic and ML.
  • Data analysis to find sample/ population characteristics
These type of use involves storing huge amount of data (what about sensor data of the steam pipe line of a factory or log data of a network) to do statistical /numerical analysis like finding average latency, average temperature, variance etc. The parameters & mathematical functions of such analysis are provided based on domain needs.
  •  Advanced algorithms
There are other advanced analytics in line with Machine learning  also there are those algos which try to analyze a data set and then try to find out a predictive model or grouping model and are intelligent enough to find the best fit parameters by themselves,-
    • Supervised learning: 
We have access to a learning set where we know that certain model of relation exist like y=f(x) and in the learning set some values of y corresponding to x are provided. 
Based on these , we can try to predict y values for other x's for which y is not known. 
In supervised learning , a learning set helps us find a model or the nature of predicted functions f(x) by predicting the parameter set of the linear or non linear functional used. 

Challenges involved in choosing a linear or none-linear model, optimization of the algorithm, data standardization so that the analysis runs within performance requirements and accuracy. There will be need for application of techniques like feature scaling ,proper selection of parameters to be employed.
The discussion done so far for example ,involves regression analysis of market price prediction based on various determining factors. Here the learning set may be a set of data where market price along with the determining factors are provided, once the learning is done with this, the parameters found can be used in other cases where determining factors are known but market price is not known.
 

On the other hand classification algorithm tries to classify data points in various categories one of the prominent example will be OCR.
    • Unsupervised learning
Unsupervised learning does not have the advantage of learning set. Classification is done by relative value of various parameters of sample points.
  • Those computational tasks that can be broken into distributed iterative logic is particularly applicable for Hadoop based systems.
  • Hadoop provides platform for massive data storage but analytics is performed with various tools that range form MR adaptation to pig , hive, Hbase , R as well as Mahout.

Concepts & Skill base.

Distributed file system
Large files stored in blocks and replicated across machines.
Takes care of network failure and recovery.
Optimizes data access based on topology & access point.


Skills needed for a big data professional
Although there are specializations but the following skills look to be important.
  • Tools expertise. Wide range of tool knowledge is required as it is not one size fit all.
  • Knowledge of statistical principles like Sampling, central tendency, variance, correlation, regression analysis & time series, various probability distributions like normal, chi-square , t distribution..
  • Knowledge of matrix algebra.
  • Knowledge of network and Linux/Unix operating system
  • Domain knowledge. 
For Java/J2EE developers, it may be important to remember that we are not( in most cases) having  the support of RDBMS any more and we are dealing with BIG data size, so slightest of unnecessary computational cost gets amplified and ends up in big inefficiency.Very obvious implementation technique like loops may not  prove to be good thing and we need to avoid loops as much as we can! 
Be ready for heavy intellectual challenges arising out of such optimization.

How does Hadoop storage  look like.


  • HDFS storage span across nodes(commodity hardware included) racks , clusters and data centers.
  • Master in this set up is the name node and slave is the Data node.
  • Master Name node serves locations and holds metadata and provides pointers to data nodes holding the requested resource.
  • Data nodes server data and are not dependent on Name node.
  • Data resources are replicated multiple times across Data nodes.
  • Name node is intelligent enough to provide reference to all available replications of same resources but ordered by fastest accessibility based of the configured topology.

The Ecosystem.