Introduction to Data science Part 8:Uploading data with flume
Parag Ray
29-Nov-2014
Introduction
This is the eighth part of the series of articles. Here we are looking into the needs of uploading data into HDFS or data acquisition tools. We shall be exploring flume as part of the strategy.
If you have not gone through already, you would need to go back to fourth part where multi node set up(see related readings) is discussed, as this post will depend on the multi node set up.
Please note the related readings and target audience section to get help to better follow the blog.
We are using the operating system Ubuntu 14.04. Hadoop version 1.2.1 and Flume version 1.5.2. java version 1.7.
Disclaimer
I do not represent any particular organization in providing this details here.Implementation of tools and techniques discussed in this presentation should be done by responsible Implementer, capable of undertaking such tasks.
Agenda
This is the eighth article of this series or blogs, other article shortcuts are available in the pages tab.
You would also like to look at Flume, Cloudera & Hadoop home page for further details.
Requirements of data acquisition strategy
While HDFS works as a good storage infrastructure for storing massive data, providing optimized & fault tolerant data serving capability, we do need separate strategy for acquiring data.
We need high portability or interoperability , scalability and fault tolerance also so that we can reliably gather data from wide range of sources.
Flume is one such platform which can be used to capture data into HDFS.
Flume introduction
Flume is a tool for capturing data from various sources and offers a very flexible way to capture large volume data into HDFS.
It provides features to capture data from multiple sources like logs of web applications or command output.
As a strategy it segregates accumulation of data and consumption of the same.
By providing a storage facility between the two , it can handle a large amount of difference in the accumulation rate and consumption rate.
With persistence between storage and consumption, it guarantees delivery of the captured data.
By having chaining capability and networking between various flume units(agents) it provides a complex and conditional data delivery capability which is also scalable horizontally.
This is not centrally controlled so it is not having any single point of failure.
Flume is designed to run stand alone as JVM process, each one is called an AGENT.
Every capture of data payload as byte array into the Agent is called an EVENT.
Typically , there is a client process , which will be the source of events. Flume can handle various types of data input as shown,-
Avro, log4j, syslog, Http POST(with JSON Body) ,twitter input and Exec source,last one being the out put of a local process call.
Source is a component which is responsible for capturing input. There are multiple type of sources by the nature of input that we are capturing.
Sink is the component which is persisting or delivering the captured data. Depending upon where the data is being delivered, there are various types of sinks like file system, hdfs etc.
Between source and the sink, there is CHANNEL. Channel holds data after it is written to by source and till it is consumed and delivered by sink.
Delivery guarantee is achieved by selecting persistent channel and it holds the data till it is delivered by sink. In case the Flume agent goes down before all of the data is delivered, it can redo delivery once it comes back up as it is retained by channel. However, in case the channel is none persistent, such guarantee can not work.
All configuration changes are reflected without restart and the flume agents can be chained to create complex and conditional topology.
This done by chaining the sink of one agent to the source of another.
Flume set up
Flume gzip file needs to be downloaded and extracted to a folder. Below picture shows the flume version 1.5.2 was downloaded and extracted
We shall be setting up the components as shown below,-
In the below configuration ,The agent is called 'agent1' which will have source 's1' , channel 'c1' and sink 'k1'
s1, c1 and k1 will have relevant properties and at the end source s1 and the sink k1 is linked to channel c1.
Here telnet is taken as source for simple testing, and hdfs path is provided to the sink.
Open flume-conf.properties.template in conf folder under the folder created by extraction of the flume set up, and provide the following entries. All other example entries can be commented.
agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1
# For each one of the sources, the type is defined'
#telnet source indicated by type netcat and port as provided on localhost
agent1.sources.s1.type = netcat
agent1.sources.s1.bind = localhost
agent1.sources.s1.port = 44444
# Each sink's type must be defined, hdfs sink at specified url is given
#if needed, the host name (PRUBNode1) and port should be changed.
#host and port should be in line with hdfs core-site.xml config.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path =hdfs://PRUBNode1:9020/flume/webdata/
# Each channel's type is defined.
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
#link source and sink to channel
agent1.sources.s1.channels = c1
agent1.sinks.k1.channel = c1
Special attention should be given to hdfs.path the property should be as per fs.default.name in core site.The matching entry in my case is as under in core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://PRUBNode1:9020</value>
</property>
The part /flume/webdata is arbitrary and will be created by flume if not already present.
care should be taken for authentication. We are assuming that the user running flume is the same or used sudo to hadoop user.
After HDFS is started and we an see that datanode , name node , secondary name node are active , then we can fire the following command to start flume from the bin folder under the folder created in the extraction of tar file as part of set up,-
./flume-ng agent -n agent1 -c conf -f ../conf/flume-conf.properties.template
flume-ng is the executable command to start the flume following are the parameters explained,-
agent -n : choice of the agent name should come after this , in this case 'agent1'
conf -f: choice of the config file in this case flume-conf.properties.template under conf folder. we updated this properties file few steps before.
Once this command is fired,there will be a lot of out put on screen, however following would be something where the out put should stop and you shall be able to see as under,-
14/11/30 02:02:38 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
Since the flume agent is started as foreground process,terminating the command will stop the agent . So we shall open another terminal and issues following command to connect to flume,-
telnet localhost 44444
Once the prompt comes back, we can use telnet prompt to input data by typing
hello world[return]
telnet returns>OK
first upload[return]
telnet returns>OK
second upload[return]
telnet returns>OK
Now we can go to admin console of hdfs to check file,-
Parag Ray
29-Nov-2014
Introduction
Welcome to the readers!
This is the eighth part of the series of articles. Here we are looking into the needs of uploading data into HDFS or data acquisition tools. We shall be exploring flume as part of the strategy.
If you have not gone through already, you would need to go back to fourth part where multi node set up(see related readings) is discussed, as this post will depend on the multi node set up.
Please note the related readings and target audience section to get help to better follow the blog.
We are using the operating system Ubuntu 14.04. Hadoop version 1.2.1 and Flume version 1.5.2. java version 1.7.
Disclaimer
I do not represent any particular organization in providing this details here.Implementation of tools and techniques discussed in this presentation should be done by responsible Implementer, capable of undertaking such tasks.
Agenda
- Target audience
- Related readings/other blogs
- Requirements of data acquisition strategy
- Flume introduction
- Flume set up
- Test
- This is an intermediate level discussion on Hadoop and related tools.
- Best suited for audience who are looking for introduction to this technology.
- Prior knowledge in java and Linux required.
- Intermediate level understanding of networking necessary.
This is the eighth article of this series or blogs, other article shortcuts are available in the pages tab.
You would also like to look at Flume, Cloudera & Hadoop home page for further details.
Requirements of data acquisition strategy
While HDFS works as a good storage infrastructure for storing massive data, providing optimized & fault tolerant data serving capability, we do need separate strategy for acquiring data.
We need high portability or interoperability , scalability and fault tolerance also so that we can reliably gather data from wide range of sources.
Flume is one such platform which can be used to capture data into HDFS.
Flume introduction
Flume is a tool for capturing data from various sources and offers a very flexible way to capture large volume data into HDFS.
It provides features to capture data from multiple sources like logs of web applications or command output.
As a strategy it segregates accumulation of data and consumption of the same.
By providing a storage facility between the two , it can handle a large amount of difference in the accumulation rate and consumption rate.
With persistence between storage and consumption, it guarantees delivery of the captured data.
By having chaining capability and networking between various flume units(agents) it provides a complex and conditional data delivery capability which is also scalable horizontally.
This is not centrally controlled so it is not having any single point of failure.
Flume is designed to run stand alone as JVM process, each one is called an AGENT.
Every capture of data payload as byte array into the Agent is called an EVENT.
Typically , there is a client process , which will be the source of events. Flume can handle various types of data input as shown,-
Avro, log4j, syslog, Http POST(with JSON Body) ,twitter input and Exec source,last one being the out put of a local process call.
Source is a component which is responsible for capturing input. There are multiple type of sources by the nature of input that we are capturing.
Sink is the component which is persisting or delivering the captured data. Depending upon where the data is being delivered, there are various types of sinks like file system, hdfs etc.
Between source and the sink, there is CHANNEL. Channel holds data after it is written to by source and till it is consumed and delivered by sink.
Delivery guarantee is achieved by selecting persistent channel and it holds the data till it is delivered by sink. In case the Flume agent goes down before all of the data is delivered, it can redo delivery once it comes back up as it is retained by channel. However, in case the channel is none persistent, such guarantee can not work.
All configuration changes are reflected without restart and the flume agents can be chained to create complex and conditional topology.
This done by chaining the sink of one agent to the source of another.
Flume set up
Flume gzip file needs to be downloaded and extracted to a folder. Below picture shows the flume version 1.5.2 was downloaded and extracted
We shall be setting up the components as shown below,-
In the below configuration ,The agent is called 'agent1' which will have source 's1' , channel 'c1' and sink 'k1'
s1, c1 and k1 will have relevant properties and at the end source s1 and the sink k1 is linked to channel c1.
Here telnet is taken as source for simple testing, and hdfs path is provided to the sink.
Open flume-conf.properties.template in conf folder under the folder created by extraction of the flume set up, and provide the following entries. All other example entries can be commented.
agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1
# For each one of the sources, the type is defined'
#telnet source indicated by type netcat and port as provided on localhost
agent1.sources.s1.type = netcat
agent1.sources.s1.bind = localhost
agent1.sources.s1.port = 44444
# Each sink's type must be defined, hdfs sink at specified url is given
#if needed, the host name (PRUBNode1) and port should be changed.
#host and port should be in line with hdfs core-site.xml config.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path =hdfs://PRUBNode1:9020/flume/webdata/
# Each channel's type is defined.
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
#link source and sink to channel
agent1.sources.s1.channels = c1
agent1.sinks.k1.channel = c1
Special attention should be given to hdfs.path the property should be as per fs.default.name in core site.The matching entry in my case is as under in core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://PRUBNode1:9020</value>
</property>
The part /flume/webdata is arbitrary and will be created by flume if not already present.
care should be taken for authentication. We are assuming that the user running flume is the same or used sudo to hadoop user.
After HDFS is started and we an see that datanode , name node , secondary name node are active , then we can fire the following command to start flume from the bin folder under the folder created in the extraction of tar file as part of set up,-
./flume-ng agent -n agent1 -c conf -f ../conf/flume-conf.properties.template
flume-ng is the executable command to start the flume following are the parameters explained,-
agent -n : choice of the agent name should come after this , in this case 'agent1'
conf -f: choice of the config file in this case flume-conf.properties.template under conf folder. we updated this properties file few steps before.
Once this command is fired,there will be a lot of out put on screen, however following would be something where the out put should stop and you shall be able to see as under,-
14/11/30 02:02:38 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
Since the flume agent is started as foreground process,terminating the command will stop the agent . So we shall open another terminal and issues following command to connect to flume,-
telnet localhost 44444
Once the prompt comes back, we can use telnet prompt to input data by typing
hello world[return]
telnet returns>OK
first upload[return]
telnet returns>OK
second upload[return]
telnet returns>OK
Now we can go to admin console of hdfs to check file,-