Parag Ray
29-Aug-2014
Introduction
Welcome, readers!
The purpose of this blog is to put together all the findings and knowledge that I currently have, and will be gathering, in the field of data science, big data technology and allied application areas.
I hope this collection will help you.
I shall be covering various concepts and technologies, starting from this basic overview. Please look at the target audience and related readings sections to judge whether this suits your needs. This blog is written more like a book and may be edited for correction, addition and expansion.
Agenda
- Target audience
- Related readings/other blogs
- Data science and Big data definition
- Use of Hadoop in big data
- Hadoop at a glance.
- Typical Use cases.
- Types of algorithms for analytics and ML.
- Concepts & Skill base.
- What Hadoop storage looks like.
- The ecosystem.
Target audience
- This is an introductory discussion on data science, big data technologies and Hadoop.
- Best suited for readers who are looking for an introduction to this technology.
- No prior knowledge is required, except for a basic understanding of computing and a high-level understanding of enterprise application environments.
Related readings/other blogs
This is the first post in this series of blogs; other blog titles will be listed here as they are added.
Other article shortcuts are available in the pages tab.
You may also want to look at the Cloudera home page and the Hadoop home page for further details.
Data Science and Big data definition
I am using the Wikipedia definitions here, as I find them very appropriate:
Data science is the study of the generalizable extraction of knowledge from data,[1] yet the key word is science.[2] It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. The subject is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. Another key ingredient that boosted the practice and applicability of data science is the development of machine learning - a branch of artificial intelligence - which is used to uncover patterns from data and develop practical and usable predictive models.
For more details please visit http://en.wikipedia.org/wiki/Data_science.
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."[1]
Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4] The limitations also affect Internet search, finance and business informatics.
For more details please visit http://en.wikipedia.org/wiki/Big_data.
Use of Hadoop in big data
Hadoop has become a very popular platform for big data analysis and data science because it provides a reliable, massively and horizontally scalable data store.
Data is not fetched sequentially from a single vertically scaled system (as is typical with an RDBMS, for example); it is read in parallel from many nodes, which makes retrieval of large data sets relatively fast.
Hadoop was designed for batch processing, but various related technologies enable near-real-time access to data stored in Hadoop.
Although Hadoop is meant for commodity hardware, some vendors have come up with specialized, proprietary hardware that improves the speed of data access even further.
Such hardware is by nature prone to vendor lock-in; for very high-end usage, however, that should not be a problem.
Hadoop at a glance.
Hadoop comprises two main components:
- Hadoop Distributed File System (HDFS)
- MapReduce
HDFS and MapReduce are integrated: MapReduce jobs process the data stored in HDFS.
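To give a feel for the MapReduce programming model, here is a minimal sketch in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are my own illustrative choices, and everything runs in a single process. On a real cluster the framework would distribute these steps across nodes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, which the framework does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence (single process, illustration only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop stores big files"]
    print(word_count(sample))
```

The point of the split is that the map and reduce steps each look at only one line or one key at a time, so they can run on many machines in parallel.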
Typical Use cases.
- E-Commerce.
Site performance metrics such as DNS lookup time and form loading time, as well as the frequency of requests per page, can be derived by analyzing very large log data.
- Banking
Risk analysis
Analytics of data related to various investments and instruments
Market analysis
- Network
Anomalous access detection
- Telecom
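As a small illustration of the e-commerce log analysis use case above, the sketch below counts requests per page and averages DNS lookup and load times. The log format ("page dns_ms load_ms"), the sample lines and the function name summarize are all assumptions made for this example; real web server logs look different and would be far too large for a single machine.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical log format (an assumption): "<page> <dns_lookup_ms> <load_ms>"
SAMPLE_LOG = [
    "/home 12 340",
    "/cart 15 520",
    "/home 10 300",
    "/checkout 20 800",
    "/home 14 320",
]


def summarize(lines):
    """Return request counts, mean DNS lookup time and mean load time per page."""
    hits = Counter()
    dns_times = defaultdict(list)
    load_times = defaultdict(list)
    for line in lines:
        page, dns_ms, load_ms = line.split()
        hits[page] += 1
        dns_times[page].append(float(dns_ms))
        load_times[page].append(float(load_ms))
    mean_dns = {page: mean(times) for page, times in dns_times.items()}
    mean_load = {page: mean(times) for page, times in load_times.items()}
    return hits, mean_dns, mean_load


if __name__ == "__main__":
    hits, mean_dns, mean_load = summarize(SAMPLE_LOG)
    for page, count in hits.most_common():
        print(page, count, mean_dns[page], mean_load[page])
```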
Types of algorithms for analytics and ML.
- Data analysis to find sample/ population characteristics
- Advanced algorithms
There are also more advanced analytics in line with machine learning: algorithms that analyze a data set, try to find a predictive model or a grouping model, and are intelligent enough to find the best-fit parameters by themselves:
- Supervised learning:
In supervised learning, a learning set of x values with known y values helps us find a model, that is, the nature of the predicted function f(x), by estimating the parameter set of the linear or non-linear function used.
Based on this model, we can try to predict y values for other x's for which y is not known.
The challenges include choosing between a linear and a non-linear model, optimizing the algorithm, and standardizing the data so that the analysis meets performance and accuracy requirements. Techniques such as feature scaling and proper selection of parameters will be needed.
Classification algorithms, on the other hand, try to assign data points to categories; one prominent example is OCR.
- Unsupervised learning
- Computational tasks that can be broken into distributed, iterative logic are particularly well suited to Hadoop-based systems.
- Hadoop provides a platform for massive data storage, but analytics is performed with various tools that range from MapReduce (MR) adaptations to Pig, Hive, HBase, R and Mahout.
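The supervised learning idea described above (learn the parameters of f(x) from a learning set, then predict y for new x's) can be sketched in plain Python. This is only a minimal single-feature example, assuming a linear model fitted by batch gradient descent with feature scaling; the function names, learning rate, epoch count and data are illustrative assumptions, not taken from any particular library.

```python
from statistics import mean, pstdev


def standardize(xs):
    """Feature scaling: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs], mu, sigma


def fit_linear(xs, ys, learning_rate=0.1, epochs=500):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(x, w, b, mu, sigma):
    """Apply the same scaling used in training, then evaluate the model."""
    return w * (x - mu) / sigma + b


if __name__ == "__main__":
    # Learning set: x values with known y values (made-up data, y = 0.002 * x + 1).
    train_x = [1000, 2000, 3000, 4000]
    train_y = [3, 5, 7, 9]
    scaled_x, mu, sigma = standardize(train_x)
    w, b = fit_linear(scaled_x, train_y)
    print(predict(5000, w, b, mu, sigma))
```

Without the scaling step, x values in the thousands would make gradient descent diverge at this learning rate, which is one reason feature scaling matters in practice.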
Concepts & Skill base.
Distributed file system
Large files stored in blocks and replicated across machines.
Takes care of network failure and recovery.
Optimizes data access based on topology & access point.
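The idea of storing a large file as blocks and replicating each block across machines can be sketched as follows. This is a toy model only: the function names are hypothetical, the block size and replication factor in the usage lines are illustrative values, and the round-robin placement is a simplification that ignores rack topology.

```python
def split_into_blocks(file_size, block_size):
    """Return an (offset, length) pair for each block of a file."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # A 300 MB file with 128 MB blocks and 3 replicas (illustrative values).
    blocks = split_into_blocks(file_size=300, block_size=128)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    for block_id, (offset, length) in enumerate(blocks):
        print(block_id, offset, length, placement[block_id])
```

Because every block lives on several nodes, losing one machine does not lose data; the surviving replicas can be used to serve reads and to create a fresh copy elsewhere.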
Skills needed for a big data professional
Although there are specializations, the following skills appear to be important.
- Tools expertise. Knowledge of a wide range of tools is required, as no one tool fits all needs.
- Knowledge of statistical principles like sampling, central tendency, variance, correlation, regression analysis and time series, and of various probability distributions like the normal, chi-square and t distributions.
- Knowledge of matrix algebra.
- Knowledge of networking and Linux/Unix operating systems.
- Domain knowledge.
Be ready for heavy intellectual challenges arising out of such optimization.
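A few of the statistical principles listed among the skills above (central tendency, variance, correlation) can be computed with Python's standard library. The data is made up for illustration, and the pearson helper is written by hand here so the example does not depend on a particular Python version.

```python
from math import sqrt
from statistics import mean, median, variance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = [2, 4, 6, 8, 10]
    print("mean:", mean(xs))
    print("median:", median(xs))
    print("sample variance:", variance(xs))
    print("correlation:", pearson(xs, ys))
```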
What does Hadoop storage look like?
- HDFS storage spans nodes (commodity hardware included), racks, clusters and data centers.
- The master in this setup is the NameNode and the slaves are the DataNodes.
- The master NameNode holds the metadata and serves locations: it provides pointers to the DataNodes holding the requested resource.
- DataNodes serve the data directly to clients; the data transfer itself does not pass through the NameNode.
- Data resources are replicated multiple times across DataNodes.
- The NameNode is intelligent enough to provide references to all available replicas of the same resource, ordered by fastest accessibility based on the configured topology.
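The lookup behaviour described in the points above can be modelled with a toy name node that keeps only metadata and returns replica locations sorted by distance from the client. The class and function names are hypothetical, and the distance values (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumed simplification of a rack-aware topology, not Hadoop's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    rack: str
    node: str


def distance(a, b):
    """Toy network distance: 0 same node, 2 same rack, 4 different racks."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 2
    return 4


class NameNode:
    """Holds only metadata: which data nodes hold a replica of each resource."""

    def __init__(self):
        self._replicas = {}

    def register(self, resource, location):
        """Record that a data node holds a replica of the resource."""
        self._replicas.setdefault(resource, []).append(location)

    def locate(self, resource, client):
        """Return all replica locations, closest to the client first."""
        return sorted(
            self._replicas.get(resource, []),
            key=lambda loc: distance(client, loc),
        )


if __name__ == "__main__":
    name_node = NameNode()
    name_node.register("/logs/part-0", Location("rack2", "n5"))
    name_node.register("/logs/part-0", Location("rack1", "n2"))
    name_node.register("/logs/part-0", Location("rack1", "n1"))
    client = Location("rack1", "n1")
    for loc in name_node.locate("/logs/part-0", client):
        print(loc, distance(client, loc))
```

The client would then read from the first location in the list and fall back to the next one if that data node is unavailable.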
The Ecosystem.
It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making the blogs better.