#Hadoop, which was named after the toy elephant of its inventor’s son, was developed because the existing data storage and processing tools proved inadequate for the large amounts of data that started to appear after the internet bubble. It was #Google that first developed the #MapReduce paradigm to cope with the flow of data generated by its mission to organize the world’s information and make it universally accessible and useful. #Yahoo in turn developed Hadoop in 2005 as an implementation of #MapReduce. It was released as an open source tool in 2007 under the #Apache license.
Over the years, Hadoop has evolved into an operating system at very large scale, especially focused on the distributed and parallel processing of the vast amounts of data created nowadays. As with any ‘normal’ operating system, Hadoop provides a file system, runs programs, manages the distribution of those programs across machines and returns the results afterwards.
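To show what the ‘file system’ part of that analogy looks like in practice, here is a minimal sketch using Hadoop’s Java client for its file system. It assumes the cluster address is picked up from a core-site.xml on the classpath, and the /tmp/hello.txt path is just a made-up example; it writes a small file and then lists a directory much like ‘ls’ would.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml, so this talks to
        // whatever cluster the client happens to be configured for.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file into the distributed file system
        // (the path is purely illustrative).
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // List the directory, much like 'ls' on a local file system.
        for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}
```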
Hadoop supports data-intensive distributed applications that can run simultaneously on large clusters of normal, commodity hardware. It is licensed under the Apache v2 license. A Hadoop network is reliable and extremely scalable, and it can be used to query massive data sets. Hadoop is written in the #Java programming language, meaning it can run on any platform, and it is used by a global community of contributors and big data technology vendors who have built layers on top of Hadoop.
The feature that makes Hadoop so useful is the Hadoop Distributed File System ( #HDFS ). This is the storage system of Hadoop, which breaks the data it stores into smaller pieces, called blocks. These blocks are subsequently distributed throughout a cluster. Distributing the data this way allows the map and reduce functions to be executed on smaller subsets instead of on one large data set. This increases efficiency, reduces processing time and enables the scalability necessary for processing vast amounts of data.
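To make the map and reduce functions concrete, the classic word-count job below is a sketch using Hadoop’s Java MapReduce API. The input and output paths passed on the command line are placeholders for whatever HDFS directories you use. Each mapper works on one block of the input and emits (word, 1) pairs; the reducers then sum the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map function runs in parallel on each block of the input,
  // emitting a (word, 1) pair for every word it sees.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce function receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the mappers run wherever the blocks happen to be stored, the work moves to the data rather than the data to the work, which is what lets a job like this scale across a large cluster.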