Posts

Showing posts with the label Hadoop

What is Hadoop?

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. T...
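The "broken into small parts, each run on any node" idea is easiest to see in miniature. Here is a toy sketch of the MapReduce pattern in plain Python: the input is split into fragments, a map function processes each fragment independently (as Hadoop would on separate nodes), and a reduce function merges the intermediate results. The function names and tiny data set are illustrative, not part of the Hadoop API.

```python
from collections import defaultdict

def map_fragment(fragment):
    """Map phase: emit (word, 1) pairs for one fragment of the input."""
    return [(word, 1) for word in fragment.split()]

def reduce_pairs(pairs):
    """Reduce phase: sum the counts for each word across all fragments."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Pretend each fragment lives on its own node in the cluster.
fragments = ["big data", "needs big clusters"]

intermediate = []
for fragment in fragments:          # Hadoop would run these in parallel
    intermediate.extend(map_fragment(fragment))

word_counts = reduce_pairs(intermediate)
print(word_counts)   # {'big': 2, 'data': 1, 'needs': 1, 'clusters': 1}
```

In real Hadoop the map and reduce functions are written in Java, the fragments are HDFS blocks, and the framework handles scheduling the work onto nodes and re-running any fragment whose node fails.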

What Is Hadoop?

Hadoop is a term you will hear over and over again when discussing the processing of big data. You might have also seen the yellow elephant image, the copyrighted icon depicting Hadoop (Hadoop was the name of the toy elephant belonging to the son of the framework's creator, Doug Cutting). In the other post, I broke down the idea of MapReduce in the most easily digestible way possible; here is the same with Hadoop. A little history… Hadoop was born out of a need to process big data, as the amount of generated data continued to grow rapidly. As the Web generated more and more information, indexing that content became quite challenging, so Google created MapReduce in 2004, and Yahoo! then created Hadoop as a way to implement the MapReduce function. Hadoop is now an open-source Apache project. Overall, Hadoop enables applications to work with huge amounts of data stored on various servers. Hadoop's functions allow the existing data to be pulled from vario...

What is Hadoop?

I’m sure you’ve heard about Big Data. If not, I recommend my blog post “What is Big Data?” The best-known technology used for Big Data is Hadoop. It is used by Yahoo, eBay, LinkedIn and Facebook, and it was inspired by Google’s publications on MapReduce, GoogleFS and BigTable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of HDD, without any RAID replication technology), it lets these companies store huge quantities of data (petabytes or even more) at very low cost compared to SAN bay systems. Hadoop is an open-source suite under the Apache Foundation: http://hadoop.apache.org/ . The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop: the Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many small blocks, each of which is replicated and stored on (usua...
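To make the HDFS behavior concrete, here is a toy sketch of what happens when you store a file: the file is cut into fixed-size blocks, and each block is copied onto several nodes. The block size, replication factor, node names and round-robin placement below are tiny illustrative assumptions, not HDFS's actual placement policy; real HDFS defaults are 128 MB blocks and 3 replicas, with rack-aware placement.

```python
BLOCK_SIZE = 4     # bytes, for illustration (HDFS default: 128 MB)
REPLICATION = 3    # copies of each block (HDFS default: 3)
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (naive round-robin)."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop")
placement = place_blocks(blocks, NODES)
print(len(blocks), placement[0])   # 3 ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, losing one machine loses no data: the cluster still holds at least two copies of each block and can re-replicate them elsewhere, which is exactly the fault tolerance the posts above describe.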