You probably haven’t heard of Apache Hadoop unless you work in the world of Big Data, in which case you already know how big a deal it has become in the past few years. For those new to the term, there’s a reason it flies under the radar: Hadoop is one of those computer technologies that sits behind the scenes, out of the limelight.
Hadoop is an open source framework that enables data-intensive distributed applications to process gigantic amounts of data efficiently. At its heart is an implementation of MapReduce, an approach to processing data invented at Google to deal with the massive quantities of data involved in indexing the web.
Typically deployed on large clusters of hundreds, or sometimes thousands, of computers, the Hadoop MapReduce runtime is tasked with partitioning the incoming data and scheduling program execution across the entire set of machines, all the while taking care of inter-machine communication and any machine failures that may occur.
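To make the programming model concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). Treat it as an illustrative sketch rather than a production program; the WordCount class name and the input/output paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: called once per input record; emits a (word, 1) pair for
  // every token. The framework runs many mappers in parallel, one per
  // partition ("split") of the input data.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the runtime shuffles all pairs with the same word to
  // the same reducer, so summing the values yields the total count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged into a jar and submitted with something along the lines of "hadoop jar wordcount.jar WordCount /input /output". Notice what the program does not contain: no code for splitting the input, moving intermediate pairs between machines, or retrying failed tasks. All of that is the runtime's job, which is precisely what makes the model attractive at this scale.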
Beyond enabling the distributed processing of very large data sets across these clusters, the software library is able to detect and handle failures at the application layer, which makes for a high-availability system architecture.
To give you an idea of the scale involved, Yahoo! now has over 42,000 servers in 1,200 racks running Hadoop across four data centers, with the largest single cluster comprising 4,000 servers that process over 16 petabytes of data. Facebook, another major user of Hadoop, believes it has the world’s largest collection of data stored on the Hadoop Distributed File System (HDFS): over 100 petabytes’ worth, spread out over 100 different clusters across its data centers.
And how is all this data stored? Historically, that job has fallen to hard drives. Lots of them, to be sure, but spinning disks have long been the best option available. That’s beginning to shift with the advent of high-density SSD storage from companies such as Fusion-io, XtremIO and Violin Memory.
According to Don Basile, CEO of Violin Memory, big data applications built on Hadoop could run faster and more efficiently on Violin’s flash arrays, since servers would spend far less time waiting on I/O. He expects the number of servers needed by Hadoop applications could be cut by a factor of ten as a result, lowering the cost of a Hadoop-capable server and storage infrastructure.
And don’t assume this advanced technology is confined to social networks or enterprise applications. As the volume of data streaming from aerial surveillance platforms continues to grow exponentially, government and defense applications dealing with image and video storage and analysis will become prime targets for Apache Hadoop on systems such as Trenton’s TSS5203 rackmount storage server.