Hadoop: free scalable data processing

If you're a startup and think you have a lot of data, then Hadoop is the cool solution to your data processing problems. Hadoop is an open source distributed system for reading and transforming ("map"), then sorting and summarizing ("reduce"), raw text data on an arbitrarily large network of cheap computers.
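The map, sort, and reduce steps can be sketched in a single Python process. This is only an illustration of the flow, not Hadoop's API: the word-count task and the names map_phase, reduce_phase, and run_job are assumptions made up for this example, and a real cluster would spread each phase across many machines.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Read raw text lines and emit (key, value) pairs: here, (word, 1)."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(sorted_pairs):
    """Summarize each run of identical keys: here, sum the counts per word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(lines):
    """Map, then sort by key, then reduce, all in one process (hypothetical helper)."""
    mapped = map_phase(lines)
    # Sorting brings identical keys together so the reducer sees them as one run.
    shuffled = sorted(mapped, key=itemgetter(0))
    return dict(reduce_phase(shuffled))


if __name__ == "__main__":
    sample_lines = ["big data big cluster", "cheap cluster"]
    print(run_job(sample_lines))
```

The sort in the middle is what lets the reducer work on one key at a time without holding the whole dataset in memory, which is why the two phases can be spread across many cheap machines.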

In some specialized cases, Hadoop is becoming a competitor to commercial ETL ("Extract-Transform-Load") tools such as Informatica and SAS. Hadoop is free and far more scalable than the commercial alternatives. However, it is less flexible and less user-friendly, has no built-in reporting or analytic capabilities, and has no database loading capabilities, so its output stays in the same flat-file form that its input must arrive in. In the right applications, Hadoop is revolutionary in terms of its price-performance.

While fewer and fewer searches are being made for ETL, interest in Hadoop has grown rapidly since its release. Other open source distributed computing environments have failed to capture the public's interest. These trends are shown in the following Google search statistics.

Trend: a rapid increase in interest in Hadoop

Many users with a lot of data to process could meet their needs with a cluster of ten cheap Linux boxes running Hadoop, costing perhaps $10,000. The results might be comparable to what they would get from a commercial hardware and software combination costing more than ten times as much. But either approach may be a matter of working harder rather than smarter. For many applications, working with an appropriate random sample will provide all the necessary accuracy in a fraction of the time and cost. Processing a five percent random sample can often be more than twenty times faster than processing the full dataset while producing practically indistinguishable results. But this is a conceptual leap too far for many engineering teams. Those who get it can use Hadoop too, and be an order of magnitude faster than other Hadoop users and two orders of magnitude faster than the conventional ETL teams typically found in Fortune 500 data warehouse departments.
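The sampling idea can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions: the records are synthetic stand-ins rather than real data, the summary statistic is a plain mean, and bernoulli_sample and mean are hypothetical helper names invented for the example.

```python
import random


def bernoulli_sample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [r for r in records if rng.random() < fraction]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Synthetic stand-in for a large dataset (an assumption, not real data).
    records = [float(i % 1000) for i in range(100_000)]
    sample = bernoulli_sample(records, fraction=0.05)
    print("full mean:  ", mean(records))
    print("sample mean:", mean(sample))
    print("rows used:  ", len(sample), "of", len(records))
```

For a statistic like a mean, the sampling error shrinks with the square root of the sample size, so a sample of a few thousand rows is often already close to the full answer. Whether that holds for a particular job depends on the statistic and on how skewed the data is, so it is worth checking the sample against a full run once before relying on it.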