Data Processing Performance Options


Posted by Lawrence Sinclair on 14 Sep 2009 at 04:19

Here are a few of my thoughts on technologies and approaches for achieving better data processing performance in the current technology landscape.

Using MySQL or another RDBMS, performance might be addressed with better indexing or by partitioning the data.
A non-relational approach might be to use Hadoop or one of its distributions (such as Cloudera). This would allow processing to be distributed across anywhere from 3 local machines to a virtually unlimited number (hundreds or more) of machines in the cloud (such as Amazon EC2). But this is best suited to analytic and data processing tasks that can take several minutes or hours.
Somewhere in between these two systems is HadoopDB by Daniel Abadi of Yale. It uses the Hadoop...

Hadoop: free scalable data processing


Posted by Lawrence Sinclair on 10 Jan 2009 at 04:53

Hadoop -- If you're a startup and think you have a lot of data, then the cool solution to your data processing problems is to use this technology. Hadoop is an open source distributed system for reading and transforming ("map") then sorting and summarizing ("reduce") raw text data on an arbitrarily large network of cheap computers.
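
To make the "map" and "reduce" steps concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API and does nothing distributed; the function names (map_phase, reduce_phase, word_count) are made up for illustration, and word counting is just an assumed example task. It only shows the shape of the flow: read and transform raw text lines into key/value pairs, sort by key, then summarize each group.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Read raw text lines and emit (key, value) pairs; here, (word, 1)."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Sort the pairs by key, then summarize each group; here, sum the counts."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def word_count(lines):
    """Run the map step, then the sort-and-reduce step, in one process."""
    return dict(reduce_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["cheap computers", "cheap data"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps would run on many machines at once, with the framework handling the sort and the movement of data between them; that part is left out here.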

In some specialized cases, Hadoop is becoming a competitor to commercial ETL ("Extract-Transform-Load") tools such as Informatica and SAS. Hadoop is free and far more scalable than commercial alternatives. However, it is less flexible and less user friendly, has no built-in reporting or analytic capabilities, and has no database loading capabilities, leaving data in the same flat-file form in which it must arrive. In the right applications,...