Data Processing Performance Options

Here are a few of my thoughts on technologies and approaches for achieving better data processing performance in the current technology landscape.

RDBMS

Using mySQL or another RDBMS, performance might be addressed with better indexing or by partitioning the data.

MAP-REDUCE NON-RELATIONAL SYSTEMS

A non-relational approach might be to use Hadoop or one of its distributions (such as Cloudera). This would allow processing to be distributed anywhere from 3 local machines, or a virtually unlimited (hundreds+++) number of machines on the cloud (such as Amazon EC2). But this is best suited for analytic and data processing tasks that can takes several minutes or hours.

THE BEST OF BOTH WORLDS?!

Somewhere in between these two systems is HadoopDB by Daniel Abadi of Yale. It uses the Hadoop distributed execution architecture as well as its distributed sort/process/load abilities to operate in a completely linearly scalable fashion across multiple local machines or vast numbers of cloud-based machines. But to each node, it integrates a RDBMS (postgreSQL, or possibly mySQL).

This allows a highly scalable open source (free) distributed relational database. Data can be partitioned and distributed among multiple nodes, indexed, and queried using SQL. A query that takes 20 seconds on one mySQL machine, might take 2 seconds if distributed among 10 machines (it might take less, but then there is overhead to distributing the query). If you find that between 9am and 11am each morning demands are high, you can use 30 machines or 300 machines. Or you might find using 3 machines is adequate overnight. Similarly, if you end up with a flood of new usage (new users) then it is easy to scale linearly. See peformance comparisons: http://radar.oreilly.com/2009/07/hadoopdb-an-open-source-parallel-database.html.HadoopDB is very new stuff, so the normal bleeding edge caveats should apply.

OTHER NON-RELATIONAL SYSTEMS

Tokyo Cabinet is essentially a very efficient flat-file database, that can be distributed to the cloud. It is a recommended part of the Engine Yard stack. Tokyo Cabinet is not redundant like Hadoop. But it focuses on search/query, rather than analytics, and supports a lot of specialized table structures. Like Hadoop, it can be distributed to the cloud (using Tokyo Tyrant) to achieve linear scalability. There are simple Ruby interfaces to Tokyo Cabinet. Bleeding edge caveats apply here too.

OTHER CONVENTIONAL TECHNOLOGIES

SAS has been around for a long time, and if budgets support its use (USD 15000 is an entry level annual license), it is an excellent choice for data processing. It integrates exceptionally strong advanced analytics capabilities with rapid and powerful data processing and storage infrastructure, including the ability to execute processes in parallel.

Other technologies, such as Informatica, are widely used, can achieve high performance, but in my opinion do not provide much benefit compared to open source alternatives like Hadoop, especially considering their exceptionally high license fees. Their day has come, and gone.