< Back

Data Processing Performance Options


Posted by Lawrence Sinclair on 14 Sep 2009 at 04:19

Here are a few of my thoughts on technologies and approaches for achieving better data processing performance in the current technology landscape. 

Using mySQL or another RDBMS, performance might be addressed with better indexing or by partitioning the data.
A non-relational approach might be to use Hadoop or one of its distributions (such as Cloudera). This would allow processing to be distributed anywhere from 3 local machines, or a virtually unlimited (hundreds+++) number of machines on the cloud (such as Amazon EC2). But this is best suited for analytic and data processing tasks that can takes several minutes or hours.
Somewhere in between these two systems is HadoopDB by Daniel Abadi of Yale. It uses the Hadoop distributed execution architecture as well as its distributed sort/process/load abilities to operate in a completely linearly scalable fashion across multiple local machines or vast numbers of cloud-based machines. But to each node, it integrates a RDBMS (postgreSQL, or possibly mySQL). 
This allows a highly scalable open source (free) distributed relational database. Data can be partitioned and distributed among multiple nodes, indexed, and queried using SQL. A query that takes 20 seconds on one mySQL machine, might take 2 seconds if distributed among 10 machines (it might take less, but then there is overhead to distributing the query). If you find that between 9am and 11am each morning demands are high, you can use 30 machines or 300 machines. Or you might find using 3 machines is adequate overnight. Similarly, if you end up with a flood of new usage (new users) then it is easy to scale linearly. See peformance comparisons: http://radar.oreilly.com/2009/07/hadoopdb-an-open-source-parallel-database.html.HadoopDB is very new stuff, so the normal bleeding edge caveats should apply.
Tokyo Cabinet is essentially a very efficient flat-file database, that can be distributed to the cloud. It is a recommended part of the Engine Yard stack. Tokyo Cabinet is not redundant like Hadoop. But it focuses on search/query, rather than analytics, and supports a lot of specialized table structures. Like Hadoop, it can be distributed to the cloud (using Tokyo Tyrant) to achieve linear scalability. There are simple Ruby interfaces to Tokyo Cabinet. Bleeding edge caveats apply here too.
SAS has been around for a long time, and if budgets support its use (USD 15000 is an entry level annual license), it is an excellent choice for data processing. It integrates exceptionally strong advanced analytics capabilities with rapid and powerful data processing and storage infrastructure, including the ability to execute processes in parallel.
Other technologies, such as Informatica, are widely used, can achieve high performance, but in my opinion do not provide much benefit compared to open source alternatives like Hadoop, especially considering their exceptionally high license fees. Their day has come, and gone.


1 comment
  1. Alex Nguyen - Sep 14 2009

    HadoopDB is interesting. Have you take a look at MongoDB? (http://www.mongodb.org/). Currently is a FAST document-oriented storage engine.It's nearly as fast as key-value store engine (Tokyo, Redis ...) and faster than CouchDB (with some trade-of: don't support MVCC for example). MongoDB will be very powerful when upcoming features are completed: * Shading (Scale to Cloud like Google BigTable) * MapReduce (JavaScript code like CouchDB) Alex.

Leave a comment