Text Mining in Apache Mahout

Posted by Anonymous on 29 Aug 2013 at 18:51

Lately we've been working on text mining, using clustering techniques to group similar documents together. Apache Mahout has proven an excellent tool for this. Mahout is an open-source library that implements scalable machine learning algorithms. It is fast and integrates well with other popular open-source Apache projects, such as Hadoop and Lucene. One of Mahout's core capabilities is clustering. To perform text mining, take a collection of text documents, represent each document as a feature vector that records which words the document contains, and apply a clustering algorithm. One possible application is grouping blogs into segments that can be targeted for ads.
 
Here's the basic workflow in Mahout:
 
1. Start with a dataset, i.e. a...
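
The idea described above (word-presence feature vectors fed to a clustering algorithm) can be sketched in a few lines of plain Python. This is not Mahout's API; it is only an illustration of the technique. The function names, the sample documents, the choice of k-means with Euclidean distance, and the simple "first k documents" seeding are all assumptions made for the example.

```python
import math

def vectorize(docs):
    """Turn each document into a binary word-presence vector over a shared vocabulary."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    return [[1.0 if word in words else 0.0 for word in vocab] for words in tokenized]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=10):
    """Basic k-means; seeds the centroids with the first k vectors (an assumption
    made to keep the example deterministic, not a recommended seeding strategy)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical sample documents, two about Hadoop and two about football.
docs = [
    "hadoop cluster map reduce jobs",
    "football match goal score",
    "map reduce on a hadoop cluster",
    "goal score in the football match",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

Documents that share a label landed in the same cluster. A real pipeline would typically weight words (for example with TF-IDF) and run on far more data, which is where a distributed library becomes useful.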

Everything about Everyone, in one random access table

Posted by Lawrence Sinclair on 18 Sep 2009 at 16:27

HBase -- Sumit Khanna pointed out this piece of the data processing space. I like to think of it as enabling one big table, big enough for every fact about everyone who ever lived.

HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data.

HBase is an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
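
To make the "one big table" picture concrete, here is a toy in-memory sketch of a sparse, column-oriented table in plain Python. It is not the HBase client API and ignores distribution, persistence, and versioning entirely. The class name, the row keys, and the sample values are hypothetical; the "family:qualifier" column names follow the HBase naming convention.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of a wide, sparse table: row key -> {column -> value}.
    Only the cells that are actually written take up any space."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; the column does not need to exist beforehand."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Random-access read of a single cell."""
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the columns that are populated for one row."""
        return sorted(self._rows.get(row_key, {}))

# Hypothetical rows: different people can carry entirely different columns.
people = SparseTable()
people.put("person-0001", "info:born", "1815")
people.put("person-0001", "work:field", "mathematics")
people.put("person-0002", "info:born", "1912")

print(people.get("person-0001", "work:field"))
print(people.columns("person-0002"))
```

Because rows only store the columns they use, a table can have millions of possible columns without every row paying for them.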

Data Processing Performance Options

Posted by Lawrence Sinclair on 14 Sep 2009 at 04:19

Here are a few of my thoughts on technologies and approaches for achieving better data processing performance in the current technology landscape.

RDBMS
Using MySQL or another RDBMS, performance might be addressed with better indexing or by partitioning the data.
MAP-REDUCE NON-RELATIONAL SYSTEMS
A non-relational approach might be to use Hadoop or one of its distributions (such as Cloudera). This would allow processing to be distributed across anywhere from 3 local machines to a virtually unlimited number (hundreds or more) of machines in the cloud (such as Amazon EC2). But this is best suited to analytic and data processing tasks that can take several minutes or hours.
THE BEST OF BOTH WORLDS?!
Somewhere in between these two systems is HadoopDB by Daniel Abadi of Yale. It uses the Hadoop...

Mobile Platform Analytics

Posted by Lawrence Sinclair on 27 Jul 2009 at 11:36

Recently, we've been doing a lot of analytics work for clients with social networks and games on mobile platforms ranging from basic phones using SMS to iPhones. The focus in these cases is on discovering the drivers behind customer behavior and developing viable business models in a rapidly changing competitive landscape. Our onsite PhD analytics guru, Tim, has been really key in making this part of our business work, supported by the rest of the offsite team at East Agile. Tim carries some pretty powerful tools in his belt, including Hadoop on EC2, SAS, and some special joint-entropy optimization tools.

Hadoop: free scalable data processing

Posted by Lawrence Sinclair on 10 Jan 2009 at 04:53

Hadoop -- If you're a startup and think you have a lot of data, then the cool solution to your data processing problems is to use this technology. Hadoop is an open-source distributed system for reading and transforming ("map"), then sorting and summarizing ("reduce"), raw text data on an arbitrarily large network of cheap computers.
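
The map, sort, and reduce steps described above can be illustrated with a single-machine word count in plain Python. This is only a sketch of the programming model, not Hadoop's API, and nothing here is distributed. The function names and the sample lines are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: read raw text and emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sort the pairs by key, then reduce: sum the counts for each word."""
    shuffled = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(shuffled, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Hypothetical raw text input, one record per line.
lines = ["cheap computers", "raw text data", "cheap data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)
```

In a real cluster, the map calls run on many machines at once, and the sort step is what routes all pairs with the same key to the same reducer.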

In some specialized cases, Hadoop is becoming a competitor to commercial ETL ("Extract-Transform-Load") tools such as Informatica and SAS. Hadoop is free and far more scalable than commercial alternatives. However, it is less flexible and less user-friendly, has no built-in reporting or analytic capabilities, and has no database loading capabilities, leaving data in the same flat-file form in which it must arrive. In the right applications,...