Lately we've been working on text mining, using clustering techniques to group together similar documents. Apache Mahout has proven an excellent tool for this. Mahout is an open-source library that implements scalable machine learning algorithms. It is very fast and integrates well with other popular open-source Apache projects, such as Hadoop and Lucene. One of Mahout's core capabilities is clustering. To perform text mining, you simply take a collection of text documents, represent each document as a feature vector describing which words it contains, and apply a clustering algorithm. A possible application is grouping blogs into different groups that can be targeted for ads.
Here's the basic workflow in Mahout:
1. Start with a dataset, i.e. a collection of documents, where each document is a body of text. In the case of blog posts, you might choose to represent each post as the post text combined with the comments it received.
2. Convert the dataset into a SequenceFile. Each document is represented as a key-value pair. For example:
key: "awesome-blog-123", value: "This is the content of some crazy blog, it goes on and on ..."
SequenceFile is a Hadoop file format that is very convenient for distributed computations.
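If your documents start out as individual text files in a directory, Mahout's seqdirectory job handles this conversion. Here is a minimal sketch; the paths are placeholders, and the exact flag names can differ slightly between Mahout versions:

# Convert a directory of plain-text files into a SequenceFile
# (each document's key is derived from its file name)
bin/mahout seqdirectory \
  --input /path/to/blog-posts \
  --output /path/to/blog-seqfiles \
  --charset UTF-8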
3. Convert each document into a feature vector. The vector has a nonzero weight for each word present in the document, and the weights can optionally be scaled by how common each word is across the whole collection (for example, a rarer word gets a higher weight, as in TF-IDF). This step takes advantage of Apache Lucene to tokenize documents into words with a Lucene analyzer. You can just use the default Lucene analyzer, or you can build your own analyzer to do things like apply a custom stopword list or stemming. There are many other options, such as ignoring words that occur in too many or too few of the documents.
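This step corresponds to Mahout's seq2sparse job. A rough sketch, where the TF-IDF weighting, the document-frequency thresholds, and the analyzer are just example choices (and, as above, flag names can vary by version):

# Turn the SequenceFile of documents into sparse TF-IDF vectors.
# --minDF and --maxDFPercent drop words that appear in too few or
# too many of the documents; the values here are only examples.
bin/mahout seq2sparse \
  --input /path/to/blog-seqfiles \
  --output /path/to/blog-vectors \
  --weight tfidf \
  --analyzerName org.apache.lucene.analysis.standard.StandardAnalyzer \
  --minDF 2 \
  --maxDFPercent 85 \
  --namedVector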
4. Apply a clustering algorithm, such as k-means. Depending on which algorithm you choose, there are various options, such as how to measure the distance between two documents. The Tanimoto distance measure seems to work best for text mining. There is an option (which I always enable) to save the cluster assignments.
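For k-means this is the kmeans job. Another hedged sketch; k=60, the iteration limit, and the paths are example values:

# Run k-means with the Tanimoto distance measure.
# With --numClusters set, Mahout seeds the --clusters directory with
# random initial centroids; --clustering saves each document's cluster
# assignment under clusteredPoints in the output directory.
bin/mahout kmeans \
  --input /path/to/blog-vectors/tfidf-vectors \
  --clusters /path/to/blog-clusters/initial \
  --output /path/to/blog-clusters \
  --numClusters 60 \
  --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  --maxIter 10 \
  --clustering \
  --overwrite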
5. View the results. Mahout produces a nice report showing the top keywords and the number of documents in each cluster. Here is an example of two of the clusters (k-means performed on the Reuters-21578 news article dataset with k=60 clusters):
:VL-1923{n=204 c=[0.1:0.002, ...
Top Terms:
usda => 0.06753121243377262
crop => 0.060854562343767626
agriculture => 0.052520322074035335
department => 0.047680346785091234
agriculture department => 0.04585521454744383
weather => 0.043366392581595054
u.s agriculture => 0.040774120617042435
u.s => 0.0373052876175809
winter => 0.03153209838656951
wheat => 0.0311025438002388
corn => 0.03093221202278233
acres => 0.030793208105163974
report => 0.02956791296402638
grain => 0.028564292016765833
farmers => 0.027331930526710808
:VL-4959{n=177 c=[00.07:0.001, ...
Top Terms:
gas => 0.10623738021091945
natural gas => 0.06380484169276382
natural => 0.06174956508835678
plant => 0.03982467383734964
cubic => 0.03337140109800385
company => 0.03196306851031449
feet => 0.03176449790815804
energy => 0.03030570701579984
cubic feet => 0.02978411802350815
power => 0.02976383599057059
oil => 0.02853383729688095
unit => 0.02700221281318857
mln => 0.024737173544495786
nuclear => 0.023698185273465604
electric => 0.022483151984386215
VL-1923 is the name of a cluster, n is the number of documents it contains, and the c vector is the center of the cluster. So, for example, the component of the center of this cluster along the dimension for the token '0.1' is 0.002. The Top Terms are the terms whose components in the center are largest. If you print out enough of the c=... line, you eventually see 'crop:0.061', which corresponds to the second top term, 'crop => 0.060854562343767626'.
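The report itself comes from Mahout's clusterdump utility. Here is a sketch of how you might invoke it; the paths are placeholders, and the flag names (for example -i vs. -s for the input) depend on the Mahout version:

# Dump the final clusters as a human-readable report.
# Point --input at the clusters-<N>-final directory written by k-means;
# the dictionary file from seq2sparse maps term IDs back to words.
bin/mahout clusterdump \
  --input /path/to/blog-clusters/clusters-*-final \
  --dictionary /path/to/blog-vectors/dictionary.file-0 \
  --dictionaryType sequencefile \
  --pointsDir /path/to/blog-clusters/clusteredPoints \
  --numWords 15 \
  --output clusters-report.txt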
For the Reuters dataset of 21,578 news articles, once the parameters are set, it takes only a few minutes total to run all these steps in Mahout on my local machine. This is a fairly small dataset. Once you're dealing with large enough datasets, it makes sense to switch over to distributed computing with Hadoop. No change to the code is necessary; you just tell Mahout to switch over to distributed mode.
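With the command-line driver, this switch is controlled by the environment rather than by the commands themselves. Roughly, assuming the stock bin/mahout launcher script (details depend on your setup):

# Run everything locally on a single machine
export MAHOUT_LOCAL=true

# ... or submit the same jobs to a Hadoop cluster instead:
unset MAHOUT_LOCAL
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf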
There are many more things you can do from here. For example, say you have 360,000 documents. You could first cluster them into 60 clusters of about 6,000 documents each. Then you could cluster each of those 60 clusters into 60 sub-clusters. The result is 3,600 sub-clusters of about 100 documents each. This is an example of double-tiered clustering. It simultaneously provides a nice hierarchy for interpreting the results and speeds up the computation by an order of magnitude (compared to making 3,600 clusters from the start).
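There is no single Mahout command for the second tier, but the idea can be sketched as a loop over the first-tier clusters. The split-by-cluster step below is a hypothetical helper, not a stock Mahout command: it stands in for a small script or job that reads clusteredPoints and writes each first-tier cluster's vectors into its own directory. Everything else reuses the kmeans invocation from step 4:

# Hypothetical: split the first-tier cluster assignments into one
# directory of vectors per cluster (you would write this part yourself).
./split-by-cluster /path/to/blog-clusters/clusteredPoints /path/to/tier1

# Re-run k-means on each first-tier cluster to get the sub-clusters.
for dir in /path/to/tier1/*; do
  name=$(basename "$dir")
  bin/mahout kmeans \
    --input "$dir" \
    --clusters /path/to/tier2/"$name"/initial \
    --output /path/to/tier2/"$name" \
    --numClusters 60 \
    --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure \
    --maxIter 10 \
    --clustering \
    --overwrite
done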
Happy text mining!
References for getting started
1. Mahout in Action, an excellent book on how to use Mahout.
2. An overview of how to cluster from the command line.
3. Instructions for running Mahout in the cloud on Amazon Elastic MapReduce. The general idea is good, though the versions used are slightly dated; it is especially useful when combined with the current EMR documentation.