
Text Mining in Apache Mahout


Posted by Anonymous on 29 Aug 2013 at 18:51

Lately we've been working on text mining, using clustering techniques to group together similar documents. Apache Mahout has proven an excellent tool for this. Mahout is an open-source library that implements scalable machine learning algorithms. It is very fast and integrates well with other popular open-source Apache projects, such as Hadoop and Lucene. One of Mahout's core capabilities is clustering. To perform text mining, you take a collection of text documents, represent each document as a feature vector recording which words it contains, and apply a clustering algorithm. One possible application is grouping blogs into segments that can be targeted for ads.
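The feature-vector idea can be sketched in a few lines. This is a conceptual illustration in Python, not Mahout's actual Java API, and the tokenizer below is a crude stand-in for a Lucene analyzer:

```python
import re

def tokenize(text):
    """Crude stand-in for a Lucene analyzer: lowercase, split on non-letters."""
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]

def to_feature_vector(doc):
    """Map a document to a sparse {word: count} vector."""
    vec = {}
    for word in tokenize(doc):
        vec[word] = vec.get(word, 0) + 1
    return vec

doc = "The crop report from the agriculture department"
print(to_feature_vector(doc))
# {'the': 2, 'crop': 1, 'report': 1, 'from': 1, 'agriculture': 1, 'department': 1}
```

In practice you would also weight the counts (step 3 below) rather than use raw frequencies.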
Here's the basic workflow in Mahout:
1. Start with a dataset, i.e. a collection of documents, where each document is a body of text. In the case of blog posts, you might choose to represent each post as the combination of the post along with the comments it received.
2. Convert the dataset into a SequenceFile. Each document is represented as a key-value pair. For example:
key: "awesome-blog-123", value: "This is the content of some crazy blog, it goes on and on ..."
SequenceFile is a Hadoop file format that is very convenient for distributed computations.
3. Convert each document into a feature vector. The vector has a nonzero weight for each word present in the document, and the weights can optionally be scaled against the percentage of documents that contain each word (so that, for example, a rarer word gets a higher weight). This step takes advantage of Apache Lucene to tokenize documents into words with a Lucene analyzer. You can use the default Lucene analyzer, or build your own analyzer to do things like apply a custom stopword list or stemming. There are many other options, such as ignoring words that occur in too many or too few of the documents.
4. Apply a clustering algorithm, such as k-means. Depending on which algorithm you choose, there are various options, such as how to measure the distance between two documents. In my experience, the Tanimoto distance measure works best for text mining. There is also an option (which I always enable) to save the cluster assignments.
5. View the results. Mahout produces a nice report showing the keywords and the number of documents in each cluster. Here is an example of two of the clusters (k-means performed on the Reuters-21578 news article dataset with k=60 clusters):
    :VL-1923{n=204 c=[0.1:0.002, ...
            Top Terms:
                    usda                                    => 0.06753121243377262
                    crop                                    =>0.060854562343767626
                    agriculture                             =>0.052520322074035335
                    department                              =>0.047680346785091234
                    agriculture department                  => 0.04585521454744383
                    weather                                 =>0.043366392581595054
                    u.s agriculture                         =>0.040774120617042435
                    u.s                                     =>  0.0373052876175809
                    winter                                  => 0.03153209838656951
                    wheat                                   =>  0.0311025438002388
                    corn                                    => 0.03093221202278233
                    acres                                   =>0.030793208105163974
                    report                                  => 0.02956791296402638
                    grain                                   =>0.028564292016765833
                    farmers                                 =>0.027331930526710808
    :VL-4959{n=177 c=[00.07:0.001, ...
            Top Terms:
                    gas                                     => 0.10623738021091945
                    natural gas                             => 0.06380484169276382
                    natural                                 => 0.06174956508835678
                    plant                                   => 0.03982467383734964
                    cubic                                   => 0.03337140109800385
                    company                                 => 0.03196306851031449
                    feet                                    => 0.03176449790815804
                    energy                                  => 0.03030570701579984
                    cubic feet                              => 0.02978411802350815
                    power                                   => 0.02976383599057059
                    oil                                     => 0.02853383729688095
                    unit                                    => 0.02700221281318857
                    mln                                     =>0.024737173544495786
                    nuclear                                 =>0.023698185273465604
                    electric                                =>0.022483151984386215
VL-1923 is the name of a cluster, n is the number of documents it contains, and the c vector is the cluster's center. So, for example, the center's component along the dimension for the token '0.1' is 0.002. The Top Terms are the dimensions with the largest components of the center. If you print out enough of the c=... line, you eventually see 'crop:0.061', which corresponds to the second top term, 'crop => 0.060854562343767626'.
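Steps 3 and 4 can be illustrated with a small self-contained sketch. This is Python for illustration only (in Mahout these steps run as Hadoop jobs); the distance function follows the standard Tanimoto formula for weighted vectors:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weighting: term count scaled by log(N / document-frequency),
    so words that appear in many documents get a lower weight."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc.split()))
    return [{w: tf * math.log(n / df[w])
             for w, tf in Counter(doc.split()).items()}
            for doc in docs]

def tanimoto_distance(a, b):
    """Standard Tanimoto distance for weighted sparse vectors:
    1 - dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    denom = na + nb - dot
    return 0.0 if denom == 0 else 1.0 - dot / denom

docs = ["usda crop weather report",
        "gas oil energy company",
        "crop wheat corn report"]
vecs = tfidf_vectors(docs)
# The two farming stories share terms and so are closer than the energy story:
assert tanimoto_distance(vecs[0], vecs[2]) < tanimoto_distance(vecs[0], vecs[1])
```

k-means then simply iterates: assign each vector to its nearest center under this distance, recompute the centers, and repeat until the assignments stabilize.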
For the Reuters dataset of 21,578 news articles, once the parameters are set, it takes only a few minutes to run all of these steps in Mahout on my local machine. This is a fairly small dataset. Once you're dealing with large enough datasets, it makes sense to switch over to distributed computing with Hadoop. No change to the code is necessary; you just tell Mahout to switch over to distributed mode.
There are many more things you can do from here. For example, say you have 360,000 documents. You could first cluster them into 60 clusters of about 6,000 documents each. Then you could cluster each of those 60 clusters into 60 sub-clusters. The result is 3,600 sub-clusters of about 100 documents each. This is an example of double-tiered clustering. It simultaneously provides a nice hierarchy for interpreting the results and speeds up the computation by an order of magnitude (compared to making 3,600 clusters from the start).
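The sub-cluster arithmetic (60 top-level clusters × 60 sub-clusters = 3,600 sub-clusters of 360,000 / 3,600 = 100 documents each) can be checked with a toy sketch. The cluster() function below is a stand-in for a full k-means run, just to make the two-tier structure concrete:

```python
def cluster(items, k):
    """Toy stand-in for a k-means run: partition items round-robin into k groups."""
    groups = [[] for _ in range(k)]
    for i, item in enumerate(items):
        groups[i % k].append(item)
    return groups

def two_tier(items, k_top, k_sub):
    """Cluster into k_top groups, then cluster each group into k_sub sub-groups."""
    return [sub for top in cluster(items, k_top) for sub in cluster(top, k_sub)]

docs = range(360_000)                  # stand-ins for 360,000 documents
subclusters = two_tier(docs, 60, 60)
print(len(subclusters))                # 3600 sub-clusters
print(len(subclusters[0]))             # 100 documents each
```

The speedup comes from the distance computations: each k-means iteration compares every document against every center, so the flat run compares documents to thousands of centers while each tier here compares them to only 60.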
Happy text mining!
References for getting started
1. Mahout in Action, an excellent book on how to use Mahout.
2. An overview of how to cluster from the command line.
3. Instructions for running Mahout in the cloud on Amazon Elastic MapReduce. The general idea is good, though the versions used are slightly dated; it is especially useful when combined with current EMR documentation.