The Project

The aim of this project is to extract, from a large corpus of text, the top R terms of varying lengths M through N that are especially interesting according to the TF*IDF measure. The goal is to do this as fast as possible by using multiple processors while still retaining accuracy. We use the Gigaword corpus (a large archive of newspaper stories) as the input to this system.

Our approach to this problem makes use of data structures such as suffix arrays, LCP (longest common prefix) vectors, class arrays, and hash tables. Work is distributed using a manager-worker strategy in which one processor is responsible for all I/O and distributes data to all the other processors. Results are collected using a binomial tree communication pattern among the processors so that no single processor gets buried under a mountain of data.
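
To show why a binomial tree keeps any one processor from being overloaded, here is a minimal single-process sketch of the pairing schedule. It contains no MPI calls: the array value[] stands in for the partial result held by each rank, and all names (binomial_rounds, binomial_step, simulate_collection) are hypothetical rather than taken from the released code. It also assumes results are collected toward rank 0.

```c
/* Role a rank plays in one round of a binomial-tree collection toward rank 0. */
enum role { ROLE_IDLE, ROLE_SEND, ROLE_RECV };

/* Number of rounds needed: the smallest r with 2^r >= nprocs. */
int binomial_rounds(int nprocs)
{
    int rounds = 0;
    while ((1 << rounds) < nprocs)
        rounds++;
    return rounds;
}

/*
 * Decide what `rank` does in `round` (0, 1, 2, ...). Ranks are paired at
 * distance 2^round. The partner rank is written to *partner, or -1 if idle.
 */
enum role binomial_step(int rank, int nprocs, int round, int *partner)
{
    int dist = 1 << round;

    *partner = -1;
    if (rank % dist != 0)
        return ROLE_IDLE;              /* handed off its data in an earlier round */
    if (rank % (2 * dist) == dist) {
        *partner = rank - dist;        /* pass everything down the tree */
        return ROLE_SEND;
    }
    if (rank + dist < nprocs) {
        *partner = rank + dist;        /* absorb the partner's data */
        return ROLE_RECV;
    }
    return ROLE_IDLE;                  /* no partner exists this round */
}

/*
 * Simulate the collection in one process. Afterwards value[0] holds the
 * combined result. Returns how many messages rank 0 had to receive.
 */
int simulate_collection(long *value, int nprocs)
{
    int received_by_root = 0;
    int rounds = binomial_rounds(nprocs);

    for (int round = 0; round < rounds; round++) {
        for (int rank = 0; rank < nprocs; rank++) {
            int partner;
            if (binomial_step(rank, nprocs, round, &partner) == ROLE_SEND) {
                value[partner] += value[rank];   /* "send" = merge into partner */
                if (partner == 0)
                    received_by_root++;
            }
        }
    }
    return received_by_root;
}
```

With 8 ranks, rank 0 receives only 3 messages (one per round) instead of the 7 it would receive if every other rank sent to it directly. In an MPI program the merge line would correspond to a point-to-point send on one rank and a matching receive on its partner.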

The Project Name

Samudra Manthan (Devanagari: समुद्र मंथन) is the process of churning the ocean so that interesting things emerge from it. We think of our huge body of data as an ocean from which we are trying to extract interesting terms, hence the name Samudra-Manthan.

The Team Calakmul

1) Dinesh Bhirud
2) Prasad Kulkarni
3) Varada Kolhatkar

The Advisor

Dr. Ted Pedersen

The Development

We have released version 1.0 of the system, the code for which is available for download here.
At the current stage of development, we are able to get the top R interesting n-grams (up to length 6) from the Gigaword corpus (around 10 GB of text).
Most of the development for this project is done on the IBM BladeCenter Linux Cluster at the Minnesota Supercomputing Institute for Digital Simulation and Advanced Computation at the University of Minnesota, using C and MPI.

Presentation (ppt)