Samudra-Manthan
The Project
The aim of this project is to extract, from a large corpus of text, a
list of the top R terms of lengths M through N that are especially
interesting according to the TF*IDF measure. The goal is to do this as
fast as possible using multiple processors while still retaining
accuracy. We use the Gigaword corpus (a large archive of newspaper
stories) as the input to this system.
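This page does not say which TF*IDF variant the system uses. As an illustration only, one common form is assumed here, where tf(t) is the number of occurrences of term t in the corpus, df(t) is the number of documents that contain t, and D is the total number of documents:

```latex
\mathrm{tfidf}(t) = \mathrm{tf}(t) \cdot \log_2 \frac{D}{\mathrm{df}(t)}
```

Under this assumed definition, a term that occurs 40 times but in only 10 of 1000 documents scores 40 * log2(100), roughly 265.8, while a term that appears in every document scores 0 however often it occurs. This is why the measure favours terms that are frequent but concentrated in a few documents.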
Our approach to this problem makes use of data structures such as the
suffix array, LCP (longest common prefix) vectors, a class array and a
hash table. The work is distributed using a manager-worker strategy
in which one processor is responsible for all I/O and distributes data
to all the other processors. The results are collected using a binomial
tree communication pattern among the processors, so that no single
processor gets buried under a mountain of data.
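To make the first two data structures concrete, here is a minimal sketch in C of a suffix array and its LCP vector, and of how the two together give the frequency of a term. It is not the project's code: it works on characters rather than words, builds the suffix array with a naive sort that would be far too slow for Gigaword, and the names build_suffix_array, build_lcp and prefix_frequency are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text being indexed; a file-scope pointer because qsort's comparator
   takes no user argument. This makes the sketch non-reentrant. */
static const char *g_text;

/* Order two suffix start offsets by the suffixes they point to. */
static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Fill sa[0..n-1] with the start offsets of all suffixes of text,
   in lexicographic order. Naive: sorts with full string comparisons. */
void build_suffix_array(const char *text, int *sa, int n)
{
    assert(text != NULL && sa != NULL && n == (int)strlen(text));
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
}

/* lcp[i] is the length of the longest common prefix of the suffixes
   at ranks i-1 and i; lcp[0] is defined as 0. */
void build_lcp(const char *text, const int *sa, int *lcp, int n)
{
    if (n > 0)
        lcp[0] = 0;
    for (int i = 1; i < n; i++) {
        const char *p = text + sa[i - 1];
        const char *q = text + sa[i];
        int k = 0;
        while (p[k] != '\0' && p[k] == q[k])
            k++;
        lcp[i] = k;
    }
}

/* Number of occurrences of the length-k prefix of the suffix at rank r.
   All suffixes sharing that prefix are adjacent in the suffix array,
   so the run is extended in both directions while lcp >= k. */
int prefix_frequency(const int *sa, const int *lcp, int n, int r, int k)
{
    assert(r >= 0 && r < n && k > 0 && n - sa[r] >= k);
    int count = 1;
    for (int j = r; j > 0 && lcp[j] >= k; j--)
        count++;
    for (int j = r + 1; j < n && lcp[j] >= k; j++)
        count++;
    return count;
}
```

For the text "banana", the sorted suffixes start at offsets 5, 3, 1, 0, 4, 2, and calling prefix_frequency for rank 1 with k = 3 counts the two occurrences of "ana". The point of the design is that once the suffixes are sorted, every occurrence of a term sits in one contiguous block, so term frequencies can be read off the LCP vector without rescanning the text.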
The Project Name
Samudra Manthan (Devanagari: समुद्र मंथन) is the churning of the ocean,
a process that brings interesting things up out of it. We think of our
huge data set as an ocean from which we are trying to draw out
interesting terms, hence the name Samudra-Manthan.
The Team Calakmul
1) Dinesh Bhirud
2) Prasad Kulkarni
3) Varada Kolhatkar
The Advisor
Dr. Ted Pedersen
The Development
We have released version 1.0 of the system; the code is available for
download here.
At the current stage of development, we are able to get the top M
interesting n-grams (up to length 6) from the Gigaword corpus (around
10 GB of text).
Most of the development for this project is done on the IBM
BladeCenter Linux Cluster at the Minnesota Supercomputing Institute
for Digital Simulation and Advanced Computation at the University of
Minnesota, using C and MPI.
Downloads
Source
Report
Presentation (ppt)