Clustering of text documents by implementation of K-means algorithms



Abstract:

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated as a way of improving precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbours of a document. Clustering has also been proposed for browsing a collection of documents and for organizing the results returned by a search engine in response to a user’s query. Document clustering has also been used to automatically generate hierarchical clusters of documents. The automatic generation of a taxonomy of web documents, like that provided by Yahoo, is often cited as a goal. A different approach finds natural clusters in an already existing document taxonomy and then uses these clusters to produce an effective document classifier for new documents. Initially I also believed that hierarchical clustering was superior to K-means clustering for text documents. During the course of my experiments, I found that simple K-means and a variant of K-means, spherical K-means, can produce clusters of documents that are better than those produced by ‘regular’ K-means. I have also been able to find what I think is a reasonable explanation for this behaviour. I applied K-means and spherical K-means code written in MATLAB 7.7 to a waste water treatment plant dataset and to the 20 Newsgroups (Ng) dataset. I took test data of 200 documents from 20 Ng and clustered them with different numbers of clusters, k = 5, 10, 20, 25. The results obtained were efficient.
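
As a rough illustration of the spherical K-means variant named in the abstract, the following is a minimal Python sketch, not the MATLAB 7.7 code used in the experiments. It assumes documents are already represented as term-weight vectors, normalizes them to unit length, and assigns each document to the centroid with the highest cosine similarity. The function names (`normalize`, `spherical_kmeans`), the toy vectors, and the farthest-first initialization are all assumptions made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20):
    """Cluster term-weight vectors by cosine similarity.

    docs: list of equal-length lists of term weights (hypothetical input format).
    Returns (labels, centroids), where every centroid has unit length.
    """
    unit_docs = [normalize(d) for d in docs]

    # Assumed initialization: start from the first document, then repeatedly
    # add the document least similar to the centroids chosen so far.
    centroids = [list(unit_docs[0])]
    while len(centroids) < k:
        farthest = min(unit_docs, key=lambda v: max(dot(v, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [-1] * len(unit_docs)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_labels = [
            max(range(k), key=lambda j: dot(v, centroids[j])) for v in unit_docs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: sum the members of each cluster and re-normalize,
        # which keeps every centroid on the unit sphere.
        for j in range(k):
            members = [v for v, lab in zip(unit_docs, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Toy term-weight vectors: two documents about one topic, two about another.
toy_docs = [
    [3, 1, 0, 0],
    [6, 2, 0, 0],
    [0, 0, 2, 5],
    [0, 0, 1, 3],
]
labels, centroids = spherical_kmeans(toy_docs, k=2)
print(labels)  # [0, 0, 1, 1]
```

The only differences from regular K-means in this sketch are the unit-length normalization and the use of cosine similarity in place of Euclidean distance, so document length does not influence which cluster a document joins.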

 


