Leveraging TF-IDF Matrix for Document Clustering with K-Means Algorithm
DOI:
https://doi.org/10.38124/ijsrmt.v3i10.61
Keywords:
Document Clustering, TF-IDF Matrix, K-Means Algorithm, Evaluation Metrics, Text Pre-processing
Abstract
Document clustering is an important task in information retrieval: it aims to group similar documents together for efficient organization and retrieval. This paper presents a new approach to document clustering that combines the Term Frequency-Inverse Document Frequency (TF-IDF) matrix with the K-Means algorithm. The proposed system overcomes the obstacles of traditional methods by using TF-IDF matrices to represent document semantics and K-Means clustering to obtain homogeneous document clusters. Key components of the system include text pre-processing techniques such as stop-word removal, stemming, and tokenization, which improve the quality of the TF-IDF representations. Additionally, evaluation metrics such as purity, F-measure, and silhouette score are applied to evaluate the system’s clustering performance. The proposed approach shows that it is feasible to process large volumes of documents while ensuring robustness by discarding outliers and noisy data. Results obtained on a benchmark dataset demonstrate the superiority of the proposed approach over the baseline techniques. These results underline the effectiveness of the method in clustering documents efficiently and in facilitating streamlined document organization and retrieval across different domains.
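The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the stop-word list, the smoothed IDF formula, the fixed seeding of the centroids, and all function names are assumptions chosen for illustration, and stemming, outlier removal, F-measure, and silhouette score are omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF variant (an assumption; the paper does not state its formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, init_indices, max_iter=100):
    """Lloyd's algorithm, seeded with the rows at init_indices for determinism."""
    centroids = [list(rows[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = []
        for row in rows:
            dists = [sq_dist(row, c) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of documents that fall in the majority true class of their cluster."""
    total = 0
    for c in set(labels):
        classes = Counter(t for lab, t in zip(labels, truth) if lab == c)
        total += classes.most_common(1)[0][1]
    return total / len(labels)


# Toy corpus with made-up reference classes, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "stock markets fell sharply",
    "investors sold stock in falling markets",
]
truth = ["pets", "pets", "finance", "finance"]

vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, init_indices=[0, 2])
print(labels, purity(labels, truth))
```

Seeding the centroids from fixed documents keeps the sketch reproducible; a practical system would more likely use randomised or k-means++ initialisation with several restarts and keep the best run.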
License
Copyright (c) 2024 International Journal of Scientific Research and Modern Technology
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.