Leveraging TF-IDF Matrix for Document Clustering with K-Means Algorithm

Authors

  • Shilpi Kulshrestha, Department of CSE, Jaipur National University, Jaipur, India
  • Dharmesh Santani, Department of CSE, Jaipur National University, Jaipur, India

DOI:

https://doi.org/10.38124/ijsrmt.v3i10.61

Keywords:

Document Clustering, TF-IDF Matrix, K-Means Algorithm, Evaluation Metrics, Text Pre-processing

Abstract

Document clustering is an important task in information retrieval: it groups similar documents together for efficient organization and retrieval. This paper presents a new approach to document clustering that combines the Term Frequency-Inverse Document Frequency (TF-IDF) matrix with the K-Means algorithm. The proposed system addresses the limitations of traditional methods by using TF-IDF matrices to capture document semantics and K-Means clustering to produce homogeneous document clusters. Key components of the system include text pre-processing techniques such as tokenization, stop-word removal, and stemming, which improve the quality of the TF-IDF representations. Evaluation metrics such as purity, F-measure, and silhouette score are used to assess the system's clustering performance. The proposed approach shows that it is feasible to process large volumes of documents while remaining robust, by discarding outliers and noisy data. Results obtained on a benchmark dataset demonstrate the superiority of the proposed approach over baseline techniques and underline its effectiveness for efficient document clustering and for streamlined document organization and retrieval across different domains.
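
The pipeline outlined in the abstract (pre-processing, TF-IDF weighting, K-Means clustering, purity evaluation) can be illustrated with a minimal standard-library Python sketch. This is not the authors' implementation. The toy documents, the stop-word list, the plural-stripping rule standing in for a real stemmer, the smoothed IDF formula, and the fixed initial centroids are all assumptions made for illustration, and only purity is computed (F-measure and silhouette score are omitted).

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are",
              "to", "for", "as"}

def preprocess(text):
    """Tokenize, drop stop words, and apply a crude plural-stripping rule."""
    stems = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in STOP_WORDS:
            continue
        # Stand-in for a proper stemmer: strip a trailing "s" from longer words.
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        stems.append(tok)
    return stems

def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumed variant; the paper does not state its formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, init_indices, max_iter=100):
    """Lloyd's K-Means; initial centroids are the rows at init_indices."""
    centroids = [list(rows[i]) for i in init_indices]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def purity(labels, truth):
    """Fraction of documents that belong to their cluster's majority class."""
    correct = 0
    for c in set(labels):
        classes = Counter(t for lab, t in zip(labels, truth) if lab == c)
        correct += classes.most_common(1)[0][1]
    return correct / len(labels)

# Toy corpus (illustrative only, not the benchmark dataset from the paper).
docs = [
    "The cat chased the mice in the garden",
    "Cats and dogs are popular pets",
    "Stock markets fell as investors sold shares",
    "Investors watch the stock market for share prices",
]
truth = ["animals", "animals", "finance", "finance"]

vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, init_indices=(0, 2))
score = purity(labels, truth)
print(labels, score)
```

Fixing the initial centroids keeps the sketch deterministic. In practice K-Means is usually run from several random or k-means++ initializations, and a library implementation with sparse matrices would be preferable for large collections.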

Published

2024-10-19

How to Cite

Kulshrestha, S., & Santani, D. (2024). Leveraging TF-IDF Matrix for Document Clustering with K-Means Algorithm. International Journal of Scientific Research and Modern Technology, 3(10). https://doi.org/10.38124/ijsrmt.v3i10.61