2023-01-29 13:04:37
Abstract
A New Document Clustering Method
Clustering analysis, also called group analysis, originated in taxonomy and is an unsupervised learning technique. It can group a data set without prior knowledge of the relationships among its items, and it is often used as a data preprocessing step. Document clustering serves several purposes. It can act as preliminary processing for very large document collections. It can group the results returned by an online search so that users find the data they need faster. It can help discover users' current needs, which supports information filtering and active recommendation. It can also improve the final results of text categorization and support fast retrieval in library services.
Document clustering currently rests on the assumption that documents of the same type are more similar to each other than documents of different types. The field has been developed for decades, and its algorithms are commonly divided into partition-based, hierarchical, density-based, grid-based, and model-based clustering algorithms, among others.
However, shortcomings of these algorithms limit their use in text clustering. This paper applies a novel clustering algorithm, DDC, to document clustering. Compared with other clustering methods, DDC is not affected by the number of clusters and is suitable for data sets with arbitrary spatial distribution.
The core of the DDC algorithm is to determine cluster centers from the density and distance of data objects and then cluster the remaining objects around those centers. Since it was proposed, it has been widely used in band selection for hyperspectral images, community detection in hypernetworks, age estimation from facial images, and other tasks, but it has seldom been used in text clustering. In this work, we apply DDC to document clustering and compare it with other clustering algorithms to assess its advantages.
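The idea of choosing centers from density and distance can be sketched in a few lines. The abstract does not give the formulas, so this sketch assumes the common density-peaks formulation: the density of a point is the number of neighbors within a cutoff, and its distance is the distance to the nearest point of higher density. The function name `ddc_cluster` and the parameters `d_c`, `rho_min`, and `delta_min` are hypothetical and are not taken from the thesis.

```python
import math


def ddc_cluster(points, d_c, rho_min, delta_min):
    """Density-and-distance clustering sketch (assumed density-peaks formulation).

    points    : list of coordinate tuples
    d_c       : cutoff distance used to count neighbors (assumption)
    rho_min   : minimum density for a point to become a center (assumption)
    delta_min : minimum distance to a denser point for a center (assumption)
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n   # distance to the nearest point earlier in `order`
    parent = [-1] * n   # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers have both high density and a large distance to any denser point.
    # Every other point inherits the label of its nearest denser neighbor.
    labels = [-1] * n
    next_label = 0
    for rank, i in enumerate(order):
        if rank == 0 or (rho[i] >= rho_min and delta[i] >= delta_min):
            labels[i] = next_label
            next_label += 1
        else:
            labels[i] = labels[parent[i]]
    return labels
```

In this sketch the number of clusters is not passed in; it follows from how many points pass both thresholds, which is one way to read the claim that the method is not affected by the number of clusters.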
Before clustering, the collected document data set must be vectorized: a computer cannot directly use the information in raw text, so each document is converted into machine-readable data. After clustering, the results must be evaluated and the clustering quality analyzed, which is an equally important step.
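As an illustration of the vectorization step, the following minimal sketch converts tokenized documents into TF-IDF vectors, the weighting named later in the table of contents. It assumes the documents are already segmented into tokens and uses one common variant of the weight, tf × log(N / df); the thesis may use a different variant. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary.

    docs : list of token lists, e.g. [["cat", "sat"], ["cat", "ran"]]
    Returns (vocab, vectors), where vectors[i][k] is the weight of vocab[k]
    in document i.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)

    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            # Term frequency times inverse document frequency (assumed variant).
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors
```

A term that occurs in every document gets weight zero under this variant, so the vectors emphasize the terms that distinguish one document from another.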
Keywords: DDC; clustering; document clustering; text vectorization; cluster quality assessment
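Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Current State of Algorithm Research
1.2.1 K-means Algorithm
1.2.2 DBSCAN Algorithm
1.2.3 BIRCH Algorithm
1.2.4 DDC Algorithm
1.3 Research Content and Organization of the Thesis
Chapter 2 Theory and Fundamentals of Clustering Algorithms
2.1 Fundamentals of Clustering
2.1.1 Similarity Computation in Clustering
2.1.2 Clustering Metrics
2.2 Fundamentals of the K-means Algorithm
2.3 Fundamentals of the DBSCAN Algorithm
2.4 Fundamentals of the BIRCH Algorithm
2.5 Fundamentals of the DDC Algorithm
2.6 Chapter Summary
Chapter 3 Fundamentals of Document Processing
3.1 Document Processing
3.1.1 Document Word Segmentation
3.1.2 Stop-Word Filtering
3.2 Text Representation Models
3.2.1 Boolean Model
3.2.2 Vector Space Model
3.3 Example of Processing Text with TF-IDF
3.4 Chapter Summary
Chapter 4 Implementation and Comparison of Document Clustering Algorithms
4.1 Document Preprocessing
4.1.1 Fundamentals of Document Vectorization
4.1.2 Text Vectorization Process
4.2 K-means Algorithm Implementation
4.3 DBSCAN Algorithm Implementation
4.4 DDC Algorithm Implementation
4.4.2 Code for Computing Data Point Density
4.4.4 Code for Computing Similarity
4.4.5 Ryan Seghers's Method for Computing Density
4.5 DBI Evaluation
4.6 Silhouette Coefficient Evaluation
4.7 Comparison and Analysis of Experimental Results
4.7.1 Comparison of Experimental Results
4.7.2 Analysis of Experimental Results
4.8 Chapter Summary
Chapter 5 Summary and Outlook
5.1 Summary of the Work
5.1.1 Main Research Content
5.1.2 Shortcomings and Limitations
5.2 Outlook for Future Work
Acknowledgments
References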
1.1 Research Background and Significance