Problem description
I have a large set of scanned documents that I need to index. However, the documents of interest are only a small proportion of the entire package my classifier needs to identify. To get an idea of the optimum number of classes, and of how best to merge documents into a class, I wanted to run an unsupervised clustering analysis.
Which distance method would work better to capture the structural information? Also, would agglomerative hierarchical clustering be the best clustering approach for the given task? Thanks.
Recommended answer
An unsupervised clustering technique fails on scanned documents because it does not grasp the underlying structure and ends up giving nonsensical clusters. So the approach is fundamentally flawed. However, classification using deep convolutional neural networks, with sufficient data and carefully chosen distinct classes, can outperform OCR techniques if the documents have a distinct structure.
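For anyone who wants to check this on their own data, here is a minimal sketch of the agglomerative (average-linkage) clustering the question asks about. It assumes each page has already been reduced to a small flattened grayscale thumbnail and uses plain Euclidean distance; the `pages` values and the function names are hypothetical, chosen only for illustration. On real scans, pixel-level distances like this are exactly what tends to produce meaningless clusters, so treat it as a way to inspect the result on a small labeled sample rather than as a recommended pipeline.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs (x, y) with x < y
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters


# Hypothetical toy data: four "pages" as flattened 2x2 grayscale thumbnails.
pages = [
    (0.9, 0.9, 0.1, 0.1),
    (0.8, 1.0, 0.0, 0.2),
    (0.1, 0.1, 0.9, 0.9),
    (0.2, 0.0, 1.0, 0.8),
]

clusters = agglomerative(pages, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Comparing the resulting groups against a handful of pages you have labeled by hand is a quick way to see whether the clusters carry any structural meaning before investing in a supervised classifier.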