Problem description
I have a collection of documents, each of which grows rapidly over time. The task is to find similar documents at any fixed point in time. I have two potential approaches:
- A vector embedding (word2vec, GloVe or fastText): average the word vectors in a document and compare documents with cosine similarity.
- Bag-of-words: tf-idf or one of its variations, such as BM25.
Will one of these yield significantly better results? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec for document similarity?
Is there another approach that allows a document's vector to be refined dynamically as more text is added?
Recommended answer
- doc2vec or word2vec?
According to this article, doc2vec (also called paragraph2vec) performs poorly on short documents. [Learning Semantic Similarity for Very Short Texts, 2015, IEEE]
- Short documents...?
If you want to compare the similarity between short documents, you might want to vectorize each document via word2vec.
- How to construct the vector?
For example, you can construct a document vector as a weighted average of its word vectors, using tf-idf scores as the weights.
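As a rough illustration of the tf-idf weighted average, here is a minimal pure-Python sketch. The `EMBEDDINGS` table is a made-up toy stand-in for a trained word2vec model, the smoothed idf formula is one common variant chosen for the example, and the helper names (`idf_table`, `doc_vector`, `cosine`) are hypothetical rather than taken from any library.

```python
import math
from collections import Counter

# Toy 3-dimensional "embeddings": illustrative values only, not from a real model.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
    "the": [0.3, 0.3, 0.3],
}


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def doc_vector(doc, idf, embeddings):
    """tf-idf weighted average of the word vectors of the tokens in `doc`."""
    tf = Counter(tok for tok in doc if tok in embeddings)  # skip unknown words
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        weight = count * idf.get(tok, 1.0)
        weight_sum += weight
        for i, x in enumerate(embeddings[tok]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


docs = [
    ["the", "cat", "the", "dog"],
    ["the", "dog"],
    ["the", "stock", "market"],
]
idf = idf_table(docs)
vectors = [doc_vector(d, idf, EMBEDDINGS) for d in docs]

print(cosine(vectors[0], vectors[1]))  # pet document vs. pet document
print(cosine(vectors[0], vectors[2]))  # pet document vs. finance document
```

With a real model you would replace `EMBEDDINGS` with lookups into the trained vectors. The weighting means frequent but uninformative words such as "the" pull the document vector around less than rarer words do.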
- Similarity measure
In addition, I recommend using TS-SS rather than cosine similarity or Euclidean distance as the similarity measure.
Please refer to the following article, or to the summary in the GitHub repository below: "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
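For readers who want a feel for TS-SS before reading the paper, here is a minimal sketch based on my reading of its definition: the triangle area (TS) between the two vectors multiplied by a sector area (SS) built from their Euclidean distance and magnitude difference, with 10 degrees added to the angle so that it is never zero. The function names are my own and the details should be checked against the paper and the repository above. Note that under this definition a smaller value means the vectors are more alike.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def theta_degrees(a, b):
    """Angle between a and b in degrees, plus the 10 degree offset."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        raise ValueError("TS-SS is undefined for zero vectors")
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return math.degrees(math.acos(cos)) + 10.0


def triangle_area(a, b):
    """TS: area of the triangle spanned by a and b at the offset angle."""
    return norm(a) * norm(b) * math.sin(math.radians(theta_degrees(a, b))) / 2.0


def sector_area(a, b):
    """SS: circular sector with radius (Euclidean distance + magnitude difference)."""
    radius = euclidean(a, b) + abs(norm(a) - norm(b))
    return math.pi * radius ** 2 * theta_degrees(a, b) / 360.0


def ts_ss(a, b):
    """TS-SS value; lower means more similar under this definition."""
    return triangle_area(a, b) * sector_area(a, b)


print(ts_ss([1.0, 0.0], [1.0, 0.0]))  # identical vectors
print(ts_ss([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(ts_ss([1.0, 0.0], [0.0, 2.0]))  # different direction and length
```

The second call shows the main difference from cosine similarity: two vectors pointing the same way but with different magnitudes are not treated as identical.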
Thanks