This article covers how to compute the cosine distance on a term-document matrix in R using the tm and proxy packages.

Question

I want to calculate the cosine distance among authors of a corpus. Let's take a corpus of 20 documents.

require(tm)
data("crude")
length(crude)
# [1] 20

I want to find out the cosine distance (similarity) among these 20 documents. I create a term-document matrix with

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

then I have to convert it to a matrix to pass it to dist() of the proxy package

tdm <- as.matrix(tdm)
require(proxy)
cosine_dist_mat <- as.matrix(dist(t(tdm), method = "cosine"))

Finally I remove the diagonal of my cosine distance matrix (since I am not interested in the distance between a document and itself) and compute the average distance between each document and the other 19 documents of the corpus

diag(cosine_dist_mat) <- NA
cosine_dist <- apply(cosine_dist_mat, 2, mean, na.rm=TRUE)

cosine_dist
# 127       144       191       194
# 0.6728505 0.6788326 0.7808791 0.8003223
# 211       236       237       242
# 0.8218699 0.6702084 0.8752164 0.7553570
# 246       248       273       349
# 0.8205872 0.6495110 0.7064158 0.7494145
# 352       353       368       489
# 0.6972964 0.7134836 0.8352642 0.7214411
# 502       543       704       708
# 0.7294907 0.7170188 0.8522494 0.8726240
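As an aside, because as.matrix() on a dist object fills the diagonal with exact zeros, the same per-document averages can be obtained without the NA trick by dividing the column sums by n - 1. A base-R sketch on a toy symmetric distance matrix (not the crude data):

```r
# Toy symmetric "distance" matrix with exact zeros on the diagonal,
# mimicking the shape of cosine_dist_mat above.
set.seed(1)
m <- matrix(runif(16), 4, 4)
m <- (m + t(m)) / 2   # make it symmetric
diag(m) <- 0          # zero self-distances, as as.matrix(dist(...)) produces

# Vectorized: column sums divided by n - 1 (self-distance contributes 0)
via_colsums <- colSums(m) / (ncol(m) - 1)

# The apply()/NA approach from the question, for comparison
diag(m) <- NA
via_apply <- apply(m, 2, mean, na.rm = TRUE)

stopifnot(all.equal(via_colsums, via_apply))
```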

So far so good (with small corpora). The problem is that this method doesn't scale well to larger corpora of documents. For one thing, it seems inefficient because of the two calls to as.matrix(): one to pass the tdm from tm to proxy, and one to turn the dist object back into a matrix so the averages can be computed.

Is it possible to conceive a smarter way to obtain the same result?

Answer

Since tm's term-document matrices are just sparse "simple triplet matrices" from the slam package, you could use the functions there to calculate the distances directly from the definition of cosine similarity:

library(slam)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm) /
                       (sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
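As a quick sanity check (base R only, on a small dense matrix rather than a real tdm), the crossprod-based formula above reproduces the textbook definition of cosine distance computed pair by pair:

```r
# Small dense term-document matrix: 4 terms x 3 documents
m <- matrix(c(1, 0, 2,
              0, 3, 1,
              2, 1, 0,
              1, 1, 1), nrow = 4, byrow = TRUE)

# Cosine distance via the crossprod formula (dense analogue of the slam code)
norms    <- sqrt(colSums(m^2))
cos_dist <- 1 - crossprod(m) / (norms %*% t(norms))

# Textbook definition for one pair of documents
a <- m[, 1]; b <- m[, 2]
manual <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

stopifnot(all.equal(cos_dist[1, 2], manual))
```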

This takes advantage of sparse matrix multiplication. In my hands, a tdm with 2963 terms in 220 documents and 97% sparsity took barely a couple of seconds.

I haven't profiled this, so I have no idea if it's any faster than proxy::dist().
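If you want to profile it yourself, here is a rough timing sketch on the crude corpus (assuming tm, proxy and slam are installed; on 20 documents both routes are fast, so the gap only shows on larger corpora):

```r
library(tm)
library(proxy)
library(slam)

data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# Dense route from the question: coerce, then proxy::dist()
t_dense <- system.time({
  m  <- as.matrix(tdm)
  d1 <- as.matrix(dist(t(m), method = "cosine"))
})

# Sparse route: stay in slam's simple triplet format
t_sparse <- system.time({
  d2 <- 1 - crossprod_simple_triplet_matrix(tdm) /
            (sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
})

# The two matrices should agree up to floating-point error
max(abs(d1 - d2))
```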

NOTE: for this to work, you should not coerce the tdm into a regular matrix, i.e., don't do tdm <- as.matrix(tdm).
