问题描述
要计算两个文档之间的相似度,我创建了一个包含术语频率的特征向量.但是,接下来,我无法在"余弦相似度"和"汉明距离.
To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".
我的问题:您是否有使用这些算法的经验?哪一个给您更好的结果?
My question: Do you have experience with these algorithms? Which one gives you better results?
除此之外:您能告诉我如何在PHP中编写Cosine相似性吗?对于汉明距离,我已经有了代码:
In addition to that: Could you tell me how to code the Cosine similarity in PHP? For Hamming distance, I've already got the code:
function check ($terms1, $terms2) {
$counts1 = array_count_values($terms1);
$totalScore = 0;
foreach ($terms2 as $term) {
if (isset($counts1[$term])) $totalScore += $counts1[$term];
}
return $totalScore * 500 / (count($terms1) * count($terms2));
}
我不想使用任何其他算法.我只想在两者之间做出决定.
I don't want to use any other algorithm. I would only like to have help to decide between both.
也许有人可以对如何改进算法说些什么.如果您过滤掉停用词或常用词,您会得到更好的结果吗?
And maybe someone can say something to how to improve the algorithms. Will you get better results if you filter out the stop words or common words?
希望您能帮助我.预先感谢!
I hope you can help me. Thanks in advance!
推荐答案
汉明距离应在两个等长字符串之间进行,并考虑顺序.
A Hamming distance should be done between two strings of equal length and with the order taken into account.
由于您的文档肯定有不同的长度,并且如果不考虑位置"一词,那么余弦相似度会更好(请注意,根据您的需要,存在更好的解决方案). :)
As your documents are certainly of different length and if the words places do not count, cosine similarity is better (please note that depending your needs, better solutions exist). :)
这是2个单词数组的余弦相似度函数:
Here is a cosine similarity function of 2 arrays of words:
function cosineSimilarity($tokensA, $tokensB)
{
$a = $b = $c = 0;
$uniqueTokensA = $uniqueTokensB = array();
$uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
foreach ($uniqueMergedTokens as $token) {
$x = isset($uniqueTokensA[$token]) ? 1 : 0;
$y = isset($uniqueTokensB[$token]) ? 1 : 0;
$a += $x * $y;
$b += $x;
$c += $y;
}
return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
速度快(在大型阵列上,isset()
而不是in_array()
是致命的杀手).
It is fast (isset()
instead of in_array()
is a killer on large arrays).
如您所见,结果未考虑每个单词的大小".
As you can see, the results does not take into account the "magnitude" of each the word.
我用它来检测几乎"复制粘贴文本的多条消息.它运作良好. :)
I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)
有关字符串相似性指标的最佳链接: http://www.dcs.shef.ac.uk/~sam/stringmetrics .html
更多有趣的读物:
http://www.miislita.com/information- extracting-tutorial/cosine-similarity-tutorial.html http://bioinformatics.oxfordjournals.org/cgi/content/full/22 /18/2298
这篇关于余弦相似度与汉明距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!