Problem description
I want to compute how similar two arbitrary sentences are to each other. For example:
- A mathematician found a solution to the problem.
- The problem was solved by a young mathematician.
I can use a tagger, a stemmer, and a parser, but I don't know how to detect that these sentences are similar.
Recommended answer
These two sentences are not just similar, they are almost paraphrases, i.e., two alternative ways of expressing the same meaning. It is also a very simple case of paraphrase, in which both utterances use the same words, the only exception being that one is in active form while the other is passive. (The two sentences are not exactly paraphrases because in the second sentence the mathematician is "young". This additional information makes the semantic relation between the two sentences non-symmetric. In such cases, you would say that the second utterance "entails" the first one, or in other words that the first can be inferred from the second.)
From the example it is not possible to tell whether you are actually interested in paraphrase detection, textual entailment, or in sentence similarity in general, which is an even broader and fuzzier problem. For example, is "people eat food" more similar to "people eat bread" or to "men eat food"?
Both paraphrase detection and text similarity are complex, open research problems in Natural Language Processing, with a large and active community of researchers working on them. It is not clear how far your interest in this topic goes, but consider that even though many brilliant researchers have spent, and still spend, their whole careers trying to crack it, we are still very far from finding sound solutions that just work in general.
Unless you are interested in a very superficial solution that would only work in specific cases and would not capture syntactic alternation (as in this case), I would suggest that you look into the problem of text similarity in more depth. A good starting point would be the book "Foundations of Statistical Natural Language Processing", which provides a very well-organised presentation of most statistical natural language processing topics. Once you have clarified your requirements (e.g., under what conditions is your method supposed to work? what levels of precision/recall are you after? what kinds of phenomena can you safely ignore, and which ones do you need to account for?) you can start looking into specific approaches by diving into recent research work. Here, a good place to start would be the online archives of the Association for Computational Linguistics (ACL), which is the publisher of most research results in the field.
Just to give you something practical to work with, a very rough baseline for sentence similarity would be the cosine similarity between two binary vectors representing the sentences as bags of words. A bag of words is a very simplified representation of text, commonly used for information retrieval, in which you completely disregard syntax and represent a sentence only as a vector whose size is the size of the vocabulary (i.e., the number of words in the language) and whose component "i" is valued "1" if the word at position "i" in the vocabulary appears in the sentence, and "0" otherwise.