Problem description
I am trying to identify near-duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and soundex in R. However, several names are synonyms of each other, and I would like to cluster the names on this criterion along with the ones above.
I want to do as suggested in "Techniques for finding near duplicate records", but with synonyms as well. I understand there is a database of synonyms for English words called WordNet, in which sets of synonyms are called synsets. But the entries in the names field are in different formats and languages.
For example, if I know that "R version 3.0.3" and "Warm Puppy" are synonyms, I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.
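One possible way to combine custom synsets with fuzzy matching is sketched below, using only base R. It is an illustration under assumptions rather than an established recipe: the sample names, the `synsets` list, and the distance threshold `h = 2` are all hypothetical. The idea is to compute an edit-distance matrix, force the distance between members of the same synset to zero, and then cluster with single linkage so that near duplicates of either synonym end up in the same cluster.

```r
# Illustrative names; not taken from any real database.
names_raw <- c("R version 3.0.3", "Warm Puppy", "warm pupy",
               "R version 3.0.2", "Python 2.7")

# Hypothetical custom synsets: each element lists names treated as equivalent.
synsets <- list(syn1 = c("R version 3.0.3", "Warm Puppy"))

# Compare names case-insensitively.
key <- tolower(names_raw)

# Generalized Levenshtein (edit) distance between all pairs of names.
d <- adist(key)

# Treat members of the same synset as identical by zeroing their distance.
for (s in synsets) {
  idx <- which(key %in% tolower(s))
  d[idx, idx] <- 0
}

# Single linkage chains near duplicates through their synonyms.
hc <- hclust(as.dist(d), method = "single")

# Cut the tree at an assumed threshold of 2 edits.
clusters <- cutree(hc, h = 2)

result <- data.frame(name = names_raw, cluster = clusters)
print(result)
```

Single linkage is chosen here because it lets a misspelling of one synonym join the cluster of the other synonym. The trade-off is that it can also chain unrelated names together when the threshold is too loose, so the cut height would need tuning on real data.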
Down the road, I would also like to separate homonyms within clusters based on entries in other fields (columns) of a record.
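A rough sketch of how homonyms might be separated is to cluster on the name first and then split each name cluster by the value of another column. The data frame, the `field` column, and the threshold below are hypothetical and only meant to illustrate the idea in base R.

```r
# Illustrative records: the same name refers to different things
# depending on a second (hypothetical) column.
records <- data.frame(
  name  = c("Mercury", "mercury", "Mercury", "Mercurry"),
  field = c("planet", "planet", "element", "element"),
  stringsAsFactors = FALSE
)

# Step 1: cluster on the name alone using edit distance.
d <- adist(tolower(records$name))
records$name_cluster <- cutree(hclust(as.dist(d), method = "single"), h = 2)

# Step 2: split each name cluster by the other field, so homonyms
# that differ in that field receive different final cluster ids.
combined <- paste(records$name_cluster, records$field, sep = "|")
records$final_cluster <- match(combined, unique(combined))

print(records)
```

Splitting on an exact match of the second column is the simplest choice. If that column is itself noisy, it would need its own fuzzy matching before being used this way.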
Is there any method to implement this in R?
Recommended answer
Crops, this is not an answer, but it might help you or others who answer.
As I assume you know, the tm package allows custom stop words, but I can't recall it offering a custom vector of synonyms as in your Warm Puppy example. That would be very useful.
Second, Tyler Rinker's qdap package has lots of capabilities and might have (or he might create) such a synonym capability.
Third, the RTextTools package amalgamates many packages and functions. The team behind it may help.
It would be very useful to have a synonym-vector capability for what I do. Good luck and I will check back.