I am trying to identify near-duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and Soundex in R. However, several of the names are synonyms of each other, and I would like to cluster the names on that criterion as well as the ones above.
I want to do what is suggested in "Techniques for finding near duplicate records", but with synonyms. I understand there is a database of synonyms for English words called WordNet, which groups synonyms into sets called synsets. But the entries in the "names" field are in different formats and languages.
For example, suppose I know that "R version 3.0.3" and "Warm Puppy" are synonyms. I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.
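To make the idea concrete, here is a minimal sketch of one way custom synsets could feed into clustering, using base R only. It assumes each synset is collapsed to its first member before fuzzy matching. The names synsets, lookup, canonicalize, and names_raw, the extra sample names, and the edit-distance threshold of 2 are all illustrative assumptions, not part of any package.

```r
# Hypothetical custom synsets: each element lists names treated as equivalent.
synsets <- list(
  syn1 = c("R version 3.0.3", "Warm Puppy")
)

# Lookup from every synonym (lowercased) to the first member of its synset.
canon_per_entry <- rep(vapply(synsets, function(s) s[1], character(1)),
                       lengths(synsets))
lookup <- setNames(canon_per_entry,
                   tolower(unlist(synsets, use.names = FALSE)))

# Replace known synonyms with their canonical form; leave other names as-is.
canonicalize <- function(x) {
  key <- tolower(trimws(x))
  hit <- key %in% names(lookup)
  out <- x
  out[hit] <- unname(lookup[key[hit]])
  out
}

# Made-up sample data for illustration only.
names_raw <- c("R version 3.0.3", "Warm Puppy",
               "R version 3.0.2", "Frisbee Sailing")
canon <- canonicalize(names_raw)

# Fuzzy clustering on the canonical forms: edit distance + single linkage.
d  <- adist(canon)                      # generalized Levenshtein distance
hc <- hclust(as.dist(d), method = "single")
cluster <- cutree(hc, h = 2)            # threshold of 2 edits is arbitrary

result <- data.frame(name = names_raw, canonical = canon, cluster = cluster)
print(result)
```

Canonicalizing first means the synonym step and the fuzzy step stay independent, so the distance measure could be swapped (for Soundex codes, say) without touching the synset lookup.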
Down the road, I would also like to separate homonyms within a cluster based on the entries in other fields (columns) of a record.
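As a rough sketch of the homonym idea, one option might be to refine an existing cluster id by a second column, so that identical names with different values in that column end up in different groups. The data frame records, its category column, and the sample values below are hypothetical and only there for illustration.

```r
# Made-up records: same name cluster, but a second field disambiguates them.
records <- data.frame(
  name     = c("Jaguar", "Jaguar", "jaguar"),
  category = c("animal", "car", "animal"),   # hypothetical extra field
  cluster  = c(1, 1, 1),                     # cluster id from name matching
  stringsAsFactors = FALSE
)

# Split each name cluster by the extra field: one id per (cluster, category).
records$refined <- as.integer(
  interaction(records$cluster, records$category, drop = TRUE)
)

print(records)
```

This treats any difference in the other field as evidence of a homonym, which is probably too strict for real data; the comparison on that field could itself be fuzzy.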
Is there any method to implement this in R?
Crops, this is not an answer, but it might help you or others who do answer.
As I assume you know, the tm package allows custom stop words, but I can't recall it supporting a custom vector of synonyms like your Warm Puppy example. That would be very useful.
Second, Tyler Rinker's qdap package has lots of capabilities and might have (or he might create) such a synonym capability.
Third, the RTextTools package amalgamates many packages and functions. The team behind it may help.
It would be very useful to have a synonym-vector capability for what I do. Good luck and I will check back.