I am looking to standardize a set of manually entered strings, so that:
index fruit
1 Apple Pie
2 Apple Pie.
3 Apple. Pie
4 Apple Pie
5 Pear
should look like:
index fruit
1 Apple Pie
2 Apple Pie
3 Apple Pie
4 Apple Pie
5 Pear
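For my use case, grouping the strings by phonetic sound is good enough, but I am missing the piece that replaces the less common strings in each group with the most common one.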
library(tidyverse)
library(stringdist)
index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")
df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)
Best answer
It sounds like you need group_by: group by the phonetic code, then pick the most frequent item (the mode) within each group.
df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))
# A tibble: 5 x 3
# Groups: grouping [2]
index fruit grouping
<dbl> <fctr> <chr>
1 1 Apple Pie A141
2 2 Apple Pie A141
3 3 Apple Pie A141
4 4 Apple Pie A141
5 5 Pear P600
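For reference, here is a self-contained sketch of the same approach that finishes the pipeline from the question. The helper name most_common is hypothetical (it is not part of dplyr or stringdist), and stringsAsFactors = FALSE is an assumption made so that fruit stays a character column. On ties, which.max() keeps the first name in the sorted order produced by table(), so the winner among equally frequent variants is alphabetical rather than meaningful.

```r
library(dplyr)
library(stringdist)

# Hypothetical helper: return the most frequent value of a vector.
# On ties, which.max() keeps the first name in table()'s sorted order.
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

df <- data.frame(
  index = 1:5,
  fruit = c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear"),
  stringsAsFactors = FALSE  # assumption: keep fruit as character, not factor
)

result <- df %>%
  mutate(grouping = phonetic(fruit)) %>%  # Soundex is phonetic()'s default method
  group_by(grouping) %>%
  mutate(fruit = most_common(fruit)) %>%  # length-1 result is recycled per group
  ungroup() %>%                           # drop grouping before returning
  select(index, fruit)

print(result)
```

With the sample data, every Apple variant collapses to "Apple Pie" and "Pear" is left unchanged. The ungroup() call is there so that later steps do not silently operate per group.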
Regarding "r - R: replace strings with the most common variant", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56651367/