I am trying to standardize a set of manually entered strings, so that:

index   fruit
1   Apple Pie
2   Apple Pie.
3   Apple. Pie
4   Apple Pie
5   Pear

should look like:
index   fruit
1   Apple Pie
2   Apple Pie
3   Apple Pie
4   Apple Pie
5   Pear

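For my use case, grouping the strings by phonetic sound works well, but I can't find anything on how to replace the least common strings in each group with the most common one.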
library(tidyverse)
library(stringdist)

index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)

Best answer

It sounds like you need to group_by the grouping column and then pick the most frequent item (the mode) within each group:

df %>%
  mutate(grouping = phonetic(fruit)) %>%
  group_by(grouping) %>%
  mutate(fruit = names(which.max(table(fruit))))

# A tibble: 5 x 3
# Groups:   grouping [2]
  index     fruit grouping
  <dbl>    <fctr>    <chr>
1     1 Apple Pie     A141
2     2 Apple Pie     A141
3     3 Apple Pie     A141
4     4 Apple Pie     A141
5     5      Pear     P600
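
For readers who want to see the "replace with the group mode" step in isolation, here is a minimal base-R sketch. It is not part of the original answer: most_common is a hypothetical helper name, and the Soundex codes are hard-coded from the output above instead of being computed with phonetic(), so the snippet runs without stringdist or dplyr.

```r
# Hypothetical helper (not from the original answer): returns the most
# frequent value in x; ties go to the value that appears first in x.
most_common <- function(x) {
  x <- as.character(x)
  # Fix the level order to first appearance so which.max breaks ties that way.
  counts <- table(factor(x, levels = unique(x)))
  names(counts)[which.max(counts)]
}

fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

# Assumption: codes copied from the answer's printed output; with stringdist
# installed they would come from phonetic(fruit) instead.
grouping <- c("A141", "A141", "A141", "A141", "P600")

# ave() calls most_common once per group and writes the single result
# back to every position belonging to that group.
standardized <- ave(fruit, grouping, FUN = most_common)
print(standardized)
```

One design difference worth noting: names(which.max(table(fruit))) in the answer breaks ties alphabetically, because table() sorts its levels, whereas this sketch keeps whichever tied variant was entered first. Either rule is a judgment call for hand-entered data.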

Regarding "r - R: Replace strings with the most common variant", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56651367/
