I have a very large table (1,000,000 x 20) to process, and it needs to run quickly.

For example, my table has two columns, X2 and X3:


    X1  X2                                          X3
c1  1   100020003001, 100020003002, 100020003003    100020003001, 100020003002, 100020003004
c2  2   100020003001, 100020004002, 100020004003    100020003001, 100020004007, 100020004009
c3  3   100050006003, 100050006001, 100050006001    100050006011, 100050006013, 100050006021

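Now I want to create 2 new columns containing:

1) the common words (identical numbers) shared by X2 and X3

For example: [1] "100020003001" "100020003002"

2) the count of those common words

For example: [1] 2

I tried the approach from the thread below, but because I applied it row by row in a for loop, processing is very slow: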

Count common words in two strings
 library(stringi)
 Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))

Thanks for your help!
I'm really struggling here...

Best answer

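We can split the 'X2' and 'X3' columns on ", " with strsplit, use map2 to apply intersect to the corresponding list elements, and then use lengths to count the number of elements in each entry of the resulting list.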

library(tidyverse)
df1 %>%
   mutate(common_words = map2(strsplit(X2, ", "),
                              strsplit(X3, ", "),
                                   intersect),
          count = lengths(common_words))
# X1                                       X2                                       X3
#1  1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
#2  2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
#3  3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
#                common_words count
#1 100020003001, 100020003002     2
#2               100020003001     1
#3                                0
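
Note that common_words above is a list column, and that intersect drops duplicates within a row. If a plain character column is easier to handle (for example when writing the result to CSV), each list element can be collapsed into a single string. The following is a minimal base R sketch on toy data; the column name `common_str` is a hypothetical name chosen for illustration and does not appear in the original answer.

```r
# Toy data with the same shape as the question's X2 and X3 columns
df1 <- data.frame(
  X2 = c("100020003001, 100020003002, 100020003003",
         "100050006003, 100050006001, 100050006001"),
  X3 = c("100020003001, 100020003002, 100020003004",
         "100050006011, 100050006013, 100050006021"),
  stringsAsFactors = FALSE
)

# Row-wise intersection of the split values (a list, one element per row)
common <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))

# Collapse each list element into one string; rows with no match become ""
df1$common_str <- vapply(common, paste, character(1), collapse = ", ")
df1$count <- lengths(common)
```

Keeping the list column is usually more convenient for further set operations; the collapsed string is mainly useful for display or export.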

Or using base R
df1$common_words <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))
df1$count <- lengths(df1$common_words)
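
With a million rows, the splitting step is likely to take a noticeable share of the time. Because the separator is a literal string, passing fixed = TRUE to strsplit skips regular-expression matching, which is usually faster. This is only a sketch under that assumption and has not been benchmarked on the asker's data; `common_in_rows` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: row-wise common items of two delimited string vectors
common_in_rows <- function(a, b, sep = ", ") {
  # fixed = TRUE treats sep as a literal string rather than a regex
  Map(intersect,
      strsplit(a, sep, fixed = TRUE),
      strsplit(b, sep, fixed = TRUE))
}

x2 <- c("100020003001, 100020003002, 100020003003",
        "100020003001, 100020004002, 100020004003")
x3 <- c("100020003001, 100020003002, 100020003004",
        "100020003001, 100020004007, 100020004009")

res <- common_in_rows(x2, x3)  # list of common values per row
n_common <- lengths(res)       # number of common values per row
```

On the real table this could be called as common_in_rows(df1$X2, df1$X3), with the result assigned to df1$common_words.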

Data
df1 <- structure(list(X1 = 1:3,
  X2 = c("100020003001, 100020003002, 100020003003",
         "100020003001, 100020004002, 100020004003",
         "100050006003, 100050006001, 100050006001"),
  X3 = c("100020003001, 100020003002, 100020003004",
         "100020003001, 100020004007, 100020004009",
         "100050006011, 100050006013, 100050006021")),
  class = "data.frame", row.names = c("c1", "c2", "c3"))

Source: the Stack Overflow question "r - R: How to quickly select the common words or identical numbers in 2 columns of a large table?": https://stackoverflow.com/questions/52142521/
