Question
I need help figuring out how to split strings in a column of a data frame based on the last delimiter when I have varying numbers of the same delimiter in R. For example,
col1 <- c('a', 'b', 'c')
col2 <- c('a_b', 'a_b_c', 'a_b_c_d')
df <- data.frame(cbind(col1, col2))
And I would like to split df$col2 to have a data frame that looks like:
col1 <- c('a', 'b', 'c')
col2 <- c('a', 'a_b', 'a_b_c')
col3 <- c('b', 'c', 'd')
Recommended answer
These use no packages. They assume that each element of col2 has at least one underscore. (See the note at the end if this restriction needs to be lifted.)
1) The first regular expression, (.*)_.*, matches everything up to the last underscore with (.*), then the underscore itself, then everything remaining with the trailing .*; the first sub replaces the entire match with the part captured inside the parentheses. This works because such matches are greedy, so the first .* takes as much as it can while still leaving a final underscore and whatever follows it for the rest of the pattern. The second regular expression, .*_, matches everything up to and including the last underscore, and the second sub replaces that with the empty string.
transform(df, col2 = sub("(.*)_.*", "\\1", col2), col3 = sub(".*_", "", col2))
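To see what each sub call contributes, here is a small sketch that applies the two patterns to a bare character vector rather than a data frame. The names x, head_part and tail_part are only illustrative and do not come from the question.

```r
# Sample strings taken from df$col2 in the question
x <- c("a_b", "a_b_c", "a_b_c_d")

# Greedy (.*) captures everything before the LAST underscore
head_part <- sub("(.*)_.*", "\\1", x)

# Greedy .*_ consumes everything up to and including the last underscore
tail_part <- sub(".*_", "", x)

print(head_part)  # "a"   "a_b"   "a_b_c"
print(tail_part)  # "b"   "c"     "d"
```

Note that transform evaluates all of its arguments against the original data frame, so in the one-liner above the col3 expression still sees the unmodified col2.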
2) Here is a variation that is a bit more symmetric. It uses the same regular expression for both sub calls and selects the first or second capture group in the replacement.
pat <- "(.*)_(.*)"
transform(df, col2 = sub(pat, "\\1", col2), col3 = sub(pat, "\\2", col2))
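As a quick end-to-end check, the following sketch rebuilds the example data frame and runs the symmetric variant on it. It assumes character columns (stringsAsFactors = FALSE is set explicitly so the result is the same on older R versions); the name out is only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(col1 = c("a", "b", "c"),
                 col2 = c("a_b", "a_b_c", "a_b_c_d"),
                 stringsAsFactors = FALSE)

# One pattern, two capture groups: before and after the last underscore
pat <- "(.*)_(.*)"
out <- transform(df,
                 col2 = sub(pat, "\\1", col2),
                 col3 = sub(pat, "\\2", col2))

print(out)
```

The printed data frame should have col2 holding the part before the last underscore and a new col3 holding the part after it, matching the layout asked for in the question.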
Note: If we did want to handle strings with no underscore at all, so that "xyz" is split into "xyz" and "", then use the following for the second sub. It tries to match the left-hand side of the | first; if that fails (which happens when there are no underscores), the entire string matches the right-hand side, and sub replaces it with the empty string. The first sub needs no change, because sub returns the input unchanged when the pattern does not match.
sub(".*_|^[^_]*$", "", col2)
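A minimal sketch of how the two calls behave together on input that mixes strings with and without underscores. The vector y and the result names are made up for illustration.

```r
# Mixed input: the first element has no underscore at all
y <- c("xyz", "a_b", "a_b_c")

# No match on "xyz", so sub returns it unchanged
left <- sub("(.*)_.*", "\\1", y)

# The alternation falls back to ^[^_]*$ when there is no underscore,
# so the whole of "xyz" is replaced by the empty string
right <- sub(".*_|^[^_]*$", "", y)

print(left)   # "xyz" "a"   "a_b"
print(right)  # ""    "b"   "c"
```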