Splitting a string at the last delimiter in R

Problem description

I need help figuring out how to split strings in a column of a data frame based on the last delimiter when I have varying numbers of the same delimiter in R. For example,

col1 <- c('a', 'b', 'c')
col2 <- c('a_b', 'a_b_c', 'a_b_c_d')
df <- data.frame(col1, col2)

And I would like to split df$col2 to have a data frame that looks like:

col1 <- c('a', 'b', 'c')
col2 <- c('a', 'a_b', 'a_b_c')
col3 <- c('b', 'c', 'd')


Recommended answer

These use no packages. They assume that each element of col2 contains at least one underscore. (See the note at the end if this restriction needs to be lifted.)

1) The first regular expression, (.*)_.*, captures everything up to the last underscore in the parenthesized group, then matches the underscore and everything that remains. The first sub replaces the entire match with the captured group. This works because .* is greedy: the first .* takes as much as it can, leaving only the text after the last underscore for the second .* . The second regular expression, .*_, matches everything up to and including the last underscore, and the second sub replaces that with the empty string. Both expressions see the original col2, because transform evaluates its arguments against the input data frame.

transform(df, col2 = sub("(.*)_.*", "\\1", col2), col3 = sub(".*_", "", col2))
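As a quick check of approach 1, the following minimal sketch rebuilds the question's data frame and applies the two sub calls. The name res and the stringsAsFactors = FALSE argument are additions made here for illustration (the latter only matters on R versions before 4.0); they are not part of the original answer.

```r
# Rebuild the data frame from the question.
# stringsAsFactors = FALSE is an added precaution for R < 4.0.
df <- data.frame(col1 = c("a", "b", "c"),
                 col2 = c("a_b", "a_b_c", "a_b_c_d"),
                 stringsAsFactors = FALSE)

# Greedy (.*) grabs everything before the LAST underscore;
# ".*_" removes everything up to and including the last underscore.
res <- transform(df,
                 col2 = sub("(.*)_.*", "\\1", col2),
                 col3 = sub(".*_", "", col2))

print(res)
# col2 becomes "a", "a_b", "a_b_c" and col3 becomes "b", "c", "d"
```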

2) Here is a variation that is a bit more symmetric. It uses the same regular expression for both sub calls.

pat <- "(.*)_(.*)"
transform(df, col2 = sub(pat, "\\1", col2), col3 = sub(pat, "\\2", col2))
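If the same split is needed in several places, the symmetric pattern can be wrapped in a small function. The sketch below is one possible way to do that; split_last and the column names head and tail are hypothetical names chosen here, not something from base R or the original answer, and the sketch still assumes every element contains at least one underscore.

```r
# split_last() is a hypothetical helper, not part of base R.
# It applies the same two-group pattern twice, keeping group 1 and group 2.
split_last <- function(x, pat = "(.*)_(.*)") {
  data.frame(head = sub(pat, "\\1", x),   # text before the last underscore
             tail = sub(pat, "\\2", x),   # text after the last underscore
             stringsAsFactors = FALSE)
}

parts <- split_last(c("a_b", "a_b_c", "a_b_c_d"))
print(parts)
# head is "a", "a_b", "a_b_c" and tail is "b", "c", "d"
```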

Note: If we did want to handle strings with no underscore at all, so that "xyz" is split into "xyz" and "", use the following for the second sub. The first sub needs no change, since sub leaves a string unchanged when the pattern does not match. The new pattern tries to match the left-hand side of the | first; if that fails (which happens when there are no underscores), the entire string matches the right-hand side and sub replaces it with the empty string.

sub(".*_|^[^_]*$", "", col2)
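To see the effect of the alternation, here is a small sketch on a vector that mixes strings with and without underscores. The vector x and the names head_part and tail_part are made up for this illustration.

```r
# x mixes a string with underscores and one without (illustrative data).
x <- c("a_b_c", "xyz")

# Unchanged first pattern: no match on "xyz", so it is returned as is.
head_part <- sub("(.*)_.*", "\\1", x)

# Alternation: strip through the last underscore, or, when there is
# no underscore at all, match the whole string and replace it with "".
tail_part <- sub(".*_|^[^_]*$", "", x)

print(head_part)  # "a_b" "xyz"
print(tail_part)  # "c"   ""
```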
