Problem description
I've been trying to get a better understanding of how to deal with the output of strsplit. I often have data like this that I want to split:
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
#[1] "144/4/5" "154/2" "146/3/5" "142" "143/4" "DNB" "90"
After splitting, the results are as follows:
strsplit(mydata, "/")
#[[1]]
#[1] "144" "4" "5"
#[[2]]
#[1] "154" "2"
#[[3]]
#[1] "146" "3" "5"
#[[4]]
#[1] "142"
#[[5]]
#[1] "143" "4"
#[[6]]
#[1] "DNB"
#[[7]]
#[1] "90"
I know from the strsplit help page that final empty strings are not produced. Therefore, each of my results will contain 1, 2 or 3 elements, depending on the number of "/" separators in the string.
Getting the first element is trivial:
sapply(strsplit(mydata, "/"), "[[", 1)
#[1] "144" "154" "146" "142" "143" "DNB" "90"
But I am not sure how to get the 2nd, 3rd, ... element when the results contain unequal numbers of elements.
sapply(strsplit(mydata, "/"), "[[", 2)
# Error in FUN(X[[4L]], ...) : subscript out of bounds
I would hope a working solution returns the following:
#[1] "4" "2" "3" "NA" "4" "NA" "NA"
This is a relatively small example. I could easily write a for loop over these data, but for real data with 1000s of observations to run strsplit on, and dozens of elements produced from each, I was hoping to find a more generalizable solution.
Recommended answer
(At least for 1D vectors) [ seems to return NA when "i > length(x)", whereas [[ returns an error.
x = runif(5)
x[6]
#[1] NA
x[[6]]
#Error in x[[6]] : subscript out of bounds
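Applied to the question's data, this difference suggests swapping "[[" for "[" in the sapply call, so that short results yield NA instead of an error. A minimal sketch, reusing mydata from the question (the names second and third are only illustrative):

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

# "[" returns NA for an index past the end of a vector, so every
# list element yields exactly one value and sapply can simplify
# the result to a character vector.
second <- sapply(strsplit(mydata, "/"), "[", 2)
second
# "4" "2" "3" NA "4" NA NA

third <- sapply(strsplit(mydata, "/"), "[", 3)
third
# "5" NA "5" NA NA NA NA
```

Note that the missing values are real NA values (NA_character_), not the string "NA".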
Digging a bit, do_subset_dflt (i.e. [) calls ExtractSubset, where we notice that when a wanted index ("ii") is "> length(x)", NA is returned (modified a bit for clarity):
if(0 <= ii && ii < nx && ii != NA_INTEGER)
    result[i] = x[ii];
else
    result[i] = NA_INTEGER;
On the other hand, do_subset2_dflt (i.e. [[) returns an error if the wanted index ("offset") is "> length(x)" (modified a bit for clarity):
if(offset < 0 || offset >= xlength(x)) {
    if(offset < 0 && (isNewList(x)) ...
    else errorcall(call, R_MSG_subs_o_b);
}
where #define R_MSG_subs_o_b _("subscript out of bounds")
(I'm not sure about the above code snippets, but they do seem relevant based on what they return.)
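For the more general case raised in the question (many observations, many pieces per string), the same NA-padding behaviour of "[" can be used to pull out every position at once. The following is only a sketch under that assumption; the names parts, n_max and pieces are illustrative and not part of the original answer:

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

parts <- strsplit(mydata, "/")

# Largest number of pieces produced by any single string.
n_max <- max(lengths(parts))

# Index every list element with 1:n_max; "[" pads the short ones
# with NA, so all rows have equal length and can be bound together.
pieces <- do.call(rbind, lapply(parts, "[", seq_len(n_max)))

dim(pieces)
# 7 3

pieces[, 2]
# "4" "2" "3" NA "4" NA NA
```

Each column of pieces then holds the n-th element of every split, with NA where a string had fewer pieces.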