Problem description
I have a large data frame in which each row represents a hospital admission. Each admission has up to 20 diagnosis codes, stored in columns 5 to 24.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA
Separately, I have a character vector (risk_codes) of length 136. These strings are risk codes, and each one may correspond to a truncated diagnosis code (e.g. J12 would match J123 and F4 would match F456, but H798 would not match H789).
I want to add a column to the data frame that contains 1 if any of the risk codes matches any of the diagnosis codes in that row. I don't need to know how many match, only that at least one does.
So far, the following attempt has come closest:
for (i in 1:length(risk_codes)) {
  df$newcol <- apply(df, 1, function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}
It works well for a single string, filling the column with 0 where no code matches and 1 where one does. However, the column is overwritten when the second code is checked, and again for each of the remaining elements of the 136-element risk_codes vector, so only the result for the last code survives.
Any ideas, please? Looping over every risk code in every column for every row would not be feasible.
The desired result would look like this
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20 newcol
data data data data J123 F456 H789 E468 1
data data data data T452 NA NA NA 0
if, for example, my risk_codes contained J12, F4 and T543.
Recommended answer
We want to apply grepl with all of the risk_codes at once, so that each row yields a single result. We can do that with sapply and any.
So we can drop the for loop, and your code becomes:
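# Small example data; the real data frame has Diag_1 ... Diag_20 in columns 5 to 24
my_df <- read.table(text="Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header=TRUE)
risk_codes <- c("F456", "XXX") # test codes
diag_cols <- 5:8 # diagnosis columns in this example; use 5:24 on the full data
# For each row: TRUE if any risk code matches any diagnosis code (NA never matches)
my_df$newcol <- apply(my_df, 1, function(x)
  any(sapply(risk_codes,
             function(code) grepl(code, x[diag_cols]))))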
The result is a logical vector.
If you still want 1 and 0 instead of TRUE/FALSE, you just need to finish with:
my_df$newcol <- ifelse(my_df$newcol, 1, 0)
The result will be:
> my_df
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data J123 F456 H789 E468 1
2 data data data data T452 <NA> <NA> <NA> 0
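
The question describes the risk codes as truncated diagnosis codes, which suggests prefix matching, whereas an unanchored grepl matches anywhere in the string. As a possible alternative to the approach above (a minimal sketch, not part of the original answer), all risk codes can be combined into one regular expression anchored at the start and applied column by column. This assumes the codes are plain alphanumeric strings with no regex metacharacters, and that the diagnosis columns share the "Diag_" name prefix used in the example; on the real data the positions 5:24 could be used instead.

```r
# Example data mirroring the question (only 4 of the 20 diagnosis columns shown)
my_df <- read.table(text = "Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header = TRUE, stringsAsFactors = FALSE)

risk_codes <- c("J12", "F4", "T543")  # example codes from the question

# One pattern for all codes, anchored at the start: "^(J12|F4|T543)"
# Assumption: the codes contain no regex metacharacters
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Assumption: diagnosis columns are identified by the "Diag_" name prefix
diag_cols <- grep("^Diag_", names(my_df))

# grepl() is vectorised over each column and returns FALSE for NA entries;
# Reduce() combines the per-column results with a row-wise OR
hits <- Reduce(`|`, lapply(my_df[diag_cols], function(col) grepl(pattern, col)))

# 1 if at least one diagnosis code starts with a risk code, else 0
my_df$newcol <- as.integer(hits)
```

Because this runs one grepl call per diagnosis column rather than one per row and risk code, it may scale better on a large data frame, although that has not been measured here.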