I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA
Separately, I have a vector (risk_codes) of length 136, all strings. These strings are risk codes that can be similar to the truncated diagnosis codes (e.g. J12 would be ok, F4 would be ok, H798 would not).
I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.
So far, I've tried the following with the most success over other attempts:
for (i in 1:length(risk_codes)) {
  df$newcol <- apply(df, 1, function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}
It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.
Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20 newcol
data data data data J123 F456 H789 E468 1
data data data data T452 NA NA NA 0
This is the output I would expect if my risk_codes contained J12, F4 and T543, for example.
We want to apply grepl with all the risk_codes at once, so that we get one result per row. We can do that with sapply and any.
So, we can drop the for loop and your code becomes like this:
my_df <- read.table(text="Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header=TRUE)
risk_codes <- c("F456", "XXX") # test codes
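# For each row, test every risk code against the diagnosis columns
# (columns 5:8 in this small example; 5:24 in the full data frame)
my_df$newcol <- apply(my_df, 1, function(x)
  any(sapply(risk_codes,
             function(codes) grepl(codes, x[c(5:8)]))))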
If you still want to use 1 and 0 instead of the TRUE/FALSE, you just need to finish with:
my_df$newcol <- ifelse(my_df$newcol, 1, 0)
> my_df
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data J123 F456 H789 E468 1
2 data data data data T452 <NA> <NA> <NA> 0
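As a possible alternative for large data frames, the risk codes can be collapsed into one regular expression so that grepl runs once per diagnosis column rather than once per code per row. This is only a sketch under some assumptions: the risk codes are plain alphanumeric strings (no regex metacharacters), a match means the diagnosis code starts with the risk code (hence the `^` anchor, which the answer above does not use), and the diagnosis columns can be picked out by a `Diag_` name prefix. The small data frame and the `pattern`, `diag_cols` and `hits` names are made up for illustration.

```r
# Toy data standing in for the real admissions table (hypothetical values)
my_df <- data.frame(
  Col1   = c("data", "data"),
  Diag_1 = c("J123", "T452"),
  Diag_2 = c("F456", NA),
  Diag_3 = c("H789", NA),
  stringsAsFactors = FALSE
)
risk_codes <- c("J12", "F4", "T543")

# One anchored pattern: "^(J12|F4|T543)" matches codes that start with any risk code
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Positions of the diagnosis columns (assumes they are named Diag_*)
diag_cols <- grep("^Diag_", names(my_df))

# One logical vector per diagnosis column; grepl returns FALSE for NA
hits <- lapply(my_df[diag_cols], function(col) grepl(pattern, col))

# Combine the columns with element-wise OR, then convert TRUE/FALSE to 1/0
my_df$newcol <- as.integer(Reduce(`|`, hits))
```

With the real data, `diag_cols` could simply be `5:24`. If substring matching is wanted instead of prefix matching, the `^` can be dropped from `pattern`.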