Finding the corresponding row in a data frame based on a value that falls in the range columnA:columnB
Problem description
I have a data.frame and a vector like:
df = data.frame(id = 1:3,
                start = c(1, 1000, 16000),
                end = c(100, 1100, 16100),
                info = c("a", "b", "c"))
vec = cbind(id = 1:150, pos = c(sample(1:100, 50),
                                sample(1000:1100, 50),
                                sample(1600:16100, 50)))
For every value of vec I want to find the corresponding row in df where:
- vec$pos >= df$start
- vec$pos <= df$end
- vec$id == df$id
so that I can extract the info column.
The problem is that df is 1000 rows long and vec is 2 million values long, so looping over vec with sapply is slow. Can anyone do it by looping over df instead?
Recommended answer
Turn vec into intervals and use data.table::foverlaps.
library(data.table)
# Make df a data.table and set key
setDT(df)
setkey(df, start, end)
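# Turn the positions into a data.table of zero-width intervals (start == end).
# vec in the question is a matrix with columns id and pos, so take the pos column.
# Note: this matches on position only; the id condition is not applied here.
vec <- data.table(start = vec[, "pos"], end = vec[, "pos"])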
setkey(vec, start, end)
# Apply overlaps for each vec entry
# This will get only those vec entries that overlap with df
foverlaps(vec, df, nomatch = NULL)
# Or if you want only info and vec column use:
foverlaps(vec, df, mult = "first", nomatch = NULL)[, .(info, vec = i.start)]
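The overlap join above matches on position only, while the question also asks for vec$id == df$id. One way this might be handled is a data.table non-equi join that puts the id equality next to the range conditions. The following is a minimal sketch under assumptions, not the answerer's method: the names df_dt and vec_dt and the five sample rows are made up for illustration, and vec is assumed to be available as a table with id and pos columns.

```r
library(data.table)

# Lookup table of intervals, same shape as df in the question
df_dt <- data.table(id    = 1:3,
                    start = c(1, 1000, 16000),
                    end   = c(100, 1100, 16100),
                    info  = c("a", "b", "c"))

# Hypothetical positions to look up (illustration only)
vec_dt <- data.table(id  = c(1L, 1L, 2L, 3L, 3L),
                     pos = c(50, 500, 1050, 16050, 20))

# Update join: for each row of df_dt, find the rows of vec_dt with the same id
# and start <= pos <= end, and copy info over. Rows without a match keep NA.
vec_dt[df_dt, on = .(id, pos >= start, pos <= end), info := i.info]

print(vec_dt)
```

In this sketch the last row (id 3, pos 20) stays NA even though 20 lies inside the interval of id 1, because the id values differ.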
I tested it on dummy data (same dimensions as the OP's), and each call takes about 4.3 seconds (median of 100 runs).
df <- data.table(start = sample(1:1e7, 1e3),
                 info = sample(letters, 1e3, replace = TRUE))
df$end <- df$start + 10
setkey(df, start, end)
vec <- sample(2e6)
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)
microbenchmark::microbenchmark(
  foverlaps(vec, df, mult = "first", nomatch = NULL)
)
# Unit: seconds
#                                                expr      min       lq     mean   median       uq     max neval
#  foverlaps(vec, df, mult = "first", nomatch = NULL) 4.255962 4.274029 4.304148 4.294534 4.329679 4.45406   100
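For comparison, the question literally asks whether the lookup can be done by looping over df instead of vec. A base R sketch of that idea is below; the function name lookup_info_by_loop and the sample data are hypothetical, and no timing is claimed for it. Each iteration does one vectorised comparison across all positions, so the loop runs once per df row rather than once per value.

```r
# Hypothetical helper: loop over the intervals (rows of df), not over the values
lookup_info_by_loop <- function(df, pos) {
  out <- rep(NA_character_, length(pos))
  for (i in seq_len(nrow(df))) {
    # Vectorised test of every position against interval i
    hit <- pos >= df$start[i] & pos <= df$end[i]
    # If intervals overlap, later rows overwrite earlier ones
    out[hit] <- as.character(df$info[i])
  }
  out
}

df_small <- data.frame(id    = 1:3,
                       start = c(1, 1000, 16000),
                       end   = c(100, 1100, 16100),
                       info  = c("a", "b", "c"))

res <- lookup_info_by_loop(df_small, c(50, 500, 1050, 16100))
print(res)
```

Like the foverlaps call, this sketch matches on position only; adding an id check would mean extending the hit condition with a comparison against df$id[i].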