Subset a data frame recursively

Question
I have a data frame with close to 4 million rows in it. I need an efficient way to subset the data based on two criteria. I could do this in a for loop, but was wondering whether there is a more elegant, and obviously more efficient, way to do it. The data.frame looks like this:

```
SNP        CHR  BP        P
rs1000000  chr1 126890980 0.000007
rs10000010 chr4 21618674  0.262098
rs10000012 chr4 1357325   0.344192
rs10000013 chr4 37225069  0.726325
rs10000017 chr4 84778125  0.204275
rs10000023 chr4 95733906  0.701778
rs10000029 chr4 138685624 0.260899
rs1000002  chr3 183635768 0.779574
rs10000030 chr4 103374154 0.964166
rs10000033 chr2 139599898 0.111846
rs10000036 chr4 139219262 0.564791
rs10000037 chr4 38924330  0.392908
rs10000038 chr4 189176035 0.971481
rs1000003  chr3 98342907  0.000004
rs10000041 chr3 165621955 0.573376
rs10000042 chr3 5237152   0.834206
rs10000056 chr4 189321617 0.268479
rs1000005  chr1 34433051  0.764046
rs10000062 chr4 5254744   0.238011
rs10000064 chr4 127809621 0.000044
rs10000068 chr2 36924287  0.000003
rs10000075 chr4 179488911 0.100225
rs10000076 chr4 183288360 0.962476
rs1000007  chr2 237752054 0.594928
rs10000081 chr1 17348363  0.517486
rs10000082 chr1 167310192 0.261577
rs10000088 chr1 182605350 0.649975
rs10000092 chr4 21895517  0.000005
rs10000100 chr4 19510493  0.296693
```

The first thing I need to do is select those SNPs with a P value lower than a threshold, then order this subset by CHR and BP. That is the easy part, using subset and order. However, the next step is the tricky one. Once I have this subset, I need to fetch all the SNPs that fall into a 500,000 window up and down from each significant SNP; this step will define a region. I need to do it for all the significant SNPs and store each region in a list or something similar for further analysis. For example, in the displayed data frame the most significant SNP (i.e. below a threshold of 0.001) for CHR == "chr1" is rs1000000 and for CHR == "chr4" it is rs10000092.
Thus these two SNPs would define two regions, and in each of those regions I need to fetch the SNPs that fall within 500,000 up and down of the position (BP) of the most significant SNP. I know it's a bit complicated; right now I am doing the tricky part by hand, but it takes a long time. Any help would be appreciated.

Solution

Here is a partial solution in R using data.table, which is probably the fastest way to go in R when dealing with large datasets.

```r
library(data.table)  # v1.9.7 (devel version)

df <- fread("C:/folderpath/data.csv")  # load your data
setDT(df)                              # convert your dataset into a data.table

# 1st step: filter data under the 0.05 threshold and sort by CHR, BP
df <- df[P < 0.05, ][order(CHR, BP)]

# 2nd step: per chromosome, take a window around the most significant SNP
df[, {idx = (1:.N)[which.min(P)]
      SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]

# Saving output in different files (one file per SNP)
df[, fwrite(copy(.SD)[, SNP := SNP], paste0("output", SNP, ".csv")), by = SNP]
```

ps. Note that this answer uses fwrite, which is still in the development version of data.table; see data.table's installation instructions for how to get it. You could simply use write.csv, but you are dealing with a big dataset, so speed is quite important, and fwrite is certainly one of the fastest alternatives.
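Note that the window in the 2nd step above is taken over row indices (idx ± 5e5), not over base-pair positions. If the 500,000 window is meant in base pairs, as the question describes, a position-based variant can be sketched as follows. This is a minimal sketch on a handful of the rows shown above; the 0.001 threshold and the `regions`/`region_list` names are illustrative, not part of the original answer.

```r
library(data.table)

# A toy subset of the rows shown in the question (columns SNP, CHR, BP, P)
df <- data.table(
  SNP = c("rs1000000", "rs10000033", "rs10000068",
          "rs10000010", "rs10000092", "rs10000100"),
  CHR = c("chr1", "chr2", "chr2", "chr4", "chr4", "chr4"),
  BP  = c(126890980, 139599898, 36924287, 21618674, 21895517, 19510493),
  P   = c(0.000007, 0.111846, 0.000003, 0.262098, 0.000005, 0.296693)
)

setkey(df, CHR, BP)  # order by chromosome, then position

# Per chromosome: if the best P value clears the 0.001 threshold, keep
# every SNP within +/- 500,000 bp of that top SNP's position
regions <- df[, {
  top <- BP[which.min(P)]
  if (min(P) < 0.001) .SD[abs(BP - top) <= 5e5] else .SD[0]
}, by = CHR]

# One data.table per region, keyed by chromosome, for further analysis
region_list <- split(regions, by = "CHR")
```

On the toy data this keeps rs1000000 (chr1), rs10000068 (chr2), and both rs10000010 and rs10000092 on chr4, since rs10000010 lies 276,843 bp from the top chr4 SNP, while rs10000100 falls outside the window.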
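For the saving-to-different-files step, if the development build with fwrite is not an option, base R's write.csv produces the same files, just more slowly. A minimal sketch, writing one file per chromosome; the toy data frame and the "region_" file-name prefix are illustrative assumptions.

```r
# Toy data frame standing in for one already reduced to the SNPs of interest
df <- data.frame(
  SNP = c("rs1000000", "rs10000068", "rs10000092"),
  CHR = c("chr1", "chr2", "chr4"),
  BP  = c(126890980, 36924287, 21895517),
  P   = c(0.000007, 0.000003, 0.000005)
)

# One CSV per chromosome (file names are illustrative)
for (piece in split(df, df$CHR)) {
  write.csv(piece,
            file = paste0("region_", piece$CHR[1], ".csv"),
            row.names = FALSE)
}
```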