Fastest way to sort by column in R

Problem description
I have a data frame full from which I want to take the last column and a column v. I then want to sort both columns on v in the fastest way possible. full is read in from a csv, but this can be used for testing (it includes some NAs for realism):

    n <- 200000
    full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
    full[sample(n, 10000), 'A'] <- NA
    v <- 1

I have v as one here, but in reality it could change, and full has many columns.

I have tried sorting data frames, data tables and matrices, each with order and sort.list (some ideas taken from this thread). The code for all these:

    # DATA FRAME
    ord_df <- function() {
      a <- full[c(v, length(full))]
      a[with(a, order(a[1])), ]
    }

    sl_df <- function() {
      a <- full[c(v, length(full))]
      a[sort.list(a[[1]]), ]
    }

    # DATA TABLE
    require(data.table)

    ord_dt <- function() {
      a <- as.data.table(full[c(v, length(full))])
      colnames(a)[1] <- 'values'
      a[order(values)]
    }

    sl_dt <- function() {
      a <- as.data.table(full[c(v, length(full))])
      colnames(a)[1] <- 'values'
      a[sort.list(values)]
    }

    # MATRIX
    ord_mat <- function() {
      a <- as.matrix(full[c(v, length(full))])
      a[order(a[, 1]), ]
    }

    sl_mat <- function() {
      a <- as.matrix(full[c(v, length(full))])
      a[sort.list(a[, 1]), ]
    }

Time results (elapsed seconds from system.time):

           ord_df  sl_df ord_dt sl_dt ord_mat sl_mat
    Min.    0.230 0.1500 0.1300 0.120   0.140 0.1400
    Median  0.250 0.1600 0.1400 0.140   0.140 0.1400
    Mean    0.244 0.1610 0.1430 0.136   0.142 0.1450
    Max.    0.250 0.1700 0.1600 0.140   0.160 0.1600

Or using microbenchmark (results are in milliseconds):

                   min       lq   median       uq      max
    1  ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
    2  ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
    3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
    4   sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
    5   sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
    6  sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416

Seems like ordering the data table wins. There isn't all that much difference between order and sort.list, except when using data frames, where sort.list is much faster.

In the data table versions I also tried setting v as the key (since it is then sorted, according to the documentation), but I couldn't get it to work since the contents of v are not integer.

I would ideally like to speed this up as much as possible, since I have to do it many times for different v values. Does anyone know how I might be able to speed this process up even further? Also, might it be worth trying an Rcpp implementation? Thanks.

Here's the code I used for timing, if it's useful to anyone:

    sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)

    require(plyr)
    timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
    colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
    apply(timings, 2, summary)

    require(microbenchmark)
    mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
    plot(mb)

Solution

I don't know if it's better to put this sort of thing in as an edit, but it seems more like an answer, so here will do.
Updated test functions:

    require(data.table)  # loaded first, since as.data.table is used in the setup

    n <- 1e7
    full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
    full[sample(n, 100000), 'A'] <- NA

    fdf <- full
    fma <- as.matrix(full)
    fdt <- as.data.table(full)
    setnames(fdt, colnames(fdt)[1], 'values')

    # DATA FRAME
    ord_df <- function() { fdf[order(fdf[1]), ] }
    sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }

    # DATA TABLE
    ord_dt <- function() { fdt[order(values)] }

    key_dt <- function() {
      setkey(fdt, values)  # sorts fdt by reference (in place)
      fdt
    }

    # MATRIX
    ord_mat <- function() { fma[order(fma[, 1]), ] }
    sl_mat <- function() { fma[sort.list(fma[, 1]), ] }

Results (using a different computer, R 2.13.1 and data.table 1.8.2):

            ord_df sl_df ord_dt key_dt ord_mat sl_mat
    Min.     37.56 20.86  2.946  2.249   20.22  20.21
    1st Qu.  37.73 21.15  2.962  2.255   20.54  20.59
    Median   38.43 21.74  3.002  2.280   21.05  20.82
    Mean     38.76 21.75  3.074  2.395   21.09  20.95
    3rd Qu.  39.85 22.18  3.151  2.445   21.48  21.42
    Max.     40.36 23.08  3.330  2.797   22.41  21.84

So data.table is the clear winner. Using a key is faster than ordering, and has a nicer syntax as well, I'd argue. Thanks for the help everyone.
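Since the question says v changes between runs and full has many columns, here is a minimal sketch of how the data.table approach might be wrapped for an arbitrary column index. The helper name sort_by_col is hypothetical and not part of the original post. It assumes v is not the last column, and it uses setorderv, which was added to data.table after the 1.8.2 release benchmarked above, so it needs a more recent version. No timings are claimed for it.

```r
library(data.table)

# Hypothetical helper (not from the original post): take column `v` and the
# last column of `dt`, then sort that two-column table by its first column.
# Assumes `v` is a column index different from ncol(dt).
sort_by_col <- function(dt, v) {
  sub <- dt[, c(v, ncol(dt)), with = FALSE]  # new two-column table; `dt` is untouched
  # Sort `sub` in place; na.last = TRUE puts NAs at the end, like order() does.
  setorderv(sub, names(sub)[1L], na.last = TRUE)
  sub
}

# Small test data in the same shape as the question's `full`.
set.seed(1)
n <- 1000
full <- data.table(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 50), A := NA_real_]

res <- sort_by_col(full, 1L)
```

Unlike setkey, this leaves the original table in its original row order, which may matter when the same table has to be sorted repeatedly on different columns.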