R: data.table operation on a subset, excluding a value

Problem description

Using data.table in R, I'm trying to perform an operation on a subset that excludes a selected element. I'm using the by operator, but I don't know whether this is the right approach.

Here's an example. E.g. the value for Delta in IAH:SNA is (3+3)/2, which is the mean of Stops in IAH:SNA once Delta has been excluded.

    library(data.table)

    s1 <- "Market Carrier Stops
    IAH:SNA Delta 1
    IAH:SNA Delta 1
    IAH:SNA Southwest 3
    IAH:SNA Southwest 3
    MSP:CLE Southwest 2
    MSP:CLE Southwest 2
    MSP:CLE American 2
    MSP:CLE JetBlue 1"

    d <- data.table(read.table(textConnection(s1), header=TRUE))
    setkey(d, Carrier, Market)

    ## mean of Stops in market y over every carrier except x
    f <- function(x, y){
      subset(d, !(Carrier %in% x) & Market == y, Stops)[, mean(Stops)]
    }

    d[, s := f(.BY[[1]], .BY[[2]]), by=list(Carrier, Market)]
    ##     Market   Carrier Stops        s
    ## 1: MSP:CLE  American     2 1.666667
    ## 2: IAH:SNA     Delta     1 3.000000
    ## 3: IAH:SNA     Delta     1 3.000000
    ## 4: MSP:CLE   JetBlue     1 2.000000
    ## 5: IAH:SNA Southwest     3 1.000000
    ## 6: IAH:SNA Southwest     3 1.000000
    ## 7: MSP:CLE Southwest     2 1.500000
    ## 8: MSP:CLE Southwest     2 1.500000

The above solution performs very poorly on large data sets (it's essentially an mapply), but I'm not sure how to do it in a fast data.table-like way.

Perhaps one could (dynamically) generate a factor that does this? I'm just not sure how... Is there a way to improve this?

Edit: Just for the heck of it, here's a way to get a bigger version of the above:

    library(data.table)

    dl.dta <- function(...){
      ## input years ..
      years <- gsub("\\.", "_", c(...))
      baseurl <- "http://www.transtats.bts.gov/Download/"
      names <- paste("Origin_and_Destination_Survey_DB1BMarket", years, sep="_")
      ## one row per file; columns: zip exists, csv exists
      info <- t(sapply(names, function(x) file.exists(paste(x, c("zip", "csv"), sep="."))))
      to.download <- paste(baseurl, names, ".zip", sep="")[!apply(info, 1, any)]
      if (length(to.download) > 0){
        message("starting download...")
        sapply(to.download, function(x) download.file(x, rev(strsplit(x, "/")[[1]])[1]))
      }
      to.unzip <- paste(names, "zip", sep=".")[!info[, 2]]
      if (length(to.unzip) > 0){
        message("starting to unzip...")
        sapply(to.unzip, unzip)
      }
      paste(names, "csv", sep=".")
    }

    countWords.split <- function(x, s=":"){
      ## Faster on my machine than grep for some reason
      sapply(strsplit(as.character(x), s), length)
    }

    countWords.grep <- function(x){
      sapply(gregexpr("\\W+", x), length)+1
    }

    fname <- dl.dta(2013.1)
    cols <- rep("NULL", 41)
    ## Columns to keep: 9 is Origin, 18 is Dest, 24 is groups of airports in travel
    ## 30 is RPcarrier (reporting carrier).
    ## For more columns: 35 is market fare and 36 is distance.
    cols[9] <- cols[18] <- cols[24] <- cols[30] <- NA
    d <- data.table(read.csv(file=fname, colClasses=cols))
    d[, Market := paste(Origin, Dest, sep=":")]
    ## should probably
    d[, Stops := -2 + countWords.split(AirportGroup)]
    d[, Carrier := RPCarrier]
    d[, c("RPCarrier", "Origin", "Dest", "AirportGroup") := NULL]

Recommended answer

@Roland's answer will work for some functions (and when it does it will be best), but not in general. Unfortunately you can't apply the split-apply-combine strategy to the data as is to do the task, but you can if you make the data larger.
Let's start with a simpler example:

    dt = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = 'a')

    # now let's extend this table the following way
    # take the unique a's and construct all the combinations excluding one element
    combinations = dt[, combn(unique(a), 2)]

    # now combine this into a data.table with the excluded element as the index
    # and merge it back into the original data.table
    extension = rbindlist(apply(combinations, 2,
                      function(x) data.table(a = x, index = setdiff(c(1,2,3), x))))
    setkey(extension, a)

    dt.extended = extension[dt, allow.cartesian = TRUE]
    dt.extended[order(index)]
    #    a index b
    # 1: 2     1 3
    # 2: 2     1 4
    # 3: 3     1 5
    # 4: 3     1 6
    # 5: 1     2 1
    # 6: 1     2 2
    # 7: 3     2 5
    # 8: 3     2 6
    # 9: 1     3 1
    #10: 1     3 2
    #11: 2     3 3
    #12: 2     3 4

    # Now we have everything we need:
    dt.extended[, mean(b), by = list(a = index)]
    #   a  V1
    #1: 3 2.5
    #2: 2 3.5
    #3: 1 4.5

Going back to the original data (and doing some operations slightly differently, to simplify expressions):

    extension = d[, {Carrier.uniq = unique(Carrier);
                     .SD[, rbindlist(combn(Carrier.uniq, length(Carrier.uniq)-1,
                            function(x) data.table(Carrier = x,
                                                   index = setdiff(Carrier.uniq, x)),
                            simplify = FALSE))]},
                  by = Market]
    setkey(extension, Market, Carrier)

    extension[d, allow.cartesian = TRUE][, mean(Stops), by = list(Market, Carrier = index)]
    #    Market   Carrier       V1
    #1: IAH:SNA Southwest 1.000000
    #2: IAH:SNA     Delta 3.000000
    #3: MSP:CLE   JetBlue 2.000000
    #4: MSP:CLE Southwest 1.500000
    #5: MSP:CLE  American 1.666667
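The answer mentions that a simpler approach works for some functions. For the mean specifically, one such shortcut is to subtract each carrier's own sum and count from the market totals, which avoids enlarging the table. The sketch below is only an illustration of that arithmetic on the question's toy data; the source does not show @Roland's code, so this may not match that answer, and the helper column names (`m.sum`, `m.n`, `c.sum`, `c.n`) are made up here.

```r
library(data.table)

# Same toy data as in the question, built directly instead of via read.table.
d <- data.table(
  Market  = c("IAH:SNA", "IAH:SNA", "IAH:SNA", "IAH:SNA",
              "MSP:CLE", "MSP:CLE", "MSP:CLE", "MSP:CLE"),
  Carrier = c("Delta", "Delta", "Southwest", "Southwest",
              "Southwest", "Southwest", "American", "JetBlue"),
  Stops   = c(1, 1, 3, 3, 2, 2, 2, 1)
)

# Totals per market (hypothetical helper columns).
d[, `:=`(m.sum = sum(Stops), m.n = .N), by = Market]

# Totals per carrier within each market.
d[, `:=`(c.sum = sum(Stops), c.n = .N), by = list(Market, Carrier)]

# Mean of Stops over all *other* carriers in the same market.
# A market with a single carrier gives 0/0, i.e. NaN.
d[, s := (m.sum - c.sum) / (m.n - c.n)]

# Drop the helper columns again.
d[, c("m.sum", "m.n", "c.sum", "c.n") := NULL]

print(d)
```

This only works because a mean can be rebuilt from sums and counts. For a function such as the median there is no such decomposition, which is where the extended-table approach above is still needed.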