本文介绍了data.table join(在vecseq中的错误)是关键的必要在X和i?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述 我是新的R和 data.table ,我觉得有用和快速。我想加入2个数据表: > TotFreq Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType 1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM 2:113802 1 2013-03-24 000000000126 1.104030e -03 0 47.50 FM 3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM 4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM 5:113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM 160167:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM 160168:113802 7 2013-03 -23 978141970584 1.770796e-06 25 1.00 SM 160169:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM 160170:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM 160171:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM > Count_SM_FM Legacy_Store_Number WeekDay ItemType物件 1:113802 1 SM 12305 2:113802 1 FM 1942 3:113802 2 SM 11014 4:113802 2 FM 1398 5:113802 3 SM 10154 6:113802 3 FM 1117 7:113802 4 SM 10414 8:113802 4 FM 1167 9:113802 5 SM 10258 10:113802 5 FM 1200 11:113802 6 SM 11116 12:113802 6 FM 1575 13:113802 7 SM 13098 14:113802 7 FM 2326 > setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)> > ResultJoin< - TotFreq [Count_SM_FM] 在vecseq中出错(f__,len__,if(allow.cartesian)NULL else as.integer(max(nrow(x),:加入结果为320342行;超过160171 = max(nrow(x),nrow(i))。检查i中的重复键值,每次重复加入x中的同一个组,如果可以,尝试包括`j`如果你确定你想继续,重新运行allow.cartesian = TRUE,否则,请搜索这个错误消息 但是我没有重复的键 i ! 使用: > ResultJoin > > ResultJoin Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType。 1目录 1:113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305 2:113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305 3:113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305 4:113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305 5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305 --- 320338:113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326 320339:113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326 320340:113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326 320341:113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326 320342:113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326 我实际上是原来的 TotFreq 表中的记录的两倍。如果我在 Count_SM_FM 上添加键,则连接正常工作: ; setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)> ResultJoin< - TotFreq [Count_SM_FM] > > ResultJoin Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems 1:113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942 2:113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942 3:113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942 4:113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942 5:113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942 --- 160167:113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098 160168:113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098 160169:113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098 160170:113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098 160171:113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098 我试图验证一个例子,也许问题是没有关键变量作为 TotFreq 或没有 Count_SM_FM 未排序,但无法重现此错误 > daysType< - data.table( + key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1), + key2 = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7), + key3 = c(b,a a,b,a,b,a,b,a,b,a,b + var1 = c(2,4,6,8,4,5,7,3,7,9,6,3,5,6) +)> > > detailData< - data.table( + key1 = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1), + key2 = c(1,1,1,1,1,1,1,2,2,2,2,2,2 ,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6 ,6,6,6,7,7,7,7,7,7,7,7), + var2 = c(10,11,12,13,15,16,17,10,11 ,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16 ,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17), + var3 = c(1,2,4,6,6 ,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7 ,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9), + key3 = c( a,a,a,a,b,b,b,a ,a,a,a,a,b,b,a b,b,a,a,a,a,b,b ,a,b,b,b,b) + )> > setkey(detailData,key1,key2,key3)> JoinResult 问题与问题不同 加入两个data.tables失败 allow.cartesian 解决了问题。 这里有什么问题?为什么要添加键到 Count_SM_FM 可以解决它? 谢谢!解决方案 / strong> Arun在v1.9.5中修复: allow.cartesian 现在在 i 没有重复项时被忽略, #742 和#508 。感谢@nigmastar,@ user3645882和其他报告。 上一个答案... 首先让我们来处理 allow.cartesian 部分。错误消息应该可以改变,指出你可以得到大尺寸,即使你没有在 i 中的重复,但你有左侧的重复 data.table 。这里有一个简单的例子: dt1 = data.table(a = c(1,1),b = key ='a') dt2 = data.table(a = c(1,2),c = 3:4) dt1 [dt2]#这给出一个错误,因为连接结果为3行,如下所示 dt1 [dt2,allow.cartesian = TRUE] #abc #1:1 1 3 #2: 1 2 3 #3:2 NA 4 现在, - 不需要为 i 设置键,它将假定前几列是键。查看您的第一个加入结果,可以看到不已加入 ItemType ,并且您使用的是旧的 data.table version(我使用的是1.9.3)。所以我的猜测是,或者你没有真正设置的键正确,并没有包括 ItemType 或者有一些错误在旧版本,自那以后被修复。 p> I am new to R and to data.table, which I find useful and fast. I am trying to join 2 data tables:> TotFreq Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType 1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM 2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM 3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM 4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM 5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM160167: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM160168: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM160169: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM160170: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM160171: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM> Count_SM_FM Legacy_Store_Number WeekDay ItemType ObjItems 1: 113802 1 SM 12305 2: 113802 1 FM 1942 3: 113802 2 SM 11014 4: 113802 2 FM 1398 5: 113802 3 SM 10154 6: 113802 3 FM 1117 7: 113802 4 SM 10414 8: 113802 4 FM 1167 9: 113802 5 SM 1025810: 113802 5 FM 120011: 113802 6 SM 1111612: 113802 6 FM 157513: 113802 7 SM 1309814: 113802 7 FM 2326> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)> > ResultJoin <- TotFreq[Count_SM_FM]Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.But I do not have duplicate keys in i!Using:> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]> > ResultJoin Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems 1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305 2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305 3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305 4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305 5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305 --- 320338: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326320339: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326320340: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326320341: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326320342: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326I get in fact double the records that I had in my original TotFreq table. If I add a key also on Count_SM_FM the join works:> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)> ResultJoin <- TotFreq[Count_SM_FM]> > ResultJoin Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems 1: 113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942 2: 113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942 3: 113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942 4: 113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942 5: 113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942 --- 160167: 113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098160168: 113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098160169: 113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098160170: 113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098160171: 113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098I tried to verify with an example, maybe the problem was in not having the key variables as the first columns in TotFreq or in not having Count_SM_FM not sorted but I couldn't reproduce the error> daysType <- data.table(+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),+ key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),+ key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),+ var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)+ ) > > > detailData <- data.table(+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),+ key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),+ var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),+ var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),+ key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")+ ) > > setkey(detailData,key1,key2,key3)> JoinResult <- detailData[daysType]The problem is different than that in question Join of two data.tables failsbecause there the allow.cartesian solves the issue.What is the problem here? Why adding the key to Count_SM_FM solves it? Thanks! 解决方案 Update Oct 2014: Arun fixed it in v1.9.5 : allow.cartesian is now ignored when i has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for the reports.Previous answer ...First let's address the allow.cartesian part. The error message should probably be changed to point out that you can get large sizes even if you don't have duplicates in i, but you have duplicates in the left hand side data.table. Here's a simple example:dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')dt2 = data.table(a = c(1,2), c = 3:4)dt1[dt2] # this gives an error, because join results in 3 rows, as seen belowdt1[dt2, allow.cartesian = TRUE]# a b c#1: 1 1 3#2: 1 2 3#3: 2 NA 4Now as far as setting the key goes - no you don't need to set the key for i, it will just assume the first few columns are the keys. Looking at your first join result one can see that it was not joined on ItemType and that you're using an older data.table version (I'm using 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType or there was some bug in older versions that's been fixed since then. 这篇关于data.table join(在vecseq中的错误)是关键的必要在X和i?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持! 10-30 18:11