Secondary key ("index" attribute) in data.table is lost when the table is "copied" by selecting columns

Problem description

I have a data.table myDT, and I make "copies" of this table in three different ways:

    myDT <- data.table(colA = 1:3)
    myDT[colA == 3]
    copy1 <- copy(myDT)
    copy2 <- myDT             # yes, I know this is a reference, not a real copy
    copy3 <- myDT[, .(colA)]  # I list all columns from the original table

Then I compare those copies with the original table:

    identical(myDT, copy1)
    # TRUE
    identical(myDT, copy2)
    # TRUE
    identical(myDT, copy3)
    # FALSE

I tried to figure out what the difference between myDT and copy3 was:

    identical(names(myDT), names(copy3))
    # TRUE
    all.equal(myDT, copy3, check.attributes=FALSE)
    # TRUE
    all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
    # TRUE
    attr.all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
    # NULL
    all.equal(myDT, copy3)
    # [1] "Attributes: < Length mismatch: comparison on first 1 components >"
    attr.all.equal(myDT, copy3)
    # [1] "Attributes: < Names: 1 string mismatch >"
    # [2] "Attributes: < Length mismatch: comparison on first 3 components >"
    # [3] "Attributes: < Component 3: Attributes: < Modes: list, NULL > >"
    # [4] "Attributes: < Component 3: Attributes: < names for target but not for current > >"
    # [5] "Attributes: < Component 3: Attributes: < current is not list-like > >"
    # [6] "Attributes: < Component 3: Numeric: lengths (0, 3) differ >"

My original question was how to interpret that last output. In the end I resorted to the attributes() function:

    attr0 <- attributes(myDT)
    attr3 <- attributes(copy3)
    str(attr0)
    str(attr3)

This showed that the original data.table had an index attribute which was not copied when I created copy3.

Solution

To make this question a bit clearer (and maybe useful for future readers): a secondary key was set on myDT. Either you set it explicitly by calling set2key (probably not), or data.table set it automatically while you were performing an ordinary operation such as filtering. This is a (not so) new feature added in v1.9.4. The release notes say:

"DT[column==value] and DT[column %in% values] are now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==value] is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE)."

Let's reproduce this:

    myDT <- data.table(A = 1:3)
    options(datatable.verbose = TRUE)
    myDT[A == 3]
    # Creating new index 'A'  <~~~~ Here it is
    # forder took 0 sec
    # Coercing double column i.'V1' to integer to match type of x.'A'. Please avoid coercion for efficiency.
    # Starting bmerge ...done in 0 secs
    #    A
    # 1: 3
    attr(myDT, "index") # or using `key2(myDT)`
    # integer(0)
    # attr(,"__A")
    # integer(0)

So, contrary to what you assumed, selecting the columns with myDT[, .(colA)] actually did create a copy, and the secondary key was not transferred to it. Compare a plain assignment (a reference) with a column selection:

    copy1 <- myDT
    attr(copy1, "index")
    # integer(0)
    # attr(,"__A")
    # integer(0)
    copy2 <- myDT[, .(A)]
    # Detected that j uses these columns: A  <~~~ This is where the copy occurs
    attr(copy2, "index")
    # NULL
    identical(myDT, copy1)
    # [1] TRUE
    identical(myDT, copy2)
    # [1] FALSE

And for some further validation:

    tracemem(myDT)
    # [1] "<00000000159CBBB0>"
    tracemem(copy1)
    # [1] "<00000000159CBBB0>"
    tracemem(copy2)
    # [1] "<000000001A5A46D8>"

Arguably the most interesting conclusion here is that `[.data.table` does create a copy, even when the contents of the object remain unchanged.
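For readers on a newer data.table: the set2key() and key2() names quoted above were experimental, and to my knowledge later releases expose the same feature as setindex() and indices(). The following is a minimal sketch, under that assumption, of how to tell whether two tables differ only by their "index" attribute. The variable names (fresh, stripped, same_strict, same_data) are illustrative and do not come from the original post, and auto-indexing is switched off only so that the single index in play is the one created explicitly.

```r
library(data.table)

# Disable auto-indexing for this sketch so no index appears implicitly.
old_opts <- options(datatable.auto.index = FALSE)

myDT <- data.table(colA = 1:3)
setindex(myDT, colA)                 # explicit secondary index on colA
idx <- indices(myDT)                 # names of the indices set on myDT
has_index <- !is.null(attr(myDT, "index"))

# Same data, but this table was never indexed.
fresh <- data.table(colA = 1:3)

# identical() also compares attributes, so the extra "index" attribute matters.
same_strict <- identical(myDT, fresh)

# Ignoring attributes compares the data only.
same_data <- isTRUE(all.equal(myDT, fresh, check.attributes = FALSE))

# Alternative: drop the index from a deep copy before comparing.
stripped <- copy(myDT)
setattr(stripped, "index", NULL)

options(old_opts)                    # restore the previous option value
```

If the strict comparison fails while the attribute-free comparison succeeds, the difference lies in the attributes (here, the index) rather than in the data, which is the situation described in the question.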