This article covers fast single-item lookup in R, comparing a native list, a data.table, and a hash; it may be a useful reference for anyone facing the same problem.

Problem Description
One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop, and using profvis I found that the lookup from the data.table was the most costly part of the loop. I then decided to try to find the fastest way to do a single-item lookup in R.

The data often takes the form of a data.table with a key column of character type. The remaining columns are typically numeric values. I tried to create a random table with characteristics similar to what I often deal with, which means >100K rows. I compared a native list, the data.table package, and the hash package. The native list and data.table were comparable for individual item lookup performance; hash appeared to be two orders of magnitude faster. The tests were made up of 10 sets of 10,000 keys randomly sampled to provide for variance in access behavior.
Each lookup method used the same sets of keys.

Ultimately my preference would be to either get the row lookup for data.table to be faster instead of having to create a hash table of my data, or establish that it cannot be done and just use the hash package when I have to do fast lookup. I don't know if it would be possible, but could you create a hash table of references to the rows in the data.table to allow for fast lookup using the hash package? I know that type of thing is possible in C++ but to my knowledge R does not allow this kind of thing due to the lack of pointers.

To Summarize:

1) Am I using data.table correctly for the lookups and therefore this is the speed I should expect for a single row lookup?

2) Would it be possible to create a hash of pointers to the data.table rows to allow for fast lookup that way?

Test System:

Windows 10 Pro x64
R 3.2.2
data.table 1.9.6
hash 2.2.6
Intel Core i7-5600U with 16 GB RAM

Code:

library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)

# Set seed to 42 to ensure repeatability
set.seed(42)

# Setting up test ------
# Generate product ids
product_ids <- as.vector(
  outer(LETTERS[seq(1, 26, 1)],
    outer(
      outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep = ""),
      LETTERS[seq(1, 26, 1)], paste, sep = ""
    ),
  paste, sep = ""
))

# Create test lookup data
test_lookup_list <- lapply(product_ids, function(id){
  return_list <- list(
    product_id = id,
    val_1 = rnorm(1),
    val_2 = rnorm(1),
    val_3 = rnorm(1),
    val_4 = rnorm(1),
    val_5 = rnorm(1),
    val_6 = rnorm(1),
    val_7 = rnorm(1),
    val_8 = rnorm(1)
  )
  return(return_list)
})

# Set names of items in list
names(test_lookup_list) <- sapply(test_lookup_list, function(elem) elem[['product_id']])

# Create lookup hash
lookup_hash <- hash(names(test_lookup_list), test_lookup_list)

# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)

test_lookup_env <- list2env(test_lookup_list)

# Generate sample of keys to be used for speed testing
lookup_tests <- lapply(1:10, function(x){
  lookups <- sample(test_lookup_dt$product_id, 10000)
  return(lookups)
})

# Native list timing
native_list_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_list[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Data.table timing
datatable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_dt[lookup]
    }
  )
  return(timing[['elapsed']])
})

# Hashtable timing
hashtable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- lookup_hash[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Summary of timing results
summary(native_list_timings)
summary(datatable_timings)
summary(hashtable_timings)
summary(environment_timings)

These were the results:

> # Summary of timing results
> summary(native_list_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  35.12   36.20   37.28   37.05   37.71   39.24
> summary(datatable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  49.13   51.51   52.64   52.76   54.39   55.13
> summary(hashtable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1588  0.1857  0.2107  0.2213  0.2409  0.3258
> summary(environment_timings)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.09322  0.09524  0.10680  0.11850  0.13760  0.17140

It appears that the hash lookup is approximately two orders of magnitude faster than either the native list or data.table in this particular scenario.

Update: 2015-12-11 3:00 PM PST

I received feedback from Neal Fultz suggesting the use of the native Environment object. Here is the code and the result I got:

test_lookup_env <- list2env(test_lookup_list)

# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

summary(environment_timings)

> summary(environment_timings)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.09322  0.09524  0.10680  0.11850  0.13760  0.17140

Indeed, it appears that Environment is faster for individual item access in this scenario. Thank you, Neal Fultz, for pointing this method out; I appreciate having a more thorough understanding of the object types available in R. My questions still stand: am I using data.table correctly (I expect so, but I am open to critique), and is there a way to access the rows of a data.table using some kind of pointer magic that would provide faster individual row access?

Clarification: 2015-12-11 3:52 PM PST

There have been some mentions that my access pattern in the inner-most loop of my test is inefficient. I agree. What I am trying to do is emulate as closely as possible the situation that I am dealing with. The loop this actually occurs in does not allow for vectorization, which is why I am not using it. I realize this is not strictly the 'R' way of doing things. The data.table in my code is providing reference information, and I do not necessarily know which row I need until I am inside the loop, which is why I am trying to figure out how to access an individual item as quickly as possible, preferably with the data still stored in a data.table. This is also in part a curiosity question: can it be done? (A contrived sketch of this kind of loop is given after the conclusion below.)

Update 2: 2015-12-11 4:12 PM PST

I received feedback from @jangrorecki that using Sys.time() is an ineffective means of measuring the performance of a function. I have since revised the code to use system.nanotime() per the suggestion. The original code and the timing results above have been updated.

The question still stands: is this the fastest way to do a row lookup of a data.table, and if so, is it possible to create a hash of pointers to the rows for quick lookup? At this point I am most curious how far R can be pushed. As someone who came from C++, this is a fun challenge.

Conclusion

I accepted the answer provided by Neal Fultz because it discussed what I was actually wanting to know. That said, this is not the way data.table was intended to be used, so no one should interpret this to mean data.table is slow; it is actually incredibly fast. This was a very particular use case that I was curious about. My data comes in as a data.table and I wanted to know if I could get quick row access while leaving it as a data.table. I also wanted to compare the data.table access speed with a hash table, which is often used for fast, non-vectorized item lookup.
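To make the access pattern from the clarification concrete, here is a contrived sketch, not my actual code: the rule for deriving the next key is invented purely for illustration, and it reuses the objects from the setup code above.

# Contrived sketch of the non-vectorizable pattern: each key is only known
# after the previous row has been read, so the lookups cannot be batched.
current_id <- product_ids[1]
for (i in 1:100) {
  row <- test_lookup_dt[current_id]   # single-row lookup inside the loop
  # Pretend the values just read determine the next key (made-up rule):
  next_index <- (floor(abs(row$val_1) * 1000) %% length(product_ids)) + 1
  current_id <- product_ids[next_index]
}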
Solution

For a non-vectorized access pattern, you might want to try the builtin environment objects:

require(microbenchmark)

test_lookup_env <- list2env(test_lookup_list)

x <- lookup_tests[[1]][1]
microbenchmark(
  lookup_hash[[x]],
  test_lookup_list[[x]],
  test_lookup_dt[x],
  test_lookup_env[[x]]
)

Here you can see it's even zippier than hash:

Unit: microseconds
                  expr      min        lq       mean    median        uq      max neval
      lookup_hash[[x]]   10.767   12.9070   22.67245   23.2915   26.1710   68.654   100
 test_lookup_list[[x]]  847.700  853.2545  887.55680  863.0060  893.8925 1369.395   100
     test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273   100
  test_lookup_env[[x]]    1.588    1.9450    4.61595    2.5255    6.6430   27.977   100

EDIT:

Stepping through data.table:::`[.data.table` is instructive as to why you are seeing the data.table slowdown. When you index with a character and there is a key set, it does quite a bit of bookkeeping, then drops down into bmerge, which is a binary search. Binary search is O(log n) and will get slower as n increases.

Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.

To work around this, you can manually build a map and index through it:

x <- lookup_tests[[2]][2]

e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))

# example access:
test_lookup_dt[e[[x]], ]

However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

test_lookup_df <- as.data.frame(test_lookup_dt)

rownames(test_lookup_df) <- test_lookup_df$product_id

If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.

Here are some more timings (from a different machine than above):

> microbenchmark(
+   test_lookup_dt[x,],
+   test_lookup_dt[x],
+   test_lookup_dt[e[[x]],],
+   test_lookup_df[x,],
+   test_lookup_df[e[[x]],],
+   lapply(test_lookup_df, `[`, e[[x]]),
+   lapply(test_lookup_dt, `[`, e[[x]]),
+   lookup_hash[[x]]
+ )
Unit: microseconds
                                expr       min         lq        mean     median         uq       max neval
                 test_lookup_dt[x, ]  1658.585  1688.9495  1992.57340  1758.4085  2466.7120  2895.592   100
                   test_lookup_dt[x]  1652.181  1695.1660  2019.12934  1764.8710  2487.9910  2934.832   100
            test_lookup_dt[e[[x]], ]  1040.869  1123.0320  1356.49050  1280.6670  1390.1075  2247.503   100
                 test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080   100
            test_lookup_df[e[[x]], ]   128.749   151.0940   190.74834   174.1320   218.6080   366.122   100
 lapply(test_lookup_df, `[`, e[[x]])    18.913    25.0925    44.53464    35.2175    53.6835   146.944   100
 lapply(test_lookup_dt, `[`, e[[x]])    37.483    50.4990    94.87546    81.2200   124.1325   241.637   100
                    lookup_hash[[x]]     6.534    15.3085    39.88912    49.8245    55.5680   145.552   100

Overall, to answer your questions, you are not using data.table "wrong", but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.
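For contrast with the single-item loop, here is a minimal sketch of the vectorized access referred to above, reusing the objects from the question's setup code; the per-call overhead is then paid once for the whole set of keys rather than once per key.

# Vectorized keyed lookup: all 10,000 sampled keys are resolved in a single
# call, returning one data.table containing the matching rows.
lookups  <- lookup_tests[[1]]
all_rows <- test_lookup_dt[lookups]   # keyed join on product_id, one call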