Question

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoization of calculations. However, others have beaten me to the punch and have packages for memoization. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.

Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?

As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

The options appear to be:

  • digest - provides hashing for arbitrary R objects.
  • memoise - a very simple tool for memoization of functions.
  • R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
  • hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.

These are basic options for external storage of R objects.

  • cacher - this seems to be more akin to checkpointing.
  • CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
  • DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
  • Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
  • The data.table package supports rapid lookups of elements in a data table.
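Of the base R options above, an environment comes closest to a Perl-style hash or a Python dictionary. The following is a minimal sketch using base R only; the variable names (h, snapshot) are illustrative and do not come from any of the packages listed.

```r
# Base R only: an environment used as a mutable key/value store.
h <- new.env(hash = TRUE, parent = emptyenv())

# Insert or overwrite values by string key.
h[["apple"]] <- 3
assign("banana", 5, envir = h)

# Look up a value, test for a key, and list the keys.
apple_value <- h[["apple"]]
has_cherry  <- exists("cherry", envir = h, inherits = FALSE)
all_keys    <- ls(h)

# Delete a key.
rm("banana", envir = h)

# Convert to a named list when a snapshot is needed.
snapshot <- as.list(h)
```

Unlike a named list, an environment is modified in place, so it can be updated inside a function without copying. One limitation to keep in mind: keys must be non-empty strings.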

Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

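  1. Caching: simple counting of strings. [Note: this isn't for NLP but for general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
  2. Memoization of monstrous calculations.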
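The first use case can be sketched with base R alone, assuming the strings arrive one per line on a connection. The function name count_strings and the chunk_size parameter are hypothetical, and the small textConnection stands in for a real file.

```r
# Count strings incrementally, reading the input in chunks
# instead of loading the whole set into memory first.
count_strings <- function(con, chunk_size = 2L) {
  counts <- new.env(hash = TRUE, parent = emptyenv())
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break        # end of input
    # Empty strings cannot be used as environment keys, so skip them.
    for (s in lines[nzchar(lines)]) {
      old <- counts[[s]]                  # NULL when the key is not present yet
      counts[[s]] <- if (is.null(old)) 1L else old + 1L
    }
  }
  unlist(as.list(counts))                 # named integer vector, arbitrary order
}

con <- textConnection(c("foo", "bar", "foo", "baz", "foo"))
string_counts <- count_strings(con)
close(con)
```

Here string_counts[["foo"]] holds the number of times "foo" was seen. For a real file, file("path", open = "r") would replace the textConnection.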

These really arise because I'm digging in to the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
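One way to check whether memoization would pay off before committing to it is to key each recorded call by its arguments and count the repeats. This is only a sketch: the calls list is made-up example data, make_key is a hypothetical helper, and deparse() is a base R stand-in for a real hash such as the one the digest package provides.

```r
# Turn an argument list into a string key. deparse() is a base R stand-in;
# a fixed-length hash from the digest package could be used instead.
make_key <- function(...) paste(deparse(list(...)), collapse = "")

# Hypothetical log of the inputs the slow function was called with.
calls <- list(list(2.3, 3.0), list(2.3, 3.5), list(2.3, 3.0), list(2.3, 3.0))

keys <- vapply(calls, function(args) do.call(make_key, args), character(1))

# Calls that repeat an earlier key are the ones a cache could have answered.
potential_hits <- sum(duplicated(keys))
hit_rate <- potential_hits / length(keys)
```

A hit rate near zero suggests that memoization would add overhead without saving any work.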

Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)

  • Dirk Eddelbuettel: digest - a lot of other packages depend on this.
  • Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
  • Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
  • Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

Note 3: Some people use memoise/memoisation, others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".

Recommended answer

I did not have luck with memoise, because it produced a "too deep recursion" problem with some functions of a package I tried it with. With R.cache I had better luck. The following is code I adapted from the R.cache documentation, with more annotations. It shows the different options for caching.

# Workaround to avoid the interactive question when loading the R.cache library
dir.create(path="~/.Rcache", showWarnings=FALSE)
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache in the current working directory
# The cache root path, in case it is needed (not used in this example)
cache.root <- getCacheRootPath()
simulate <- function(mean, sd) {
    # 1. Try to load cached data, if already generated
    key <- list(mean, sd)
    data <- loadCache(key)
    if (!is.null(data)) {
        cat("Loaded cached data\n")
        return(data)
    }
    # 2. If not available, generate it.
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok\n")
    saveCache(data, key=key, comment="simulate()")
    data
}
data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a <- 2.3
b <- 3.0
data <- simulate(a, b) # Will load cached data, params are checked by value
# Clean up
file.remove(findCache(key=list(2.3,3.0)))
file.remove(findCache(key=list(2.3,3.5)))

simulate2 <- function(mean, sd) {
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("Done generating data from scratch\n")
    data
}
# Easy way to memoize a function.
# This works with any function, including functions from external packages.
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data
# It is also possible to reassign the function name,
# but different memoizations of the same
# function will return the same cached result
# if the input parameters are the same.
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)

# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
    for (kk in 1:3) {
        cat(sprintf("Iteration #%d:\n", kk))
        res <- evalWithMemoization({
            cat("Evaluating expression...")
            a <- kk
            Sys.sleep(1)
            cat("done\n")
            a
        }, key=list(kk=kk))
        # On the repeated run (ii == 2) the expression is not evaluated again
        print(res)
        # Sanity checks
        stopifnot(a == kk)
        # Clean up
        rm(a)
    } # for (kk ...)
} # for (ii ...)
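
For comparison, the idea behind wrapping a function the way addMemoization does can be sketched in base R with an in-memory environment as the cache. This is not how R.cache is implemented (R.cache stores its results on disk); memoize_in_memory, slow_square and n_calls are hypothetical names used only for illustration.

```r
# Wrap a function so that repeated calls with the same arguments
# are answered from an in-memory cache (lost when the session ends).
memoize_in_memory <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  function(...) {
    # Assumption: the arguments deparse to a short, unique string.
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache, inherits = FALSE))
    }
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }
}

n_calls <- 0L
slow_square <- function(x) {
  n_calls <<- n_calls + 1L   # count the real evaluations
  x^2
}
fast_square <- memoize_in_memory(slow_square)

r1 <- fast_square(4)   # computed
r2 <- fast_square(4)   # answered from the cache
r3 <- fast_square(5)   # computed
```

After these three calls n_calls is 2, because the second call never reaches slow_square. For large arguments, a fixed-length hash (for example from the digest package) would be a safer key than the deparsed text.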

