This article looks at how to deal with R's jsonlite stream_out function producing an incomplete/truncated JSON file; the question and recommended answer below may be useful to anyone hitting the same problem.

Problem Description

I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data first in chunks without loading it, write the subset data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain with further detail.

What I'm attempting:

I have written my code like this (following an example from the documentation):

library(jsonlite)

# Open a writable connection to a temporary file; the handler filters each
# chunk and streams the filtered rows out to that connection.
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))

As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.

The Problem

This procedure runs without any problems until I try to read the result with myData <- stream_in(file(tmp)), at which point I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:

{"Var1":"some data","Var2":3,"Var3":"some othe

I then have to manually remove that last line, after which the file loads without issue.

Solutions I've tried

  1. I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing? (A sketch probing this idea follows this list.)

  2. I inserted a print call to show the tail() end of the data.frame for every chunk inside the handler function, to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.

  3. I've tried playing around with the pagesize argument, including trying very large numbers, no number, and Inf. Nothing has worked.
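
One hedged way to probe the hypothesis in item 1 (my own sketch, not something tried in the original question): R buffers writes to an open connection, so rows still sitting in that buffer never reach the file until the connection is flushed or closed. Closing con_out explicitly after the streaming pass, before the temp file is read back, would confirm or rule out buffering as the cause; if stream_out really has closed the connection already, the close() call itself will error, which is also informative.

library(jsonlite)

con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
close(con_out)   # flush any buffered output before re-reading the temp file
myData <- stream_in(file(tmp))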

I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
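
For context, ndjson (newline-delimited JSON) stores one minified JSON record per line, so the file looks something like this (illustrative values, not the actual data):

{"Var1":"some data","Var2":3,"Var3":"other data"}
{"Var1":"more data","Var2":7,"Var3":"still more"}

Every line must be a complete JSON object, which is why stream_in errors out when the final line of the intermediate file is cut off mid-record.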

System Info

I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II 4-Core 2.6 GHz.

Remedy

I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
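
A hedged sketch of automating that manual correction (my own illustration, not part of the question or the recommended answer): drop every line of the intermediate file that does not validate as JSON before streaming it back in. This still loses the truncated record, exactly like the manual fix, but it removes the manual step. It assumes the temp file path is still held in tmp, as in the code above.

library(jsonlite)

# Keep only the lines that parse as complete JSON records, then re-read.
lines <- readLines(tmp, warn = FALSE)
ok <- vapply(lines, function(x) isTRUE(validate(x)), logical(1))
writeLines(lines[ok], tmp)
myData <- stream_in(file(tmp))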

I really appreciate any help with this; thank you.

Recommended Answer

I believe this does what you want; it is not necessary to do the extra stream_out/stream_in.

library(jsonlite)
library(dplyr)   # for %>% and bind_rows()

# Accumulate each filtered chunk in an environment instead of writing it
# back out to disk, then combine all the chunks at the end.
myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()

(I created some mock data in Mockaroo: I generated 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used was even IDs because I was too lazy to create a Var column.)
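
If you would rather not pull mock data from Mockaroo, a small ndjson test file can be generated locally with stream_out itself; this is a hypothetical stand-in for MOCK_DATA.json, assuming an integer id column like the one the filter above expects.

library(jsonlite)

# Write 1000 one-line JSON records so that pagesize = 200 forces the handler
# to run over several chunks, mirroring the test described above.
test_df <- data.frame(id = 1:1000, Var = rnorm(1000))
stream_out(test_df, file("MOCK_DATA.json"))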

This concludes the article on R: jsonlite's stream_out function producing an incomplete/truncated JSON file; hopefully the recommended answer above is helpful.
