


I'm trying to load a very large JSON file into R. The file is too big to fit into memory on my machine, so I have been using the jsonlite package's stream_in/stream_out functions. With these functions, I can subset the data in chunks without loading all of it, write the subset to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediate JSON file is getting truncated (if that's the right term) while being written with stream_out. Details follow.



I have written my code like this (following an example from documentation):

con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
      df <- df[which(df$Var > 0), ]
      stream_out(df, con_out, pagesize = 1000)
    }, pagesize = 5000)
myData <- stream_in(file(tmp))


As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.


This procedure runs without any problems until I try to read the file back with myData <- stream_in(file(tmp)), at which point I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:

{"Var1":"some data","Var2":3,"Var3":"some othe


I then have to manually remove that last line after which the file loads without issue.


I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?

I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.


I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.

I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.


I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and a 4-core AMD Athlon II at 2.6 GHz.



I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.


I really appreciate any help with this; thank you.



I believe this does what you want; the extra stream_out/stream_in step is not necessary.

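library(jsonlite)
library(dplyr) ## provides %>% and bind_rows()

myData <- new.env() ## environment that collects the filtered chunks
stream_in(file("MOCK_DATA.json"), handler = function(df){
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
## as.list() on an environment does not guarantee chunk order
myData <- myData %>% as.list() %>% bind_rows()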

(I created some mock data in Mockaroo: I generated 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used was even IDs because I was too lazy to create a Var column.)
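
If the intermediate file is still wanted, a likely fix for the truncation described in the question is to close con_out before reading the file back. Because the connection is opened by hand, stream_out does not close it, so the last buffered lines may never be flushed to disk. Below is a minimal sketch of that idea, assuming a small mock NDJSON input; the src file and the mock data frame are hypothetical stand-ins for the real data, and the pagesize values are arbitrary.

```r
library(jsonlite)

# Hypothetical stand-in for the large input: a small NDJSON file with a Var column.
src <- tempfile(fileext = ".json")
mock <- data.frame(id = 1:1000, Var = rep(c(-1, 1), 500))
stream_out(mock, file(src), verbose = FALSE)

# Open the output connection once so every chunk is appended to the same file.
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file(src), handler = function(df) {
  df <- df[which(df$Var > 0), ]            # same kind of filter as in the question
  stream_out(df, con_out, pagesize = 1000, verbose = FALSE)
}, pagesize = 200, verbose = FALSE)

# Close the connection so any buffered output is flushed before reading it back.
close(con_out)

myData <- stream_in(file(tmp), verbose = FALSE)
nrow(myData)
```

With the close(con_out) call in place, the last line of the temporary file should be complete, so no manual editing should be needed before the final stream_in.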

