Fast way to download a very large (14,000,000 row) csv from a zip file? unzip, read_csv and read.csv never finish loading

Problem description

I am trying to download the dataset at the link below. It is about 14,000,000 rows long. I ran this code chunk and got stuck at unzip(). The code has been running for a really long time and my computer is hot.

I tried a few different ways that don't use unzip, and then I get stuck at the read.csv/vroom/read_csv step. Any ideas? This is a public dataset, so anyone can try it.

library(vroom)

# download the zip to a temporary file
temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)

# extract the csv into the working directory (this is the step that hangs)
unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")

# read the extracted csv
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")

unlink(temp)

Recommended answer

I was able to download the file to my computer first, then use vroom (https://vroom.r-lib.org/) to load it without unzipping it:

library(vroom)

# read the zip directly; no separate unzip step is needed
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")

I get a warning about possible truncation, but the object has these dimensions:

> dim(df2017)
[1] 5448288      78
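
Given that warning (the question mentions roughly 14,000,000 rows, while dim() reports about 5.4 million), it may be worth counting the rows independently. A minimal sketch using base R's unz() connection to stream the csv out of the zip without extracting it; it assumes the downloaded zip sits in the working directory under its original name:

con <- unz("hmda_2017_nationwide_all-records_labels.zip",
           "hmda_2017_nationwide_all-records_labels.csv")
open(con, "r")
n <- 0L
repeat {
  chunk <- readLines(con, n = 1e6L)  # stream one million lines at a time
  if (length(chunk) == 0L) break
  n <- n + length(chunk)
}
close(con)
n - 1L  # data rows, excluding the header line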

One nice thing about vroom is that it doesn't load the data straight into memory.
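
Building on that lazy loading, vroom's col_select argument can also restrict parsing to just the columns you need, which keeps memory use down on a file this size. A minimal sketch, assuming these HMDA column names (not confirmed in the original answer; check them against the actual header):

library(vroom)

# parse only a subset of columns; the names below are assumptions
# based on the HMDA labels file and may need adjusting
df_small <- vroom(
  "hmda_2017_nationwide_all-records_labels.zip",
  col_select = c(action_taken_name, loan_amount_000s, state_abbr)
)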
