This article covers how to handle data with no space between separators when using fread in R.

Problem description

I am reading a large .txt file (>1GB) into R via fread. I read the file directly from a .zip archive, via a bash command:

base = fread('unzip -p Folder.zip File.txt', sep = '|', header = FALSE, 
stringsAsFactors = FALSE, na.strings="", quote = "", col.names = col_namesMain)

The text file separates entries with |, so a typical line might look like:

RRX|||02020||333293||||12123

However, there are many places where empty entries are denoted by separators with no space between them, e.g. || in the example line above.

When using fread, these adjacent separators are typically read in together as part of a field, so that the above line returns the following entries:

RRX, ||02020|, 333293|||, 12123

when it should instead be read as:

RRX, NA, NA, 02020, NA, 333293, NA, NA, NA, 12123
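
As a quick sanity check of the expected result, the example line can be split with base R. This is only a sketch of the desired field layout on a single clean string; it does not use fread, and the variable names are illustrative.

```r
# The example line from the question, as a plain string.
line <- "RRX|||02020||333293||||12123"

# Split on the literal "|" character (fixed = TRUE avoids regex interpretation).
fields <- strsplit(line, "|", fixed = TRUE)[[1]]

# Treat empty fields as missing, mirroring na.strings = "" in the fread call.
fields[fields == ""] <- NA

print(fields)
```

Each pair of adjacent separators yields one empty field, which is why the line is expected to produce ten entries rather than four.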

I have tried using read.table with the option skipNul = TRUE, and this works perfectly. However, there doesn't seem to be any option similar to skipNul for fread. I would much prefer to use fread over read.table if possible, since I have several very large files. Despite searching, I haven't come across much discussion of this problem. Any help is much appreciated.
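
Since skipNul = TRUE makes read.table work, the file probably contains embedded NUL (\0) bytes between the separators. The sketch below builds a tiny sample file with that byte layout (an assumption for illustration, not the asker's real data), counts the NUL bytes, and reads the file with read.table and skipNul = TRUE.

```r
# Sample bytes: "RRX|<NUL>|<NUL>|02020\n". The layout is assumed for illustration.
sample_bytes <- c(
  charToRaw("RRX|"), as.raw(0),
  charToRaw("|"), as.raw(0),
  charToRaw("|02020\n")
)

path <- tempfile(fileext = ".txt")
writeBin(sample_bytes, path)

# Read the file back as raw bytes and count the embedded NULs.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
nul_count <- sum(raw_bytes == as.raw(0))

# read.table drops the NULs when skipNul = TRUE, leaving empty fields.
df <- read.table(path, sep = "|", header = FALSE, skipNul = TRUE,
                 quote = "", na.strings = "", colClasses = "character")

print(nul_count)
print(df)
```

On a multi-gigabyte file it would be more practical to run the same check on only the first few thousand bytes by lowering n in readBin.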

Recommended answer

This was fixed in the development version 1.12.3 of data.table on 15 Apr 2019 (see NEWS):


  1. fread() now skips embedded NUL (\0), #3400. Thanks to Marcus Davy for reporting with examples, and Roy Storey for the initial PR.
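
For installations that predate this fix, one possible workaround is to delete the NUL bytes in the shell pipeline before fread sees the data. The sketch below is not from the answer: build_unzip_cmd is a hypothetical helper, and the approach assumes a Unix-like shell where unzip and tr are available and that the NULs carry no information.

```r
# Hypothetical helper (not part of data.table): builds the shell command for fread.
build_unzip_cmd <- function(zip_path, member, strip_nul) {
  cmd <- paste("unzip -p", shQuote(zip_path), shQuote(member))
  if (strip_nul) {
    # tr -d '\000' deletes every NUL byte from the stream.
    cmd <- paste(cmd, "| tr -d '\\000'")
  }
  cmd
}

# TRUE when data.table is installed at version 1.12.3 or later.
has_nul_fix <- requireNamespace("data.table", quietly = TRUE) &&
  utils::packageVersion("data.table") >= "1.12.3"

# Only strip NULs in the shell when the installed fread lacks the fix.
cmd <- build_unzip_cmd("Folder.zip", "File.txt", strip_nul = !has_nul_fix)
print(cmd)

# Not run here because Folder.zip is the asker's file:
# base <- data.table::fread(cmd = cmd, sep = "|", header = FALSE,
#                           na.strings = "", quote = "")
```

The cmd argument of fread is available in recent data.table releases; on an older version the command string can be passed as the first argument, as in the question.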


