Problem Description
Let's say I'm opening a large (several GB) file where I cannot read in the entire file at once.
If it's a csv file, we would use:
for chunk in pd.read_csv('path/filename', chunksize=10**7):
# save chunk to disk
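For instance, a minimal sketch of that pattern, where each chunk is appended to a single output file (the out.csv name and the append-per-chunk strategy are illustrative assumptions, not from the original question):

import pandas as pd

# 'path/filename' is the input from the question; 'out.csv' is a hypothetical output.
for i, chunk in enumerate(pd.read_csv('path/filename', chunksize=10**7)):
    # Only one chunk is in memory at a time. Write the header for the
    # first chunk only, then append the rest.
    chunk.to_csv('out.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)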
Or we could do something similar with pandas:
import pandas as pd
with open(fn) as file:
    for line in file:
        # save line to disk, e.g. df = pd.concat([df, line_data]), then save the df
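A runnable sketch of that line-by-line variant, buffering rows so a DataFrame is built and flushed every batch_size lines rather than concatenated one row at a time (fn, out, batch_size, and the comma-separated parsing are all illustrative assumptions):

import pandas as pd

fn = 'path/filename'   # hypothetical input path
out = 'out.csv'        # hypothetical output path
batch = []             # lines accumulated since the last flush
batch_size = 10**5     # hypothetical number of lines per flush

with open(fn) as file:
    for line in file:
        batch.append(line.rstrip('\n').split(','))  # assumes comma-separated fields
        if len(batch) == batch_size:
            pd.DataFrame(batch).to_csv(out, mode='a', header=False, index=False)
            batch = []
if batch:  # flush whatever is left after the loop
    pd.DataFrame(batch).to_csv(out, mode='a', header=False, index=False)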
How does one "chunk" data with an awk script? Awk will parse/process text into a format you desire, but I don't know how to "chunk" with awk. One can write a script script1.awk and then process the data, but this processes the entire file at once.
Related question, with a more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?
awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can specify what constitutes a record using the RS
(record separator) variable. Each code block is conditionally executed on the current record before the next is read:
$ awk '/pattern/{print "MATCHED", $0 > "output"}' file
The above script will read a line at a time from the input file, and if that line matches pattern
it will save the line to the file output, prepended with MATCHED
, before reading the next line.
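If "chunk" means a fixed number of records rather than a single one, the same record-at-a-time model still works: keep a counter and switch output files every N records. A sketch, assuming 1000-line chunks and a hypothetical chunk_ output-file prefix:

$ awk 'NR % 1000 == 1 {if (out) close(out); out = "chunk_" (++n)} {print > out}' file

This writes lines 1-1000 to chunk_1, lines 1001-2000 to chunk_2, and so on; close(out) releases each finished chunk's file handle so awk does not run out of open file descriptors when the input produces many chunks.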