Problem Description
I have the following Python code where I collect data from standard input into a list and run syntaxnet on it. The data is in the form of json objects from which I will extract the text field and feed it to syntaxnet.
data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)  ## This is a function ##
I am doing this because I do not want Syntaxnet to run for every single tweet since it will take a very long time and hence decrease performance.
Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run Syntaxnet on them. Can someone help me with how to do this?
Also, I want to understand what the maximum length of the list data can be, so that I do not run out of memory.

EDIT:
I used the code:
data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  ## This is a function ##
        data = []
which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.
For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the next 2000 are left off since the condition len(data) > 10000 is not met.

I want to do something like:
if len(data) > 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)
Can someone tell me how to check for the EOF of input file? Thanks in advance!
PS: All the data to the Python file comes from Pig Streaming. Also, I cannot afford to actually count the number of rows in the input data and send it as a parameter, since I have millions of rows and the counting itself would take forever.
Solution

I would gather the data into chunks and process those chunks when they get "large":
LARGE_DATA = 10

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) > LARGE_DATA:
        run_syntaxnet(data)
        data = []

run_syntaxnet(data)
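On the EOF question: iterating over sys.stdin with a for loop already stops when the input is exhausted, so the trailing run_syntaxnet(data) call above flushes whatever rows are left over after the last full chunk (e.g. the final 2000 of 12000); you may want to guard it with if data: so it is not called on an empty list. Below is a minimal alternative sketch of the same chunking idea using itertools.islice; it assumes, as in the question, that run_syntaxnet accepts a list of lines.

import sys
from itertools import islice

CHUNK_SIZE = 10000   # number of tweets per batch; adjust to what fits in memory

def read_chunks(stream, size):
    # Yield lists of up to `size` lines; the final chunk may be shorter.
    while True:
        chunk = list(islice(stream, size))
        if not chunk:        # islice returns an empty list once EOF is reached
            return
        yield chunk

for data in read_chunks(sys.stdin, CHUNK_SIZE):
    run_syntaxnet(data)      # the last, possibly smaller, chunk is processed too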