This article looks at how to collect data from stdin in chunks in Python; the approach may be a useful reference for anyone facing the same problem.

Problem Description



I have the following Python code where I collect data from standard input into a list and run syntaxnet on it. The data is in the form of JSON objects, from which I will extract the text field and feed it to syntaxnet.

import sys

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)    # run_syntaxnet is a function defined elsewhere

I am doing this because I do not want to run Syntaxnet on every single tweet individually, since that would take a very long time and hurt performance.

Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run Syntaxnet on each chunk. Can someone help me with how to do this?

Also, I want to understand what the maximum safe length of the list data is, so that I do not run out of memory.

EDIT:

I used the code:

import sys

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    # run_syntaxnet is a function defined elsewhere
        data = []

which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.

For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the remaining 2000 are left out since the condition len(data) == 10000 is never met again.

I want to do something like:

if len(data) == 10000 or 'EOF of input file is reached':    # pseudocode, not valid logic
    run_syntaxnet(data)

Can someone tell me how to check for EOF on the input? Thanks in advance!

PS: All the data to the Python script comes from Pig Streaming. Also, I cannot afford to actually count the number of rows in the input data and send it as a parameter, since I have millions of rows and the counting itself would take forever.

Solution

I would gather the data into chunks and process those chunks when they get "large":

import sys

LARGE_DATA = 10  # demo chunk size; in practice use something like 10000

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) >= LARGE_DATA:   # chunk is full: process it and start fresh
        run_syntaxnet(data)
        data = []
if data:  # the for loop ends at EOF, so flush whatever lines remain
    run_syntaxnet(data)
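
An alternative that avoids the manual counter is to batch the input with itertools.islice, which pulls up to a fixed number of lines per call and yields nothing once EOF is reached, so no explicit EOF check is needed. This is a minimal sketch rather than code from the original answer; run_syntaxnet is assumed to be the asker's existing function, and CHUNK_SIZE is an illustrative value:

import sys
from itertools import islice

CHUNK_SIZE = 10000  # illustrative; pick a size that fits comfortably in memory

def read_chunks(stream, size):
    # Yield lists of up to `size` lines until the stream is exhausted.
    while True:
        chunk = list(islice(stream, size))
        if not chunk:  # islice produces an empty batch after EOF
            return
        yield chunk

for chunk in read_chunks(sys.stdin, CHUNK_SIZE):
    run_syntaxnet(chunk)  # assumed: the asker's batch-processing function

With a 12000-line input this processes one chunk of 10000 lines and then a final chunk of 2000, which is exactly the leftover case the asker was worried about. It also bounds memory use to roughly CHUNK_SIZE times the average line size, which addresses the question of how long the list can safely grow.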

