This article looks at how to collect data from stdin in chunks in Python; the approach may be a useful reference for anyone facing the same problem.

Problem Description



I have the following Python code where I collect data from standard input into a list and run syntaxnet on it. The data is in the form of JSON objects, from which I will extract the text field and feed it to syntaxnet.

import sys

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)    # run_syntaxnet is a function defined elsewhere

I am doing this because I do not want to run Syntaxnet on every single tweet individually, since that would take a very long time and hurt performance.

Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run Syntaxnet on each chunk. Can someone help me with how to do this?

Also, I want to understand what the maximum safe length of the list data is, so that I do not run out of memory.

EDIT:

I used the code:

import sys

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    # run_syntaxnet is a function defined elsewhere
        data = []

which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.

For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the remaining 2000 are left out since the condition len(data) == 10000 is never met again.

I want to do something like:

if len(data) == 10000 or 'EOF of input file is reached':    # pseudocode, not valid logic
    run_syntaxnet(data)

Can someone tell me how to check for EOF on the input? Thanks in advance!

PS: All the data to the Python script comes from Pig Streaming. Also, I cannot afford to actually count the number of rows in the input data and send it as a parameter, since I have millions of rows and the counting itself would take forever.

Solution

I would gather the data into chunks and process those chunks when they get "large":

import sys

LARGE_DATA = 10  # demo chunk size; in practice use something like 10000

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) >= LARGE_DATA:   # chunk is full: process it and start fresh
        run_syntaxnet(data)
        data = []
if data:  # the for loop ends at EOF, so flush whatever lines remain
    run_syntaxnet(data)
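
An alternative that avoids the manual counter is to batch the input with itertools.islice, which pulls up to a fixed number of lines per call and yields nothing once EOF is reached, so no explicit EOF check is needed. This is a minimal sketch rather than code from the original answer; run_syntaxnet is assumed to be the asker's existing function, and CHUNK_SIZE is an illustrative value:

import sys
from itertools import islice

CHUNK_SIZE = 10000  # illustrative; pick a size that fits comfortably in memory

def read_chunks(stream, size):
    # Yield lists of up to `size` lines until the stream is exhausted.
    while True:
        chunk = list(islice(stream, size))
        if not chunk:  # islice produces an empty batch after EOF
            return
        yield chunk

for chunk in read_chunks(sys.stdin, CHUNK_SIZE):
    run_syntaxnet(chunk)  # assumed: the asker's batch-processing function

With a 12000-line input this processes one chunk of 10000 lines and then a final chunk of 2000, which is exactly the leftover case the asker was worried about. It also bounds memory use to roughly CHUNK_SIZE times the average line size, which addresses the question of how long the list can safely grow.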

