
Question


I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored across 4 lines; one file line stores the data for one column. Is it possible to create a working parallel extractor for such large files? I am afraid that the data for one row will end up in different extents after the file is split.

例如:

...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...


Sorry for my English.

Answer


U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.


Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way, though, the extractor UDO model would not know whether the 4 lines that make up one of your rows are all inside the same extent or split across extents.

所以,你有两个选择:



  1. Mark the extractor as operating on the whole file by adding the following line before the extractor class:

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]


Now the extractor will see the full file, but you lose the scale-out of the file processing.
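Inside such an atomic extractor, the core logic is simply grouping every 4 consecutive lines into one logical row (one line per column). Here is a minimal sketch of that grouping, written in Python purely for illustration; a real U-SQL extractor would implement this in C# against the IExtractor interface, and the helper name is my own:

```python
def extract_rows(lines, lines_per_row=4):
    """Group consecutive file lines into logical rows, one column value per line.

    This mirrors what an AtomicFileProcessing extractor would do while
    reading the whole file sequentially.
    """
    row = []
    for line in lines:
        row.append(line.rstrip("\n"))
        if len(row) == lines_per_row:
            yield tuple(row)   # one complete logical row
            row = []
    if row:
        # The file ended mid-row; a real extractor should flag this.
        raise ValueError("line count is not a multiple of lines_per_row")

# Example: two logical rows stored as eight file lines.
rows = list(extract_rows(["a\n", "b\n", "c\n", "d\n",
                          "e\n", "f\n", "g\n", "h\n"]))
# rows == [("a", "b", "c", "d"), ("e", "f", "g", "h")]
```

Because the extractor sees the entire file, no row can straddle a split boundary; the trade-off, as noted above, is that the extraction itself is no longer parallel.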


  2. You extract one row per line and use a U-SQL statement (e.g. using window functions or a custom reducer) to merge the rows into a single row.

