
Question

What kinds of file formats can be read using Pig?

How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?

Recommended answer

There are a few built-in loading and storing functions, but they are limited:

  • BinStorage - "binary" storage
  • PigStorage - loads and stores data delimited by some character (e.g., a tab or a comma)
  • TextLoader - loads data line by line (i.e., delimited by newlines)

Piggybank is a library of community-contributed user-defined functions. It has a number of loading and storing functions, including an XML loader, but no XML storer.

"Say we have a CSV file and I want to store it as an MXL file; how can this be done?"

I assume you mean XML here. Storing XML is a bit rough in Hadoop because the output is split into one file per reducer (or per mapper in a map-only job), so how do you know where to put the root tag? Producing well-formed XML likely needs some sort of post-processing step.
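One possible shape for that post-processing step is sketched below in Python. It assumes the part files have already been copied to a local directory, that each line in them is one self-contained XML fragment, and that a single root element is all that is missing. The function name `merge_parts_to_xml` and the root tag `items` are made up for illustration.

```python
import glob
import os


def merge_parts_to_xml(parts_dir, out_path, root_tag="items"):
    """Concatenate part-* files into one document under a single root tag.

    Returns the number of part files that were merged.
    """
    # Sorting keeps part-m-00000, part-m-00001, ... in a stable order.
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write("<%s>\n" % root_tag)
        for path in part_files:
            with open(path, encoding="utf-8") as part:
                for line in part:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write("  %s\n" % line)
        out.write("</%s>\n" % root_tag)
    return len(part_files)
```

The result is only well-formed if every input line is itself a balanced, properly escaped fragment; the script does not check this.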

One thing you can do is write a UDF that converts your columns into an XML string:

B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);

For example, say col1, col2, and col3 are "foo", 37, and "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
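The core of such a UDF might look like the sketch below. Pig UDFs are most often written in Java; this version uses Python, which Pig can register through Jython. The function name `data_to_xml`, the tag names, and the treatment of nulls as empty elements are assumptions for illustration. The `outputSchema` decorator is expected to be supplied by Pig when the script is registered, and the fallback definition only exists so the file also runs on its own.

```python
from xml.sax.saxutils import escape

try:
    outputSchema  # expected to be provided by Pig for Jython UDF scripts
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the module can be run outside Pig."""
        def wrap(func):
            return func
        return wrap


@outputSchema("xml:chararray")
def data_to_xml(name, num, fruit):
    """Render three column values as one <item> XML fragment."""
    fields = (("name", name), ("num", num), ("fruit", fruit))
    parts = []
    for tag, value in fields:
        # Assumption: a null column becomes an empty element.
        text = "" if value is None else str(value)
        # escape() replaces &, < and > so the fragment stays well-formed.
        parts.append("<%s>%s</%s>" % (tag, escape(text), tag))
    return "<item>%s</item>" % "".join(parts)
```

Called with "foo", 37, and "lemons", the function returns the string from the example above. Escaping matters here: a value such as "a<b" would otherwise break the fragment.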

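"Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?"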

You can't make STORE name the output file something other than part-m-00000. That's just how Hadoop works. If you want a different name, rename the file after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
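The answer suggests a bash wrapper; the same sequence is sketched here in Python. The script name myscript.pig and the helper names `build_commands` and `run_all` are hypothetical. The sketch also removes the old output directory first, on the assumption that overwriting is wanted, since STORE refuses to write into a directory that already exists. It assumes the job produces a single part-m-00000 file.

```python
import subprocess


def build_commands(pig_script, output_dir, final_path, part_name="part-m-00000"):
    """Return the commands to run, in order, as argv lists."""
    return [
        # Remove any previous output; -f keeps this quiet if it is absent.
        ["hadoop", "fs", "-rm", "-r", "-f", output_dir],
        # Run the Pig script, which STOREs into output_dir.
        ["pig", pig_script],
        # Give the single part file its final name.
        ["hadoop", "fs", "-mv", output_dir + "/" + part_name, final_path],
    ]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in turn; check_call raises on the first failure."""
    for argv in commands:
        runner(argv)


# Example use on a machine with pig and hadoop on the PATH:
# run_all(build_commands("myscript.pig", "output", "newoutput/myoutputfile"))
```

Keeping command construction separate from execution makes the sequence easy to inspect or dry-run before touching HDFS. A job with several mappers or any reducers writes more than one part file (or part-r-* names), so the rename step would need adjusting in that case.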


08-10 22:45