amazon-web-services - Importing compressed (LZO) data from S3 into Hive

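I exported my DynamoDB table to S3 as a backup (via EMR). When exporting, I stored the data as LZO-compressed files. My Hive query is below, but basically I followed "To export an Amazon DynamoDB table to an Amazon S3 bucket using data compression" at http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html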

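Now I want to do the opposite: take my LZO files and get them back into a Hive table. How do you do this? I was expecting to see some Hive configuration property for input, but there isn't one. I've searched on Google and found a few hints, but nothing definitive and nothing that works.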

The files in S3 are in the format: s3://[mybucket]/backup/year=2012/month=08/day=01/000000.lzo

Here is the HQL I use for the export:

SET dynamodb.throughput.read.percent=1.0;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE hiveSBackup (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "${DYNAMOTABLENAME}",
"dynamodb.column.mapping" = "id:id,periodStart:periodStart,allotted:allotted,remaining:remaining,created:created,seconds:seconds,served:served,modified:modified");

CREATE EXTERNAL TABLE s3_export (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
 PARTITIONED BY (year string, month string, day string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://<mybucket>/backup';

INSERT OVERWRITE TABLE s3_export
 PARTITION (year="${PARTITIONYEAR}", month="${PARTITIONMONTH}", day="${PARTITIONDAY}")
 SELECT * from hiveSBackup;

Any ideas on how to get the data from S3, decompress it, and load it into a Hive table?

Best answer

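Hive on EMR can read data natively, directly from S3; you don't need to import anything. You just need to create an external table and tell it where the data is.
It also has LZO support set up. If a file name ends with the .lzo extension, Hive will automatically decompress it with LZO.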

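So, to "import" LZO data from S3 into Hive, you just create an external table that points to the LZO-compressed data in S3, and Hive will decompress it whenever it runs a query against it. This is almost exactly what you did when "exporting" the data. You can also read from that s3_export table.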
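
As a rough sketch of that idea, assuming a fresh cluster where the s3_export table from the question is not yet defined: recreate the external table over the backup location, then register the partition you want to read. The partition values are taken from the example path in the question, and `<mybucket>` remains a placeholder.

```sql
-- Same definition as the export table in the question.
-- No compression-related SET statements are needed for reading.
CREATE EXTERNAL TABLE IF NOT EXISTS s3_export (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
 PARTITIONED BY (year string, month string, day string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://<mybucket>/backup';

-- A newly created table does not know about existing partition
-- directories, so register the one you want to query.
ALTER TABLE s3_export ADD IF NOT EXISTS
 PARTITION (year='2012', month='08', day='01')
 LOCATION 's3://<mybucket>/backup/year=2012/month=08/day=01/';

-- Hive decompresses the .lzo files when the query runs.
SELECT * FROM s3_export
 WHERE year='2012' AND month='08' AND day='01'
 LIMIT 10;
```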

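If you want to import the data into a non-external table, just run an INSERT OVERWRITE into a new table, selecting from the external table.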
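
A minimal sketch of that copy, assuming the s3_export table from the question is defined and its 2012/08/01 partition is registered. The table name restored_backup is hypothetical, chosen only for illustration.

```sql
-- Hypothetical managed (non-external) table with the same data columns.
CREATE TABLE IF NOT EXISTS restored_backup (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string);

-- List the columns explicitly: SELECT * on a partitioned table would
-- also return the year, month, and day partition columns.
INSERT OVERWRITE TABLE restored_backup
 SELECT id, periodStart, allotted, remaining, created, seconds, served, modified
 FROM s3_export
 WHERE year='2012' AND month='08' AND day='01';
```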

Unless I've misunderstood your question, and you meant to ask about importing into DynamoDB rather than just into a Hive table?

This is what I've been doing:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE users
(id int, username string, firstname string, surname string, email string, birth_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://bucket/someusers';

INSERT OVERWRITE TABLE users
SELECT * FROM someothertable;

I end up with a bunch of files with the .lzo extension under s3://bucket/someusers, and Hive can read them.

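You only need to set the codec when you are writing compressed data; when reading compressed data, the compression is detected automatically.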
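
To illustrate the read side, here is a minimal sketch of a later session, assuming the users table definition above still exists in the metastore (otherwise the CREATE EXTERNAL TABLE statement would have to be repeated first). No compression-related SET statements are issued.

```sql
-- No codec settings here: the .lzo file extension is what lets
-- Hive pick the decompression codec when reading.
SELECT id, username, email
FROM users
LIMIT 10;
```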