Problem description
I am trying to copy data from Kafka into Hive tables using the kafka-hdfs-connector provided by the Confluent Platform. This works, but I would like to bucket the incoming data by time interval. For example, I would like a new partition to be created every 5 minutes.
I tried io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner with partition.duration.ms, but I think I am doing it the wrong way. I see only one partition in the Hive table, and all the data goes into that partition. Something like this:
hive> show partitions test;
OK
partition
year=2016/month=03/day=15/hour=19/minute=03
All the Avro objects are copied into this one partition.
Instead, I would like to have something like this:
hive> show partitions test;
OK
partition
year=2016/month=03/day=15/hour=19/minute=03
year=2016/month=03/day=15/hour=19/minute=08
year=2016/month=03/day=15/hour=19/minute=13
Initially the connector would create the path year=2016/month=03/day=15/hour=19/minute=03 and copy all incoming data into this directory for the next 5 minutes. At the start of the 6th minute it should create a new path, i.e. year=2016/month=03/day=15/hour=19/minute=08, copy the next 5 minutes of data into that directory, and so on.
This is what my config file looks like:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:9000
flush.size=3
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=300000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/
locale=en
timezone=GMT
logs.dir=/kafka-connect/logs
topics.dir=/kafka-connect/topics
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD
It would be really helpful if someone could point me in the right direction. I would be glad to share more details if required; I just did not want this question to go on forever.
Thanks a lot!
Recommended answer
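The minute field in your path.format is wrong. MM is the pattern for month of year, while mm is the pattern for minute of hour, so 'minute'=MM prints the month (03 in your listing) and the path never changes within the hour: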
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/
It should be:
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/
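To see the difference between the two pattern letters, here is a minimal sketch. It assumes the connector formats paths with Joda-Time style patterns, and it uses java.time from the standard library instead, where MM and mm have the same meaning (month and minute). The class name MinutePatternDemo and the render helper are made up for illustration, and only the hour and minute fields are included to keep it short.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MinutePatternDemo {
    // Hypothetical helper: formats a timestamp with the given pattern.
    static String render(String pattern, LocalDateTime time) {
        return DateTimeFormatter.ofPattern(pattern).format(time);
    }

    public static void main(String[] args) {
        // Two timestamps five minutes apart on 2016-03-15.
        LocalDateTime first = LocalDateTime.of(2016, 3, 15, 19, 3);
        LocalDateTime later = LocalDateTime.of(2016, 3, 15, 19, 8);

        String wrong = "'hour'=HH/'minute'=MM"; // MM = month of year
        String right = "'hour'=HH/'minute'=mm"; // mm = minute of hour

        System.out.println(render(wrong, first)); // hour=19/minute=03 (month)
        System.out.println(render(wrong, later)); // hour=19/minute=03 (month again)
        System.out.println(render(right, first)); // hour=19/minute=03
        System.out.println(render(right, later)); // hour=19/minute=08
    }
}
```

With the wrong pattern both timestamps map to the same directory name, which matches the single partition seen in the question. In March the month happens to print as 03, so the first partition looked plausible even though it was not the minute.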