Problem description
I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it with the following command:
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
Solution
Use an external table:
CREATE EXTERNAL TABLE sandbox.test (id BIGINT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
If you want to use partitioning with an external table, you are responsible for managing the partition directories yourself. The location specified must be an HDFS directory, not a single file.
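As a rough sketch of what managing the partition directories can look like, each existing directory can be registered as a partition by hand. The table name sandbox.test_partitioned and the per-day directory layout under /user/logs/ are assumptions made for illustration; they do not come from the original question.

```sql
-- Hypothetical partitioned variant of the external table; names and paths are illustrative.
CREATE EXTERNAL TABLE sandbox.test_partitioned (id BIGINT, name STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';

-- Point a partition at a directory that already exists in HDFS (assumed layout).
-- This only records metadata; the files stay where they are.
ALTER TABLE sandbox.test_partitioned
  ADD IF NOT EXISTS PARTITION (day = '20130221')
  LOCATION '/user/logs/day=20130221';
```

Every new day's directory would need its own ADD PARTITION statement before queries can see it.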
If you drop an external table, Hive will NOT delete the source data. If you want to manage your raw files yourself, use external tables. If you want Hive to manage them, let Hive store the data inside its warehouse path.
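One way to check this before dropping anything, sketched here against the sandbox.test table defined above, is to look at the table type that Hive reports:

```sql
-- The "Table Type" row in the output shows EXTERNAL_TABLE for an external table
-- and MANAGED_TABLE for a table stored under the warehouse path.
DESCRIBE FORMATTED sandbox.test;

-- For an external table this removes only the metastore entry;
-- the files under the table's LOCATION are left in place.
DROP TABLE sandbox.test;
```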