Problem description
I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases, the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some of these "failures" the data is in fact loaded, even though Hive prints a failure message and exits non-zero. How can I know for sure whether or not the data has been loaded?
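For context, a minimal sketch of the kind of retry wrapper described above; the three-attempt policy, sleep interval, and use of GNU date are illustrative assumptions, not the asker's actual cron job:

#!/bin/bash
# Hypothetical hourly loader: loads the previous hour's log into Hive and
# retries on a non-zero exit code. As noted above, a non-zero exit does
# not always mean the load really failed, which is the root of the problem.
dt=$(date -d '1 hour ago' '+%Y-%m-%d-%H')   # e.g. 2015-08-17-05 (GNU date)
for attempt in 1 2 3; do
  if hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='${dt}')"; then
    exit 0
  fi
  echo "Load attempt ${attempt} failed; retrying in 60s..." >&2
  sleep 60
done
exit 1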
Example for such a "failure" where the data is loaded:
Edit: Alternatively, is there a way to query Hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
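Two ways to inspect this from the outside, sketched under the assumption of the default warehouse location; note that DESCRIBE FORMATTED reports a numFiles parameter only when table/partition statistics have been gathered:

# Partition metadata: the output includes the partition's Location and,
# when statistics are available, parameters such as numFiles and totalSize.
hive -e "DESCRIBE FORMATTED my_table PARTITION (dt='2015-08-17-05')"

# Listing the partition directory shows the actual file names; the default
# warehouse path below is an assumption, so check the Location line above.
# LOAD DATA may rename a file to avoid a name collision, so the names do
# not always match the originals.
hdfs dfs -ls /user/hive/warehouse/my_table/dt=2015-08-17-05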
About "which files have been loaded in a partition":
- if you had used an
EXTERNAL TABLE
and just uploaded your raw datafile in the HDFS directory mapped toLOCATION
, then you could
(a) just run a hdfs dfs -ls
on that directory from command line (or use the equivalent Java API call)(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
- but in your case, you copy the data into a "managed" table, so thereis no way to retrieve the data lineage (i.e. which log file was usedto create each managed datafile)
- ...unless you add explicitly the original file name inside the log file, ofcourse (either on "special" header record, or at the beginning of each record - which can be done with good old
sed
)
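A sketch of the EXTERNAL TABLE variant from the first bullet; the schema, delimiter, and HDFS paths are illustrative assumptions:

-- External table over the raw, unmodified log files:
CREATE EXTERNAL TABLE my_table_raw (
  col_a STRING,
  col_b STRING,
  col_c STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

-- Register each hourly directory as a partition after uploading the files:
ALTER TABLE my_table_raw ADD PARTITION (dt='2015-08-17-05')
LOCATION '/data/weblogs/dt=2015-08-17-05';

-- (b) ask Hive which files feed the partition, via the virtual column:
SELECT DISTINCT INPUT__FILE__NAME
FROM my_table_raw
WHERE dt='2015-08-17-05';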
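And a sketch of the sed trick from the last bullet, run on each web server before the upload; the tab-delimited layout and file name are assumptions:

# Prepend the origin file name to every record so lineage survives the
# copy into a managed table. \t in the replacement is GNU sed syntax;
# with BSD sed, insert a literal tab instead.
f=myfile.log
sed "s|^|${f}\t|" "$f" > "${f}.tagged"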
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
- map your log files to an EXTERNAL TABLE so that you can run an INSERT-SELECT query
- upload the original file name into your managed table using the INPUT__FILE__NAME pseudo-column as source
- add a WHERE NOT EXISTS clause with a correlated sub-query, so that if the source file name is already present in the target then you load nothing more

INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  )
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
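For completeness, a sketch of the table definitions that the INSERT-SELECT above presupposes; the names, types, and storage formats are illustrative, and the essential part is the extra SrcFileName column on the managed table:

-- External staging table mapped onto the raw log files:
CREATE EXTERNAL TABLE Source (
  ColA STRING,
  ColB STRING,
  ColC STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging/weblogs';

-- Managed target table, with one extra column recording the origin file:
CREATE TABLE Target (
  ColA STRING,
  ColB STRING,
  ColC STRING,
  SrcFileName STRING
)
STORED AS ORC;

With these in place, re-running the INSERT-SELECT after a doubtful load becomes harmless: rows coming from a file already recorded in Target are filtered out by the NOT EXISTS.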