Problem description
I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases, the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some of these "failures" the data is in fact loaded, even though Hive prints a failure message and exits non-zero. How can I know for sure whether or not the data has been loaded?
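For context, a minimal sketch of the kind of retry wrapper described above; the three-attempt policy, sleep interval, and use of GNU date are illustrative assumptions, not the asker's actual cron job:

#!/bin/bash
# Hypothetical hourly loader: loads the previous hour's log into Hive and
# retries on a non-zero exit code. As noted above, a non-zero exit does
# not always mean the load really failed, which is the root of the problem.
dt=$(date -d '1 hour ago' '+%Y-%m-%d-%H')   # e.g. 2015-08-17-05 (GNU date)
for attempt in 1 2 3; do
  if hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='${dt}')"; then
    exit 0
  fi
  echo "Load attempt ${attempt} failed; retrying in 60s..." >&2
  sleep 60
done
exit 1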
Example for such a "failure" where the data is loaded:
Edit: Alternatively, is there a way to query Hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
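Two ways to inspect this from the outside, sketched under the assumption of the default warehouse location; note that DESCRIBE FORMATTED reports a numFiles parameter only when table/partition statistics have been gathered:

# Partition metadata: the output includes the partition's Location and,
# when statistics are available, parameters such as numFiles and totalSize.
hive -e "DESCRIBE FORMATTED my_table PARTITION (dt='2015-08-17-05')"

# Listing the partition directory shows the actual file names; the default
# warehouse path below is an assumption, so check the Location line above.
# LOAD DATA may rename a file to avoid a name collision, so the names do
# not always match the originals.
hdfs dfs -ls /user/hive/warehouse/my_table/dt=2015-08-17-05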
About "which files have been loaded in a partition":
- if you had used an
EXTERNAL TABLE
and just uploaded your raw datafile in the HDFS directory mapped toLOCATION
, then you could
(a) just run a hdfs dfs -ls
on that directory from command line (or use the equivalent Java API call)(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
- but in your case, you copy the data into a "managed" table, so thereis no way to retrieve the data lineage (i.e. which log file was usedto create each managed datafile)
- ...unless you add explicitly the original file name inside the log file, ofcourse (either on "special" header record, or at the beginning of each record - which can be done with good old
sed
)
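A sketch of the EXTERNAL TABLE variant from the first bullet; the schema, delimiter, and HDFS paths are illustrative assumptions:

-- External table over the raw, unmodified log files:
CREATE EXTERNAL TABLE my_table_raw (
  col_a STRING,
  col_b STRING,
  col_c STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

-- Register each hourly directory as a partition after uploading the files:
ALTER TABLE my_table_raw ADD PARTITION (dt='2015-08-17-05')
LOCATION '/data/weblogs/dt=2015-08-17-05';

-- (b) ask Hive which files feed the partition, via the virtual column:
SELECT DISTINCT INPUT__FILE__NAME
FROM my_table_raw
WHERE dt='2015-08-17-05';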
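And a sketch of the sed trick from the last bullet, run on each web server before the upload; the tab-delimited layout and file name are assumptions:

# Prepend the origin file name to every record so lineage survives the
# copy into a managed table. \t in the replacement is GNU sed syntax;
# with BSD sed, insert a literal tab instead.
f=myfile.log
sed "s|^|${f}\t|" "$f" > "${f}.tagged"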
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
- map your log files to an EXTERNAL TABLE so that you can run an INSERT-SELECT query
- upload the original file name into your managed table using the INPUT__FILE__NAME pseudo-column as source
- add a WHERE NOT EXISTS clause with a correlated sub-query, so that if the source file name is already present in the target then you load nothing more

INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  )
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
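For completeness, a sketch of the table definitions that the INSERT-SELECT above presupposes; the names, types, and storage formats are illustrative, and the essential part is the extra SrcFileName column on the managed table:

-- External staging table mapped onto the raw log files:
CREATE EXTERNAL TABLE Source (
  ColA STRING,
  ColB STRING,
  ColC STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging/weblogs';

-- Managed target table, with one extra column recording the origin file:
CREATE TABLE Target (
  ColA STRING,
  ColB STRING,
  ColC STRING,
  SrcFileName STRING
)
STORED AS ORC;

With these in place, re-running the INSERT-SELECT after a doubtful load becomes harmless: rows coming from a file already recorded in Target are filtered out by the NOT EXISTS.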