Suppose we have a table with 2 columns in SQL
(the SQL source table has no created_date, Updated_date, or Flag columns, and we cannot modify the source table).

 id is the primary key
  id name
  1 AAAAA
  2 BBBBB
  3 CCCCC
  4 ADAEAB
  5 GGAGAG

I used Sqoop to pull this data into Hive as the main table, and that worked fine.
But suppose the source data is then updated as follows:
  id name
  1 ACACA
  2 BASBA
  3 CCHAH
  4 AASDA1
  5 GGAGAG

Problem:
I need to use Sqoop to pull only the updated, inserted, or deleted rows and,
at the same time, apply them to the Hive main table without affecting the
existing rows that have not changed.
I have tried the --incremental option and its related properties, but with no result.

Expected result:
The main table ends up with all 10 records, but it should have 5 records.
What is the solution when there are many more records, such as millions?
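
To make the 10-versus-5 symptom concrete, here is a minimal Python sketch using the two snapshots from the question. It is only an illustration of the row counts, not Sqoop or Hive code; the helper names `append_rows` and `upsert_rows` are made up for this example.

```python
# Day-1 and day-2 snapshots from the question, keyed by the primary key id.
day1 = {1: "AAAAA", 2: "BBBBB", 3: "CCCCC", 4: "ADAEAB", 5: "GGAGAG"}
day2 = {1: "ACACA", 2: "BASBA", 3: "CCHAH", 4: "AASDA1", 5: "GGAGAG"}


def append_rows(main, incoming):
    """Blind append: every incoming row is added, so each id appears twice."""
    return list(main) + list(incoming)


def upsert_rows(main, incoming):
    """Merge by key: an incoming row replaces the existing row with the same id."""
    merged = dict(main)
    merged.update(dict(incoming))
    return sorted(merged.items())


appended = append_rows(day1.items(), day2.items())
merged = upsert_rows(day1.items(), day2.items())

print(len(appended))  # 10
print(len(merged))    # 5
```

Appending the second import to the main table duplicates every id, while merging on the primary key keeps one row per id, which is the behavior the question asks for.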

Requirement:
On day 1 I have 1 million records.
On day 2 I have that 1 million plus the current day's new and updated records, say 2 million in total.
On day 2 I need to pull only the updated and newly inserted data rather than the whole data set.
Also, can anyone help me with how to combine the day 1 Hive data with the day 2 updated data?
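
Because the source table has no timestamp or flag column, one possible way to find the changed rows is to compare a fresh snapshot with the previous one by primary key. This is an assumption for illustration, not something the question or the answer prescribes, and it still requires reading the whole source table. The function name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(previous, current):
    """Classify rows by comparing two {id: name} snapshots on the primary key."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Snapshots taken from the question.
previous = {1: "AAAAA", 2: "BBBBB", 3: "CCCCC", 4: "ADAEAB", 5: "GGAGAG"}
current = {1: "ACACA", 2: "BASBA", 3: "CCHAH", 4: "AASDA1", 5: "GGAGAG"}

inserted, updated, deleted = diff_snapshots(previous, current)
print(sorted(updated))  # [1, 2, 3, 4]
```

With the sample data, rows 1 to 4 are classified as updated and row 5 is unchanged; nothing was inserted or deleted. At the scale of millions of rows the same comparison would be done as a join inside Hive rather than in memory.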


If anyone has another solution or an alternative approach, please explain it
clearly, because I am new to Hadoop.

Best answer

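  • Create an internal table (base_table) for the initial load.
  • Create an external table (incremental_table) for the incremental and updated records.
  • Join base_table and incremental_table on the primary key and max_date, and create a view on top of the result.
  • Create a temporary table (report_table) on top of the view.
  • Drop base_table and insert the data from report_table.
    Please refer to the following link:
    https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html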
  • Regarding "sql - Import only the updated records from SQL to Hive", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24216430/
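
The reconcile steps in the answer can be sketched with Python's built-in sqlite3 module standing in for Hive. This is a rough illustration under assumptions, not the HiveQL from the linked guide: the `modified_date` column and its values are hypothetical load stamps (the source table in the question has no such column), `reconcile_view` is a made-up view name, and the final step deletes and reinserts rows instead of dropping the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# base_table holds the first full load; incremental_table holds later changes.
# modified_date is an assumed load stamp, not a column of the real source table.
cur.execute("CREATE TABLE base_table (id INTEGER, name TEXT, modified_date TEXT)")
cur.execute("CREATE TABLE incremental_table (id INTEGER, name TEXT, modified_date TEXT)")

cur.executemany("INSERT INTO base_table VALUES (?, ?, ?)", [
    (1, "AAAAA", "2016-01-01"), (2, "BBBBB", "2016-01-01"),
    (3, "CCCCC", "2016-01-01"), (4, "ADAEAB", "2016-01-01"),
    (5, "GGAGAG", "2016-01-01"),
])
cur.executemany("INSERT INTO incremental_table VALUES (?, ?, ?)", [
    (1, "ACACA", "2016-01-02"), (2, "BASBA", "2016-01-02"),
    (3, "CCHAH", "2016-01-02"), (4, "AASDA1", "2016-01-02"),
])

# View: for each id, keep only the row with the latest modified_date.
cur.execute("""
    CREATE VIEW reconcile_view AS
    SELECT t1.id, t1.name, t1.modified_date
    FROM (SELECT * FROM base_table
          UNION ALL
          SELECT * FROM incremental_table) AS t1
    JOIN (SELECT id, MAX(modified_date) AS max_modified
          FROM (SELECT * FROM base_table
                UNION ALL
                SELECT * FROM incremental_table) AS u
          GROUP BY id) AS t2
      ON t1.id = t2.id AND t1.modified_date = t2.max_modified
""")

# Materialize the view into report_table, then reload base_table from it.
cur.execute("CREATE TABLE report_table AS SELECT * FROM reconcile_view")
cur.execute("DELETE FROM base_table")
cur.execute("INSERT INTO base_table SELECT * FROM report_table")
conn.commit()

rows = cur.execute("SELECT id, name FROM base_table ORDER BY id").fetchall()
print(len(rows))  # 5
```

After the reload, base_table again has one row per id, with the newer names for ids 1 to 4. This sketch covers inserts and updates only; a row deleted at the source would stay in base_table unless it is handled separately.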
