Problem description
We are using Cloudera CDH 4 and are able to import tables from our Oracle databases into our HDFS warehouse as expected. The problem is that we have tens of thousands of tables in our databases, and sqoop only supports importing one table at a time.
What options are available for importing multiple tables into HDFS or Hive? For example, what would be the best way to import 200 tables at a time from Oracle into HDFS or Hive?
The only solution I have seen so far is to create a sqoop job for each table import and then run them all individually. Since Hadoop is designed to work with large datasets, it seems like there should be a better way.
-
Assuming that the sqoop configuration is the same for every table, you can list all the tables you need to import and then iterate over them, launching a sqoop job for each one (ideally launching them asynchronously). You can run the following query to fetch the list of tables from Oracle:
SELECT owner, table_name FROM dba_tables
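A minimal shell sketch of that loop, assuming the table names have been saved one per line to a file named tables.txt, and that the connection string, credentials, and HDFS paths below are placeholders you would replace with your own:

# tables.txt is assumed to hold one table name per line (e.g. saved from the
# query above); dbhost, ORCL, SCOTT, PASSWORD and the paths are placeholders.
while read -r table; do
  sqoop import \
    --connect "jdbc:oracle:thin:@//dbhost:1521/ORCL" \
    --username SCOTT --password PASSWORD \
    --table "$table" \
    --target-dir "/user/warehouse/$table" \
    -m 4 &    # "&" backgrounds the job so the imports run in parallel
done < tables.txt
wait          # block until every background import has finished

Backgrounding every job at once can overwhelm both the cluster and the database, so in practice you would throttle the loop, for example by calling wait after every batch of N jobs.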
-
Sqoop does offer an option to import all the tables in a database: the import-all-tables tool. There are some limitations though: each table must have a single-column primary key, all columns of each table are imported, and you cannot restrict the rows with a WHERE clause.
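A sketch of the invocation, again with placeholder connection details; --warehouse-dir names a parent HDFS directory, and each table lands in its own subdirectory beneath it:

# dbhost, ORCL, SCOTT, and PASSWORD are placeholders; add --hive-import to load the tables into Hive instead.
sqoop import-all-tables \
  --connect "jdbc:oracle:thin:@//dbhost:1521/ORCL" \
  --username SCOTT --password PASSWORD \
  --warehouse-dir /user/warehouse \
  -m 4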
Alternatively, modify the sqoop source code and recompile it to fit your needs. The sqoop codebase is well documented and nicely arranged.