Problem description
I would like to run multiple Hive queries, preferably in parallel rather than sequentially, and store the output of each query in its own CSV file: for example, the output of query1 in csv1, the output of query2 in csv2, and so on. I would run these queries after leaving work so that the output is ready to analyze the next business day. I am interested in using a bash shell script because I could then set up a cron task to run it at a specific time of day.
I know how to store the results of a HiveQL query in a CSV file, one query at a time. I do that with something like the following:
hive -e "SELECT * FROM db.table;" | tr "\t" "," > example.csv
The problem with the above is that I have to monitor when the process finishes and manually start the next query. I also know how to run multiple queries, in sequence, like so:
hive -f hivequeries.hql
Is there a way to combine these two methods? Is there a smarter way to achieve my goals?
Code answers are preferred since I do not know bash well enough to write it from scratch.
This question is a variant of another question: How do I output the results of a HiveQL query to CSV?
Recommended answer
You can run and monitor parallel jobs in a shell script:
#!/bin/bash
# Run the queries as parallel background jobs and wait for all of them to finish.

# Make a pipeline fail when hive fails; by default only tr's exit status is reported.
set -o pipefail

# The trailing ampersand starts each pipeline as a background job.
# hive prints tab-separated columns; tr turns the tabs into commas.
# Add a loop here or add more calls as needed.
hive -e "SELECT * FROM db.table1;" | tr "\t" "," > example1.csv &
hive -e "SELECT * FROM db.table2;" | tr "\t" "," > example2.csv &
hive -e "SELECT * FROM db.table3;" | tr "\t" "," > example3.csv &

# You can also wrap the hive call in a function, do some logging in it,
# and start that function as a background job in the same way.
# Modify this script to fit your needs.

# Wait for every job and count the ones that failed.
FAILED=0
for job in $(jobs -p)
do
    echo "job=$job"
    wait "$job" || FAILED=$((FAILED + 1))
done

# Final status check
if [ "$FAILED" -ne 0 ]; then
    echo "Execution FAILED! ($FAILED)"
    # Do something here: write a log entry, send a message, etc.
    exit 1
fi

# Normal exit; do any follow-up work here.
exit 0
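The comments in the script suggest wrapping the hive call in a function and looping over the queries. A minimal sketch of that idea follows, assuming each query lives in its own .hql file. The function names run_query and run_all and the directory arguments are hypothetical and not part of the original answer.

```shell
#!/bin/bash
# Sketch only: run_query, run_all and the directory layout are illustrative assumptions.

# Run one query file; write the result to <name>.csv and hive's stderr to <name>.log.
run_query() {
    local hql_file=$1 out_dir=$2
    local name
    name=$(basename "$hql_file" .hql)
    hive -f "$hql_file" 2> "$out_dir/$name.log" | tr "\t" "," > "$out_dir/$name.csv"
    # Report hive's exit status rather than tr's.
    return "${PIPESTATUS[0]}"
}

# Start every query in the background, then wait for each one and count failures.
run_all() {
    local query_dir=$1 out_dir=$2
    local failed=0 pid hql_file
    local pids=()
    mkdir -p "$out_dir"
    for hql_file in "$query_dir"/*.hql; do
        [ -e "$hql_file" ] || continue   # skip the unexpanded pattern when no files match
        run_query "$hql_file" "$out_dir" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "failed=$failed"
    [ "$failed" -eq 0 ]
}
```

Calling run_all queries out at the end of the script would write out/NAME.csv and out/NAME.log for each queries/NAME.hql and return non-zero if any hive call failed. To run it unattended, a crontab entry such as 0 19 * * 1-5 /path/to/run_hive_queries.sh (a hypothetical path) would start the script at 19:00 on weekdays.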
There are other ways to run parallel processes in the shell, such as xargs and GNU parallel, and plenty of resources cover them. See also https://www.slashroot.in/how-run-multiple-commands-parallel-linux and https://thoughtsimproved.wordpress.com/2015/05/18/parellel-processing-in-bash/
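As a rough illustration of the xargs route, the sketch below limits how many hive processes run at once. It assumes one .hql file per query and an xargs that supports the -0 and -P options (GNU and BSD xargs do; they are not in POSIX). The function name run_with_xargs and its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch only: run_with_xargs and its arguments are illustrative assumptions.

# Run at most max_jobs hive processes at a time; each .hql file yields one CSV.
run_with_xargs() {
    local query_dir=$1 out_dir=$2 max_jobs=${3:-3}
    mkdir -p "$out_dir"
    find "$query_dir" -name '*.hql' -print0 |
        xargs -0 -P "$max_jobs" -I {} bash -c '
            # $1 is the query file, $2 is the output directory.
            name=$(basename "$1" .hql)
            hive -f "$1" | tr "\t" "," > "$2/$name.csv"
            # Exit with the status of hive, not of tr.
            exit "${PIPESTATUS[0]}"
        ' _ {} "$out_dir"
}
```

For example, run_with_xargs queries out 2 would process the files in queries/ two at a time. xargs returns a non-zero status when any of the commands it started fails, so the caller can still detect failed queries.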