Problem description
I am trying to execute a Spark job on an on-demand HDInsight cluster using Azure Data Factory.
The documentation clearly indicates that ADF (v2) does not support a Data Lake linked service for an on-demand HDInsight cluster, so one has to copy the data to blob storage with a copy activity and then execute the job. But this workaround seems hugely resource-expensive when there are a billion files on the data lake. Is there an efficient way to access the Data Lake files, either from the Python script that executes the Spark jobs or by any other means of accessing the files directly?
P.S. Is there a possibility of doing something similar from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses blob storage, but I want an on-demand Spark cluster that accesses the data lake.
P.P.S. Thanks in advance.
Recommended answer
Currently, we don't support the ADLS data store with an HDI Spark cluster in ADF v2. We plan to add that in the coming months. Until then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.