Databricks code doesn't run faster on a larger cluster

I didn't see a specific MSDN forum related to Databricks, and the Databricks forums seem pretty quiet, so I'll try here.

I'm very new to Azure Databricks, but I was able to get some Python code that runs in a local Anaconda Python notebook environment running on Azure Databricks as well. In a side-by-side test on the exact same data (an ~18,000 row sample) and code, my local machine took 15 minutes versus 19 minutes on Databricks, though I was on the smallest cluster configuration available. So I turned the cluster up much higher: driver and workers on Standard_D32s_v3 (128 GB RAM, 32 cores, 6 DBU), with a minimum of 2 workers and a maximum of 4. I ran the code again and it took the same amount of time.

The heart of the code, which takes the ~20 minutes, is below; all other steps take under a minute or two. Am I missing something obvious and basic about how Spark/Databricks works? I was expecting to be able to scale resources way up, run the code on a ton of data, then spin the cluster down. df2 is a pandas DataFrame, and there's an R model inside the Python loop. I'm not sure whether there's a problem with the way the code is written, such that it can't run on scaled-out resources when R and Python are mixed in the same code block:

    for customerid, dataForCustomer in df2.groupby(by=['customer_id']):
        startYear = dataForCustomer.head(1).iloc[0].yr
        startMonth = dataForCustomer.head(1).iloc[0].mnth
        endYear = dataForCustomer.tail(1).iloc[0].yr
        endMonth = dataForCustomer.tail(1).iloc[0].mnth

        # Creating a time series object
        customerTS = stats.ts(dataForCustomer.usage.astype(int),
                              start=base.c(startYear, startMonth),
                              end=base.c(endYear, endMonth),
                              frequency=12)

        r.assign('customerTS', customerTS)

        ## Here comes the R code piece
        try:
            seasonal = r('''
                fit <- tbats(customerTS, seasonal.periods = 12, use.parallel = TRUE)
                fit$seasonal
            ''')
        except:
            seasonal = 1

        df_list.append({'customer_id': customerid, 'seasonal': seasonal})

    seasonal_output = pa.DataFrame(df_list)

Solution

You may refer to Optimizing Performance and see if it helps. Also, please post your query at https://forums.databricks.com/topics/azure+databricks.html or https://stackoverflow.com/questions/tagged/azure-databricks to reach a better audience.
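As to why scaling the cluster up has no effect here: a plain Python for loop over a pandas DataFrame executes entirely on the driver node, so adding or enlarging workers cannot parallelize it, regardless of whether R is mixed in. Below is a minimal sketch of one way to distribute the per-customer work using Spark's groupBy().applyInPandas() (PySpark 3.0+), which runs a pandas function on each group across the executors. It is not the asker's original model: the R tbats() fit is swapped for a statsmodels seasonal decomposition purely for illustration, the seasonal-strength summary is an invented stand-in, and the output schema types are assumptions. The column names (customer_id, yr, mnth, usage) are taken from the question.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def fit_seasonal(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds all rows for one customer_id and runs on an executor,
        # not the driver.
        try:
            # Stand-in for the R tbats() fit; NOT the original model.
            from statsmodels.tsa.seasonal import seasonal_decompose
            ts = pdf.sort_values(['yr', 'mnth'])['usage'].astype(float)
            result = seasonal_decompose(ts, period=12, model='additive')
            # Invented summary statistic, for illustration only.
            seasonal = float(result.seasonal.abs().mean())
        except Exception:
            seasonal = 1.0
        return pd.DataFrame({'customer_id': [pdf['customer_id'].iloc[0]],
                             'seasonal': [seasonal]})

    # Convert the pandas frame to a Spark DataFrame so the groups can be
    # distributed (df2 is the pandas DataFrame from the question).
    sdf = spark.createDataFrame(df2)

    # Schema types ('long', 'double') are assumptions about the data.
    seasonal_output = (sdf.groupBy('customer_id')
                          .applyInPandas(fit_seasonal,
                                         schema='customer_id long, seasonal double')
                          .toPandas())

If the R model itself is required, the same applyInPandas pattern can call rpy2 inside fit_seasonal, but R and the forecast package would then have to be installed on every worker node, not just the driver.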