This article describes how to handle the situation where a Spark DataFrame is created successfully but cannot be written to local disk. It should be a useful reference for anyone running into the same problem.

Problem description

I am using the IntelliJ IDE to execute Spark Scala code on the Microsoft Windows platform.

I have four Spark DataFrames of around 30,000 records each, and as part of my requirement I tried to take one column from each of them.

I used a Spark SQL query to do it, and it executed successfully. When I call DF.show() or DF.count(), I can see the results on the screen, but when I try to write the DataFrame to my local disk (a Windows directory), the job aborts with the error below:

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127) ... 28 more Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 D:\Test_Output_File2_temporary\0_temporary\attempt_20170830194047_0031_m_000000_0\part-00000-85c32c55-e12d-4433-979d-ccecb2fcd341.csv at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770) at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:225) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:209) at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVRelation.scala:208) at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.(FileFormatWriter.scala:234) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Picked up _JAVA_OPTIONS: -Xmx512M

Process finished with exit code 1

I am not able to understand where it went wrong. Can anybody explain how to overcome this issue?

UPDATE: Please note that I was able to write the same files until yesterday, and no changes were made to my system or to the IDE's configuration. So I don't understand why it was working until yesterday and why it is not working now.

There was a similar post in this link: (null) entry in command string exception in saveAsTextFile() on Pyspark, but they are using PySpark in a Jupyter notebook, whereas my issue is with the IntelliJ IDE.

Super-simplified code that writes the output file to local disk:

val Test_Output = spark.sql("select A.Col1, A.Col2, B.Col2, C.Col2, D.Col2 from A, B, C, D where A.primaryKey = B.primaryKey and B.primaryKey = C.primaryKey and C.primaryKey = D.primaryKey and D.primaryKey = A.primaryKey")
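
For context, a query like this assumes the four DataFrames were first registered as temporary views under the names used in the SQL. A minimal sketch of that step, where dfA through dfD are hypothetical variables holding the four DataFrames:

dfA.createOrReplaceTempView("A")
dfB.createOrReplaceTempView("B")
dfC.createOrReplaceTempView("C")
dfD.createOrReplaceTempView("D")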

val Test_Output_File = Test_Output.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "0")
  .save("D:/Test_Output_File")

Recommended answer

Finally, I fixed it myself. I used the .persist() method while creating the DataFrames, and that allowed me to write the output files without any error, though I do not understand the logic behind it.
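
Applied to the code above, the fix amounts to persisting the joined DataFrame before writing it out. A minimal sketch of how that could look; only the .persist() call itself comes from the answer, and the trailing unpersist() is just an optional clean-up:

// Materialize the joined result before the write
// (persist() with no arguments defaults to MEMORY_AND_DISK for Datasets)
Test_Output.persist()

Test_Output.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "0")
  .save("D:/Test_Output_File")

// Release the cached data once the file has been written
Test_Output.unpersist()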

Thank you for your valuable input on this.

This concludes the article on a Spark DataFrame being created successfully but failing to write to local disk. We hope the answer above is helpful.
