Problem Description
In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files contain Japanese characters, and they can also contain the ^M character (u000D), so I need to parse them as multiline.
First I used the following code to read the CSV files:
import org.apache.spark.sql.{DataFrame, DataFrameReader}
import org.apache.spark.sql.types.StructType

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    dataFrameReader
      .option("delimiter", "\u0001")   // fields are separated by the SOH character
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")     // records may contain embedded ^M (\u000D)
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .schema(schema)
      .csv(s3Path)
  }
}
But when I read the DataFrame using this method, all the Japanese characters are garbled.
After doing some tests I found that if I read the same S3 file using "spark.sparkContext.textFile(path)", the Japanese characters are decoded properly.
So I tried this approach:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

implicit class SparkSessionImplicits(spark: SparkSession) {
  def readTeradataCSV(schema: StructType, s3Path: String) = {
    import spark.sqlContext.implicits._
    spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .schema(schema)
      // read raw text first (keeps the encoding intact), strip ^M, then parse as CSV
      .csv(spark.sparkContext.textFile(s3Path).map(str => str.replaceAll("\u000D", " ")).toDS())
  }
}
Now the encoding issue is fixed. However, multiline parsing does not work properly and lines are broken near the ^M character, even though I tried to replace ^M using str.replaceAll("\u000D", " ").
Any tips on how to read the Japanese characters correctly using the first method, or how to handle multi-line records using the second method?
UPDATE: This encoding issue happens when the app runs on the Spark cluster. When I ran the app locally, reading the same S3 file, the encoding works just fine.
Recommended Answer
Some things are in the code but not (yet) in the docs. Did you try setting your line separator explicitly, thus avoiding the "multiLine" workaround made necessary by ^M?
From the unit tests for Spark "TextSuite", branch 2.4:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
def testLineSeparator(lineSep: String): Unit = {
  test(s"SPARK-23577: Support line separator - lineSep: '$lineSep'") {
    ...
  }
}

// scalastyle:off nonascii
Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString, "아").foreach { lineSep =>
  testLineSeparator(lineSep)
}
// scalastyle:on nonascii
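As a hedged illustration (not from the original post): the text datasource already accepts this lineSep option in Spark 2.4, so the raw file could be read with an explicit record terminator and then handed to the CSV parser, much like the second approach in the question. The S3 path and the choice of "\n" as the real record terminator are assumptions:

// Sketch for Spark 2.4: split records on "\n" only, so an embedded \u000D (^M)
// stays inside its record instead of starting a new one.
val rawRecords = spark.read
  .option("lineSep", "\n")                    // text datasource option, SPARK-23577
  .textFile("s3://my-bucket/input/file.csv")  // hypothetical path; returns Dataset[String]

val df = spark.read
  .option("delimiter", "\u0001")
  .schema(schema)
  .csv(rawRecords)                            // DataFrameReader.csv(Dataset[String])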
From the source code for CSV options parsing, branch 3.0:
https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
  require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
  require(sep.length == 1, "'lineSep' can contain only 1 character.")
  sep
}

val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
  lineSep.getBytes(charset)
}
So it looks like CSV does not support strings as line delimiters, only single characters, because it relies on a Hadoop library. I hope that's fine in your case.
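A minimal sketch of what the suggestion could look like once the CSV reader itself honours lineSep (branch 3.0 in the snippet above); the bucket path and the choice of "\n" as the real record terminator are assumptions, not from the original post:

// Sketch for Spark 3.0+: declare "\n" as the only record terminator so that an
// embedded \u000D (^M) no longer forces the multiLine workaround.
val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("encoding", "UTF-8")
  .option("lineSep", "\n")                 // single character only, per CSVOptions above
  .schema(schema)
  .csv("s3://my-bucket/input/file.csv")    // hypothetical path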
The matching JIRAs are:
SPARK-21289: Text based formats do not support custom end-of-line delimiters ...
SPARK-23577: specific to the text datasource > fixed in v2.4.0