Problem Description
I have a data frame where I am replacing the default delimiter , with |^|. It works fine and I get the expected result, except where a , appears inside a record. For example, I have one such record like the one below:
4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|
So there is a , in the 4th field.
Now I am doing the following to replace the , :
val dfMainOutputFinal = dfMainOutput.na.fill("").select(
  $"DataPartition",
  $"StatementTypeCode",
  concat_ws("|^|",
    dfMainOutput.schema.fieldNames
      .filter(_ != "DataPartition")
      .map(c => col(c)): _*
  ).as("concatenated"))
val headerColumn = df.columns.filter(v => (!v.contains("^") && !v.contains("_c"))).toSeq
val header = headerColumn.dropRight(1).mkString("", "|^|", "|!|")
val dfMainOutputFinalWithoutNull = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "null", ""))
  .withColumnRenamed("concatenated", header)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("header", "true")
.option("codec", "gzip")
.save("s3://trfsmallfffile/FinancialLineItem/output")
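As a side note, the regexp_replace(col("concatenated"), "null", "") step above replaces every occurrence of the substring null, so it can also corrupt legitimate values that merely contain it, such as Annulled. A minimal pure-Scala sketch of the pitfall and a delimiter-aware alternative (the sample values are made up for illustration):

```scala
// Sample concatenated row; the values are assumptions, not real data.
val concatenated = "914|^|null|^|Annulled|^|null"

// Naive substring replacement also eats the "null" inside "Annulled".
val naive = concatenated.replaceAll("null", "")
println(naive) // 914|^||^|Aned|^|

// Splitting on the delimiter and blanking only whole-field "null" values
// leaves other data intact.
val cleaned = concatenated
  .split("\\|\\^\\|", -1)
  .map(f => if (f == "null") "" else f)
  .mkString("|^|")
println(cleaned) // 914|^||^|Annulled|^|
```

On a real data frame this per-field cleanup would be done before concatenation (e.g. in the na.fill step) rather than on the joined string.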
And I get output like this in the saved output part file:
"4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|"
My problem is the double quote " at the start and end of the result.
If I remove the comma, then I get the correct result, like below:
4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense)net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|
Recommended Answer
This is a standard CSV feature. If the delimiter occurs in the actual data (referred to as delimiter collision), the field is enclosed in quotes.
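The quoting rule can be sketched in plain Scala (a minimal RFC 4180-style illustration, not Spark's actual writer): a field is wrapped in double quotes whenever it contains the delimiter, a quote, or a newline, and any embedded quotes are doubled.

```scala
// Minimal sketch of CSV quoting to illustrate "delimiter collision":
// quote a field iff it contains the delimiter, a quote, or a newline.
def csvField(value: String, delimiter: Char = ','): String =
  if (value.exists(ch => ch == delimiter || ch == '"' || ch == '\n'))
    "\"" + value.replace("\"", "\"\"") + "\""
  else value

val row = Seq("4295859078", "Income/(Expense),net", "I")
println(row.map(csvField(_)).mkString(","))
// 4295859078,"Income/(Expense),net",I  -- only the colliding field is quoted
```

This is why the comma in the 4th field triggers the quotes around the whole concatenated value: from the writer's point of view the single output column contains the , delimiter.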
You can try

df.write.option("delimiter", somechar)

where somechar should be a character that does not occur in your data.
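Before settling on a somechar, it is worth verifying that the candidate really never occurs in the data. A small sketch of that check (the sample values and the candidate ; are assumptions, not taken from the original data):

```scala
// Sample field values (assumed for illustration) and a candidate delimiter.
val sampleValues = Seq("Income/(Expense),net", "505096", "IIII", "I")
val candidate = ";"

// The candidate is safe only if no value contains it.
val isSafe = sampleValues.forall(v => !v.contains(candidate))
println(isSafe) // true: ';' does not occur in this sample
```

On a real data frame the same check would be a filter over the concatenated column (counting rows that contain the candidate) rather than a local Seq.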
A more robust solution would be to disable quoteMode entirely, since you are writing a data frame with only one column.
dfMainOutputFinalWithoutNull.repartition(1)
.write.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("quoteMode", "NONE")
//.option("delimiter", ";") // assuming `;` is not present in data
.option("header", "true")
.option("codec", "gzip")
.save("s3://trfsmallfffile/FinancialLineItem/output")
This concludes this article on how adding a custom delimiter adds double quotes to the final Spark data frame CSV output. We hope the recommended answer is helpful.