Problem description
I am reading a csv file in Pyspark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file has quoted fields with embedded commas in them which should not be treated as commas. How can I handle this in Pyspark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0.
Here is an example which works in Pandas but fails using Spark:
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('malformed_data.csv')
In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)
In [4]: pdf[['col12','col13','col14']]
Out[4]:
col12 col13 \
0 32 XIY "W" JK, RE LK SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1 NaN OUTKAST#THROOTS~WUTANG#RUNDMC
col14
0 23.0
1 0.0
In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
| col12| col13| col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W"" JK| RE LK"|SOMETHINGLIKEAPHE...|
| null|OUTKAST#THROOTS~W...| 0.0|
+------------------+--------------------+--------------------+
File contents:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
Recommended answer
I noticed that your problematic line has escaping that uses double quotes themselves:
"32 XIY""W""JK,RE LK"
哪个应该是解释器
32 XIYW"JK,RE LK
As described in RFC-4180, page 2 -
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
That's what Excel does, for example, by default.
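For illustration (this snippet is not part of the original answer): Python's built-in csv module follows the same RFC-4180 doubled-quote convention, which is why pandas parses the field correctly. A minimal sketch, assuming the problematic line is held as a string:

import csv
import io

# the problematic field from the data file, escaped RFC-4180 style
line = '"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE\n'
row = next(csv.reader(io.StringIO(line)))
print(row[0])  # 32 XIY "W" JK, RE LK  -- the doubled quotes collapse to a single quote
print(row[1])  # SOMETHINGLIKEAPHENOMENON#... -- the comma inside the quotes did not split the field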
In Spark, however (as of Spark 2.1), escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use a double quote as the escape character:
.option("quote", "\"")
.option("escape", "\"")
This may explain why the comma character was not being interpreted as being inside a quoted column.
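Putting it together, a minimal sketch of the full read: csv_path, the header option and the column names are taken from the question above, the quote/escape options are the fix from this answer, and the rest is the standard DataFrameReader API.

# read the file with the double quote used both as the quote and the escape character
df_raw = (spark.read
          .option("header", "true")
          .option("quote", "\"")   # character that encloses quoted fields
          .option("escape", "\"")  # a doubled quote inside a field is an escaped quote (RFC-4180 style)
          .csv(csv_path))

df_raw.select("col12", "col13", "col14").show(truncate=False)
# col12 should now come back as:  32 XIY "W" JK, RE LK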
Options for the Spark csv format are not documented well on the Apache Spark site, but here is some slightly older documentation which I still find quite useful:
https://github.com/databricks/spark-csv
Update Aug 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.