Question
I have a CSV file that is structured this way:
Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"
I have two problems in reading this file.
- Ignore the header and the blank row
- Commas within the values are not separators
Here is what I tried:
df = (sc.textFile("myFile.csv")
      .map(lambda line: line.split(","))    # split by comma
      .filter(lambda line: len(line) == 2)  # this was meant to skip the first two rows
      .collect())
However, this did not work: the commas within the values were being read as separators, and len(line) was returning 4 instead of 2.
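The failure is easy to reproduce in plain Python: str.split treats every comma the same, while the standard csv module respects the quote characters. A minimal local check, using one data line from the file above:

import csv

line = '"1,200","1,456"'
print(line.split(","))           # ['"1', '200"', '"1', '456"'] -> 4 fields
print(next(csv.reader([line])))  # ['1,200', '1,456'] -> 2 fields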
I tried another approach:
data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped
The idea was to then use filter and not read the headers. But when I tried to print the headers, I got encoded values.
[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]
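The interleaved \x00 bytes are a strong hint that the file is UTF-16 encoded: sc.textFile decodes lines as UTF-8, so the extra NUL byte of each UTF-16 code unit survives into the strings. A quick way to confirm that suspicion locally, outside Spark:

# decode with an explicit encoding; if the guess is right, this prints 'Header' cleanly
with open("myFile.csv", encoding="utf-16") as f:
    print(f.readline())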
What is the correct way to read a CSV file and skip the first two rows?
Answer
The answer by Zlidime had the right idea. The working solution is this:
import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = sc.textFile("file.csv") \
    .mapPartitions(lambda partition: csv.reader(
        # strip the UTF-16 NUL bytes, then let csv.reader handle the quoted commas
        [line.replace('\0', '') for line in partition],
        delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) == 2 and line[0] != 'Col1') \
    .toDF(customSchema)
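If the goal is simply to drop a fixed number of leading lines, another common pattern is to index the rows with zipWithIndex and filter on position; a minimal sketch under the same assumptions (2-column file, UTF-16 NUL bytes to strip, customSchema as above):

import csv

indexed = sc.textFile("file.csv").zipWithIndex()
# keep everything after the first three lines: Header, blank row, column names
df = indexed.filter(lambda pair: pair[1] > 2) \
    .map(lambda pair: next(csv.reader([pair[0].replace('\0', '')]))) \
    .toDF(customSchema)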