Question
I have a CSV file that is structured this way:
Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"
I have two problems in reading this file.
- Ignore the header and the blank row
- Commas within the values are not separators
Here is what I tried:
df = (sc.textFile("myFile.csv")
      .map(lambda line: line.split(","))    # split by comma
      .filter(lambda line: len(line) == 2)  # this was meant to skip the first two rows
      .collect())
However, this did not work: the commas within the values were being read as separators, and len(line) was returning 4 instead of 2.
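The failure is easy to reproduce in plain Python: str.split treats every comma the same, while the standard csv module respects the quote characters. A minimal local check, using one data line from the file above:

import csv

line = '"1,200","1,456"'
print(line.split(","))           # ['"1', '200"', '"1', '456"'] -> 4 fields
print(next(csv.reader([line])))  # ['1,200', '1,456'] -> 2 fields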
I tried another approach:
data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped
The idea was to then use filter and not read the headers. But when I tried to print the headers, I got encoded values.
[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]
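The interleaved \x00 bytes are a strong hint that the file is UTF-16 encoded: sc.textFile decodes lines as UTF-8, so the extra NUL byte of each UTF-16 code unit survives into the strings. A quick way to confirm that suspicion locally, outside Spark:

# decode with an explicit encoding; if the guess is right, this prints 'Header' cleanly
with open("myFile.csv", encoding="utf-16") as f:
    print(f.readline())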
What is the correct way to read a CSV file and skip the first two rows?
Answer
The answer by Zlidime had the right idea. The working solution is this:
import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = sc.textFile("file.csv") \
    .mapPartitions(lambda partition: csv.reader(
        # strip the UTF-16 NUL bytes, then let csv.reader handle the quoted commas
        [line.replace('\0', '') for line in partition],
        delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) == 2 and line[0] != 'Col1') \
    .toDF(customSchema)
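If the goal is simply to drop a fixed number of leading lines, another common pattern is to index the rows with zipWithIndex and filter on position; a minimal sketch under the same assumptions (2-column file, UTF-16 NUL bytes to strip, customSchema as above):

import csv

indexed = sc.textFile("file.csv").zipWithIndex()
# keep everything after the first three lines: Header, blank row, column names
df = indexed.filter(lambda pair: pair[1] > 2) \
    .map(lambda pair: next(csv.reader([pair[0].replace('\0', '')]))) \
    .toDF(customSchema)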