How to read a multiline CSV file in PySpark

Question

I'm using this tweets dataset with PySpark to process it and find trends by tweet location. However, I run into a problem when I try to create the dataframe. I'm using spark.read.options(header="True").csv("hashtag_donaldtrump.csv") to create it, but when I look at the tweets column, this is the result I get:

Do you know how I can clean the CSV file so that Spark can process it? Thank you in advance!

Recommended answer

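It looks like a multiline CSV, meaning some quoted fields (here, probably the tweet text) contain line breaks. By default Spark treats every physical line as a separate record, so those tweets get split across rows. Try enabling the multiLine option: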

df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True)
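To see why the option matters, here is a minimal sketch that uses only Python's standard csv module rather than Spark. The sample data and column names (tweet_id, tweet, country) are made up for illustration and are not taken from the real dataset. It compares line-by-line splitting, which is roughly what happens when multi-line parsing is off, with quote-aware parsing, which keeps a quoted field containing a newline inside one record.

```python
import csv
import io

# Hypothetical sample data: the second record's "tweet" field contains a
# line break inside double quotes, as a multi-line tweet might when exported.
sample = (
    'tweet_id,tweet,country\n'
    '1,"single line tweet",US\n'
    '2,"first line\nsecond line",UK\n'
)

# Line-oriented approach: every physical line is treated as one record,
# so the multi-line tweet is cut into two broken rows.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware approach: the CSV parser decides where each record ends,
# so the newline inside the quotes stays part of the tweet field.
parsed_rows = list(csv.reader(io.StringIO(sample, newline="")))

print(len(naive_rows))   # number of physical lines, header included
print(len(parsed_rows))  # number of logical records, header included
print(parsed_rows[2])    # the multi-line tweet as a single record
```

If the columns still look shifted after setting multiLine=True, the file may also escape quotes inside tweets in a way that differs from Spark's defaults. In that case the quote and escape options of spark.read.csv are worth checking, although whether that applies to this particular file is an assumption.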
