Problem description
I'm using this tweets dataset with PySpark to process it and find trends based on each tweet's location, but I run into a problem when I try to create the DataFrame. I create it with
spark.read.options(header="True").csv("hashtag_donaldtrump.csv")
but when I look at the tweets column, this is the result I get:
Do you know how I can clean the CSV file so that Spark can process it? Thank you in advance!
Recommended answer
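It looks like a multiline CSV: some quoted fields (most likely the tweet text) contain line breaks, so a single record spans several lines of the file. Try reading it with the multiLine option: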
df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True)
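To see why a file like this confuses a line-by-line reader, here is a minimal sketch using only Python's standard csv module rather than Spark. The sample data and its column names are made up for illustration and are not taken from hashtag_donaldtrump.csv. It compares the number of physical lines with the number of records a quote-aware parser finds, which is the same distinction multiLine=True asks Spark to respect.

```python
import csv
import io

# Hypothetical sample (not from the real dataset): the second record's tweet
# text contains a line break inside a quoted field, as raw tweet text often does.
sample = (
    "tweet_id,tweet,country\n"
    '1,"single-line tweet",US\n'
    '2,"first line of tweet\nsecond line of tweet",FR\n'
)

# Splitting on newlines ignores quoting, so the second tweet is cut in two.
physical_lines = sample.splitlines()

# A quote-aware parser keeps the embedded newline inside the field.
rows = list(csv.reader(io.StringIO(sample)))

print(len(physical_lines))  # 4 physical lines (header + 3)
print(len(rows))            # 3 records (header + 2)
```

If the number of records differs from the number of physical lines in this way, a reader that treats every line as a row will spill tweet text into the other columns, which matches the symptom described in the question.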