Problem description
I am trying to read the Sentiment140.csv file available on Kaggle: https://www.kaggle.com/kazanova/sentiment140
This is my code:
import pandas as pd
import os
cols = ['sentiment','id','date','query_string','user','text']
BASE_DIR = ''
df = pd.read_csv(os.path.join(BASE_DIR, 'Sentiment140.csv'),header=None, names=cols)
It gives me a UnicodeDecodeError from the "utf-8" codec.
What I would like to know:
1) How do I solve this issue?
2) Based on the error, where can I see which encoding I should use instead of "utf-8"?
3) Will using another encoding cause me other issues later on?
Thanks in advance.
P.S. I am using Python 3 on a Mac.
Recommended answer
It turns out that passing encoding="latin-1" to pd.read_csv works. You also have to specify the column names, otherwise pandas will use the first row as the column names. This is how lousy real-world datasets can be, haha.
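The answer's own code is not preserved here, so the following is only a minimal sketch of the fix it describes. To keep it runnable without the Kaggle download, it writes a tiny stand-in file (the sample row and the file name sample.csv are made up for illustration) containing one byte that is not valid UTF-8, then reads it with the column names from the question and encoding="latin-1".

```python
import os
import tempfile

import pandas as pd

cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']

# Hypothetical stand-in for Sentiment140.csv: a single header-less row whose
# text field contains the byte 0xE9 ("é" in latin-1), which is invalid UTF-8 here.
sample = b'"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","someone","caf\xe9 time"\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    with open(path, 'wb') as f:
        f.write(sample)

    # Reading as UTF-8 fails with the same kind of error as in the question.
    try:
        pd.read_csv(path, header=None, names=cols, encoding='utf-8')
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # latin-1 maps every possible byte to a character, so this read succeeds.
    # header=None plus names=cols keeps the first row as data, not as a header.
    df = pd.read_csv(path, header=None, names=cols, encoding='latin-1')

print(utf8_failed)
print(df.loc[0, 'text'])
```

For the real file, the same call would presumably be pd.read_csv(os.path.join(BASE_DIR, 'Sentiment140.csv'), header=None, names=cols, encoding='latin-1'). One caveat on question 3: because latin-1 never raises a decode error, a file that is actually in some other encoding would still load, just with some characters shown incorrectly, so it is worth spot-checking a few non-ASCII rows.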