Problem description
I am trying to read a dataset into a DataFrame called df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback; this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
Decode it as Windows codepage 1252 instead, where 0x92 is a curly right single quote, ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
sep=";", encoding='cp1252')
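The difference between the two codecs can be checked on the offending bytes alone, without pandas or a network connection. The snippet below is a minimal sketch that uses only the byte string quoted above; the variable names are illustrative.

```python
# The bytes quoted above, containing the single non-ASCII byte 0x92.
raw = b"Korea, Dem. People\x92s Rep."

utf8_ok = True
error_position = None
try:
    # Strict UTF-8 decoding fails: 0x92 is a continuation byte
    # and cannot start a character.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_ok = False
    error_position = exc.start  # index of the offending byte

# cp1252 maps 0x92 to U+2019 (RIGHT SINGLE QUOTATION MARK).
decoded = raw.decode("cp1252")

print(utf8_ok, error_position, decoded)
```

The reported position matches the "position 18" in the traceback, which confirms that this is the byte pandas stumbled over.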
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
I note, however, that Pandas seems to take the HTTP headers at face value too and produces mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data:
>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
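The encode/decode round trip shown above can be wrapped in a small helper for repairing values that were already loaded. This is only a sketch: `fix_mojibake` is a hypothetical name, and it handles just this one kind of damage (UTF-8 bytes that were decoded as cp1252), leaving any other string untouched.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as cp1252.

    Hypothetical helper for illustration; strings that are not
    mojibake of this kind are returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The re-coded form: U+2019 stored as the three characters 'â', '€', '™'.
garbled = "Korea, Dem. People\u00e2\u20ac\u2122s Rep."
repaired = fix_mojibake(garbled)

# Already-correct and plain-ASCII strings pass through untouched.
unchanged = fix_mojibake("Korea, Dem. People\u2019s Rep.")
ascii_only = fix_mojibake("Albania")

print(repaired)
```

Applied to a whole column, something like df1[' '].map(fix_mojibake) should work, although avoiding the re-coding at load time is the cleaner fix.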
This is a known bug in Pandas. You can work around it by using urllib.request to load the URL and passing the response to pd.read_csv() instead:
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
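A related approach, sketched here with the standard library only, is to take the raw bytes yourself and decode them explicitly, so that nothing is guessed from HTTP headers. The helper name `decode_bytes`, the fallback order, and the sample bytes are assumptions for illustration. The sample reuses the Afghanistan figures from the demo above, while the ".." values in the Korea row are placeholders, not data from the file.

```python
import csv
import io


def decode_bytes(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each encoding in order and return the first successful decode.

    Hypothetical helper; the fallback order is an assumption.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Illustrative bytes in the same shape as the file: ';'-separated,
# with a single 0x92 byte in one country name.
sample = (
    b" ;2000;2001\n"
    b"Afghanistan;55.1;55.5\n"
    b"Korea, Dem. People\x92s Rep.;..;..\n"
)

text = decode_bytes(sample)  # UTF-8 fails, so cp1252 is used
rows = list(csv.reader(io.StringIO(text), delimiter=";"))

print(rows[2][0])
```

With real data, the bytes would come from urllib.request.urlopen(url).read(), and the decoded text could be wrapped in io.StringIO and passed to pd.read_csv(..., sep=";") instead of the csv module used here.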