Problem description
I am trying to read Twitter data from a JSON file using Python 2.7.12.
The code I used is:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def get_tweets_from_file(file_name):
    tweets = []
    with open(file_name, 'rw') as twitter_file:
        for line in twitter_file:
            if line != '\r\n':
                line = line.encode('ascii', 'ignore')
                tweet = json.loads(line)
                if u'info' not in tweet.keys():
                    tweets.append(tweet)
    return tweets
The result I got:
Traceback (most recent call last):
  File "twitter_project.py", line 100, in <module>
    main()
  File "twitter_project.py", line 95, in main
    tweets = get_tweets_from_dir(src_dir, dest_dir)
  File "twitter_project.py", line 59, in get_tweets_from_dir
    new_tweets = get_tweets_from_file(file_name)
  File "twitter_project.py", line 71, in get_tweets_from_file
    line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers to similar questions and came up with this code, and it worked last time. I have no clue why it isn't working now. I would appreciate any help!
Recommended answer
It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information.
The error happens because line is a byte string (a str in Python 2) and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert the byte string to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is neither valid ASCII nor a valid UTF-8 start byte, so the conversion fails before the 'ignore' error handler is ever applied.
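To make the hidden step visible, here is a minimal sketch that spells out what Python 2 effectively does when encode() is called on a byte string. The helper name encode_like_python2 and the sample bytes are made up for illustration; they are not part of the original code.

```python
# Illustrative sketch: the implicit decode that precedes encode() on a
# Python 2 byte string, written out explicitly.

def encode_like_python2(raw_bytes, default_encoding):
    """Decode with the default encoding first, then encode to ASCII."""
    text = raw_bytes.decode(default_encoding)  # the implicit step
    return text.encode('ascii', 'ignore')      # the step the caller asked for

sample = b'price: \x80 5'  # hypothetical line containing the byte 0x80

for default_encoding in ('ascii', 'utf-8'):
    try:
        encode_like_python2(sample, default_encoding)
    except UnicodeDecodeError as exc:
        # The decode step fails, so 'ignore' never gets a chance to run.
        print('%s: %s' % (default_encoding, exc))
```

Both default encodings raise UnicodeDecodeError here, which is why changing the default encoding does not make the problem go away.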
0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
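A quick check of that claim, as a small sketch. The latin-1 comparison is an addition for contrast and is not part of the original answer; it shows that the same byte means something different under another single-byte encoding.

```python
raw = b'\x80'

as_cp1252 = raw.decode('cp1252')   # U+20AC, the euro sign
as_latin1 = raw.decode('latin-1')  # U+0080, an invisible control character

# 'windows-1252' is an alias for the same codec in Python.
assert raw.decode('windows-1252') == as_cp1252

print(repr(as_cp1252), repr(as_latin1))
```

Because more than one encoding accepts this byte, decoding without an error does not prove that the chosen encoding is the right one.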
The trick here is to know the encoding of your data all the way through your code. At the moment, you're leaving too much to chance. Unicode string types are a handy Python feature: they let you decode encoded strings once and then forget about the encoding until you need to write or transmit the data.
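One way to picture this is to decode at the point where bytes enter the program, work only with Unicode text in the middle, and encode again at the point where data leaves. The sketch below assumes the input is windows-1252 and the output should be UTF-8; the function name reencode_line is hypothetical.

```python
def reencode_line(raw_bytes, source_encoding='cp1252', target_encoding='utf-8'):
    """Decode on the way in, process as text, encode on the way out."""
    text = raw_bytes.decode(source_encoding)  # bytes -> Unicode at the input edge
    text = text.strip()                       # all processing happens on text
    return text.encode(target_encoding)       # Unicode -> bytes at the output edge

utf8_bytes = reencode_line(b'\x80 5\r\n')
print(repr(utf8_bytes))
```

The middle of the function never needs to know which encoding the bytes arrived in.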
Use the io module to open the file in text mode and decode it as it is read, with no more .decode() calls! You need to make sure the encoding of your incoming data is consistent. You can either re-encode the data externally or change the encoding in your script. Here I've set the encoding to windows-1252.
import io
import json

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)
The io module also provides universal newlines. This means \r\n is detected as a line ending and translated to \n, so you don't have to watch for it.
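Putting the pieces together, the original function might be rewritten along these lines. This is a sketch, not a definitive fix: the encoding parameter and its windows-1252 default are assumptions that must match how the files were actually written. Blank lines are still skipped explicitly, because after newline translation they arrive as '\n' and json.loads() would reject them.

```python
import io
import json

def get_tweets_from_file(file_name, encoding='windows-1252'):
    """Return the tweets in file_name, expecting one JSON object per line."""
    tweets = []
    with io.open(file_name, 'r', encoding=encoding) as twitter_file:
        for line in twitter_file:
            if not line.strip():
                continue  # blank line: nothing for json.loads() to parse
            tweet = json.loads(line)
            if u'info' not in tweet:
                tweets.append(tweet)
    return tweets
```

There is no call to sys.setdefaultencoding() and no encode('ascii', 'ignore') here, so non-ASCII characters in the tweets are preserved instead of being silently dropped.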