

I'm fairly new to Python and programming in general. I have done a few tutorials and am about 2/3 of the way through a pretty good book. That said, I've been trying to get more comfortable with Python and programming by just trying out things in the std lib.

I recently ran into a weird quirk that I'm sure is the result of my own incorrect or un-"pythonic" use of the urllib module (with Python 3.2.2):

import urllib.request

# The URL must be a quoted string that includes the scheme
HTML_source = urllib.request.urlopen('http://www.somelink.com').read()



When this bit is run through the active interpreter, it returns the HTML source of somelink, but prefixes it with b', for example:

b'<HTML>\r\n<HEAD> (etc). . . .


If I split the string into a list by whitespace, it prefixes every item with the b'.

I'm not really trying to accomplish anything specific, just trying to familiarize myself with the std lib. I would like to know why this b' is getting prefixed.


Also, a bonus question: is there a better way to get HTML source WITHOUT using a third-party module? I know all that jazz about not reinventing the wheel and whatnot, but I'm trying to learn by "building my own tools".



The "b" prefix means that the type is bytes, not str. To convert the bytes into text, use the decode method and name the appropriate encoding. The encoding is often found in the "Content-Type" header:

>>> import urllib.request
>>> u = urllib.request.urlopen('http://cnn.com')
>>> u.getheader('Content-Type')
'text/html; charset=UTF-8'
>>> html = u.read().decode('utf-8')
>>> type(html)
<class 'str'>
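
The same distinction explains the split behavior from the question: splitting a bytes object yields a list of bytes objects, so each item is displayed with its own b' prefix. A minimal sketch, using a made-up byte string (the names raw, byte_words and text_words are just for illustration) rather than a real page:

```python
# Hypothetical sample standing in for the result of urlopen(...).read()
raw = b'<HTML>\r\n<HEAD> <TITLE>Hi</TITLE>'

byte_words = raw.split()                  # still bytes: every item shows b'...'
text_words = raw.decode('utf-8').split()  # decode first to get plain str items

print(byte_words)
print(text_words)
```

Decoding once, right after read(), is usually simpler than decoding each piece later.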

If you don't find the encoding in the headers, try utf-8 as a default.
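
One way to put the header lookup and the utf-8 fallback together, using only the std lib, might look like the sketch below. The helper names decode_body and fetch_html are made up for this example, and errors='replace' is an assumption on my part to keep odd bytes from raising an exception; drop it if you would rather see the error.

```python
import urllib.request
from email.message import Message


def decode_body(raw, content_type, default='utf-8'):
    """Decode raw bytes using the charset named in a Content-Type value."""
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or default
    return raw.decode(charset, errors='replace')


def fetch_html(url):
    """Fetch url and return its body as str (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return decode_body(response.read(), response.getheader('Content-Type'))


# Offline checks with hand-written header values
print(decode_body(b'<p>hi</p>', 'text/html; charset=ISO-8859-1'))
print(decode_body(b'<p>hi</p>', None))
```

Calling fetch_html('http://cnn.com') should then give a str directly. Note that a server can announce a charset name Python does not recognize, in which case decode raises LookupError; this sketch does not handle that case.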


08-27 08:15