python - Beautiful Soup crashes when special characters such as "&" and "<" appear

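I am trying to scrape an Atom-based RSS feed with Beautiful Soup, but it is proving difficult. Capturing the data works fine until an <item> comes up that breaks the code and crashes the script. Such <item>s always contain entities like "&lt;" (which Firefox highlights in orange) or "&amp;", while the ones without them work fine. I have tried many things, such as BeautifulStoneSoup, stripping the special characters with regular expressions, and setting the "xml" argument, but nothing has worked; usually they just produce warnings about being deprecated in BS4.

Why do these characters appear, and how can I handle them effectively?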

Here is the page I am trying to scrape:
http://www.thestar.com/feeds.articles.news.gta.rss

Here is my code:

# Python 2 (urllib2, print statements)
import urllib2
from bs4 import BeautifulSoup

news_url = "http://www.thestar.com/feeds.articles.news.gta.rss" # Toronto Star RSS Feed

try:
    news_rss = urllib2.urlopen(news_url)
    news = news_rss.read()
    news_rss.close()
    soup = BeautifulSoup(news)
except Exception:
    raise SystemExit("error")    # stop here if the feed cannot be fetched or parsed

titles = soup.findAll('title')
# I want the url without the <link> tags
links = [link.text for link in soup.findAll('link')]

news_stuff = []
for i, item in enumerate(titles):
    if item.text == "TORONTO STAR | NEWS | GTA":    # These have <title> tags and I don't want them; just skip 'em.
        continue
    news_stuff.append((item.text, links[i]))    # Here's a news story.  Grab it.

for thing in news_stuff:
    print '<a href="'
    print thing[1]
    print '"target="_blank">'
    print thing[0]
    print '</a><br/>'

Best answer

I'm not sure which problem you are referring to, but when I ran your code I got this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 54: ordinal not in range(128)
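
The error can be reproduced without the feed. Below is a minimal sketch, assuming the title contains a left single quotation mark (U+2018), as the traceback suggests; the sample string `title` is made up for illustration and is not taken from the real feed. In Python 2, printing a unicode string to an output with no declared encoding presumably falls back to the ASCII codec, which is where the failure comes from.

```python
# Hypothetical title containing curly quotes U+2018/U+2019 (not from the real feed).
title = u"Mayor calls plan \u2018reckless\u2019"

# Encoding to ASCII fails, because U+2018 has no ASCII representation.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields bytes that can be written out safely.
utf8_bytes = title.encode("utf-8")
```

Here `ascii_ok` ends up `False`, while `utf8_bytes` holds the UTF-8 encoded title.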

To fix it, I simply added an encoding step:

for thing in news_stuff:
    print '<a href="'
    print thing[1]
    print '"target="_blank">'
    print thing[0].encode("utf-8")    # encode the unicode title explicitly before printing
    print '</a><br/>'

After that, the script runs without any errors.
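
As for why "&lt;" and "&amp;" show up at all: they are XML entity references, the escaped forms of "<" and "&", which a feed has to use when those characters occur in text. An XML parser decodes them back into the plain characters. The sketch below illustrates this with the standard library's xml.etree.ElementTree rather than Beautiful Soup, on a made-up two-item feed; the sample titles, the example.com URLs, and the helper name `extract_items` are assumptions for illustration, not taken from the Toronto Star feed.

```python
import xml.etree.ElementTree as ET

# Made-up RSS fragment; the entity references mimic the ones described in the question.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>EXAMPLE FEED</title>
<link>http://example.com/</link>
<item><title>Council &amp; mayor clash</title><link>http://example.com/a</link></item>
<item><title>Turnout &lt; 40%</title><link>http://example.com/b</link></item>
</channel></rss>"""


def extract_items(feed_text):
    """Return (title, link) pairs for each <item>, with entities decoded."""
    root = ET.fromstring(feed_text)
    pairs = []
    # Visit only <item> elements, so the channel-level <title> is never seen.
    for item in root.iter("item"):
        pairs.append((item.findtext("title"), item.findtext("link")))
    return pairs


items = extract_items(SAMPLE_FEED)
```

Because only <item> elements are visited, the feed's own <title> does not have to be filtered out by name, and the titles in `items` contain "&" and "<" as ordinary characters.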

Regarding python - Beautiful Soup crashes when special characters such as "&" and "<" appear, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19229389/