Python: handling accented characters when scraping data from a website

Question

I'm Nicola, a new Python user with no real background in computer programming, so I'd really appreciate some help with a problem I have. I wrote some code to scrape data from this webpage:

http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02

Basically, the goal of my code is to scrape the data from all the tables on the page and write them to a text file. Here is my code:

#!/usr/bin/env python


from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)


outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

Everything would work fine, except that the first column of some tables on that page contains words with accented characters. When I run the code, I get the following:

Traceback (most recent call last):
File "modena2.py", line 158, in <module>
  extract(soup1)
File "modena2.py", line 98, in extract
  print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)

I know the problem is with the encoding of the accented characters. I tried to find a solution, but it really goes beyond my knowledge. Thanks in advance to everybody who is going to help; I really appreciate it! And sorry if the question is too basic but, as I said, I'm just getting started with Python and I'm learning everything by myself.

Thanks!
Nicola

Recommended answer

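The issue is printing Unicode text to a byte-oriented file. In Python 2, a file opened with the built-in open() holds bytes, so the text is implicitly encoded with the ASCII codec: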

>>> print >>open('e0.txt', 'wb'), u'\xe0'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)
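
The failure does not depend on the file itself: it comes from the implicit ASCII encoding step. As a minimal sketch, the same error can be triggered by doing that encoding explicitly, which behaves the same way on Python 2 and Python 3. The variable names here are illustrative only.

```python
text = u'\xe0'  # LATIN SMALL LETTER A WITH GRAVE

# Encoding with ASCII fails: code point 0xE0 is outside range(128).
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding with UTF-8 succeeds and yields a two-byte sequence.
utf8_bytes = text.encode('utf-8')

print(repr(ascii_error))
print(repr(utf8_bytes))
```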

To fix it, either encode the Unicode text into bytes yourself (u'\xe0'.encode('utf-8')) or open the file in text mode with an explicit encoding:

#!/usr/bin/env python
from __future__ import print_function
import io

# Open for writing ('w') in text mode; io.open encodes the text as UTF-8.
with io.open('e0.utf8.txt', 'w', encoding='utf-8') as file:
    print(u'\xe0', file=file)

08-29 02:42