BeautifulSoup behaves differently on an Amazon EC2 machine

Question

I run the following script:

from bs4 import BeautifulSoup
import urllib2
import sys

print sys.version

url = 'https://www.google.com/finance'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

trends_tag = soup.find('div', {'id': 'topmovers'})

tags = trends_tag.find_all('td', 'change chg')
print len(tags)

tag = tags[0]
print 'Tag: ' + tag.text

On my computer, the output is:

2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%

On the EC2 machine, the output is:

2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
12.18B


CLX

The Clorox Co
7.35%
11.67B


THOR

Thoratec Corporation
6.12%
1.47B


FOE

Ferro Corporation
6.03%
1.17B


NORD

Nord Anglia Education Inc
5.88%
1.70B


LosersChange
Mkt Cap



CRR

CARBO Ceramics Inc.
-16.10%
1.95B


CMCT

CIM Commercial Trust Corp
-10.54%
1.84B


HLF

Herbalife Ltd.
-10.31%
4.11B


INVN

InvenSense Inc
-10.10%
2.08B


TRS

TriMas Corp
-9.99%
1.34B

I've updated both machines to the same Python version. The installed packages differ, though. My machine:

>pip freeze
PIL==1.1.7
beautifulsoup4==4.3.2
colorama==0.3.1
cssselect==0.9.1
frida==1.6.0
lxml==3.4.0
newspaper==0.0.7
numpy==1.8.1
pefile==1.2.10-139
pudb==2013.5.1
pygments==1.6
requests==2.4.1
scikit-learn==0.15-git
urwid==1.2.0
xlrd==0.9.2
xlwt==0.7.5

The EC2 machine:

>pip freeze
beautifulsoup4==4.3.2

It seems that find_all returns a tag that is larger than it should be. Also, when I print tags[0] I get:

My machine:

<td class="change chg">33.24%
</td>

On the EC2 machine:

<td class="change chg">33.24%
<td class="mktCap">12.18B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CLX&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="CLX">CLX</a>
<td class="name">
<a href="/finance?q=NYSE:CLX&amp;ei=lkwhVJDfJKjeiALmvYHACA">The Clorox Co</a>
<td class="change chg">7.35%
<td class="mktCap">11.67B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:THOR&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="THOR">THOR
</a>
<td class="name">
<a href="/finance?q=NASDAQ:THOR&amp;ei=lkwhVJDfJKjeiALmvYHACA">Thoratec Corporat
ion</a>
<td class="change chg">6.12%
<td class="mktCap">1.47B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:FOE&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="FOE">FOE</a>
<td class="name">
<a href="/finance?q=NYSE:FOE&amp;ei=lkwhVJDfJKjeiALmvYHACA">Ferro Corporation</a
>
<td class="change chg">6.03%
<td class="mktCap">1.17B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:NORD&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="NORD">NORD</
a>
<td class="name">
<a href="/finance?q=NYSE:NORD&amp;ei=lkwhVJDfJKjeiALmvYHACA">Nord Anglia Educati
on Inc</a>
<td class="change chg">5.88%
<td class="mktCap">1.70B
<tr><td style="height:.7em">
<tr class="colHeader">
<td class="title chr">Losers<td class="change">Change
<td class="mktCap">Mkt Cap
</td></td></td></tr>
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CRR&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="CRR">CRR</a>
<td class="name">
<a href="/finance?q=NYSE:CRR&amp;ei=lkwhVJDfJKjeiALmvYHACA">CARBO Ceramics Inc.<
/a>
<td class="change chr">-16.10%
<td class="mktCap">1.95B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:CMCT&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="CMCT">CMCT
</a>
<td class="name">
<a href="/finance?q=NASDAQ:CMCT&amp;ei=lkwhVJDfJKjeiALmvYHACA">CIM Commercial Tr
ust Corp</a>
<td class="change chr">-10.54%
<td class="mktCap">1.84B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:HLF&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="HLF">HLF</a>
<td class="name">
<a href="/finance?q=NYSE:HLF&amp;ei=lkwhVJDfJKjeiALmvYHACA">Herbalife Ltd.</a>
<td class="change chr">-10.31%
<td class="mktCap">4.11B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:INVN&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="INVN">INVN</
a>
<td class="name">
<a href="/finance?q=NYSE:INVN&amp;ei=lkwhVJDfJKjeiALmvYHACA">InvenSense Inc</a>
<td class="change chr">-10.10%
<td class="mktCap">2.08B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:TRS&amp;ei=lkwhVJDfJKjeiALmvYHACA" title="TRS">TRS</a
>
<td class="name">
<a href="/finance?q=NASDAQ:TRS&amp;ei=lkwhVJDfJKjeiALmvYHACA">TriMas Corp</a>
<td class="change chr">-9.99%
<td class="mktCap">1.34B
<tr><td style="height:.7em">
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td></td></td></tr></td></td></td></td></tr></td></tr></td></td></td>
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td>

Notice the run of </td></tr> closing tags at the end, as if the parser merged the branches for some reason.

What can cause such a difference?

Sorry for the long question.

Recommended answer

The difference is lxml. When no parser is named, BeautifulSoup uses lxml if it is installed and falls back to the standard library HTMLParser module when it is not. Your machine has lxml==3.4.0 installed; the pip freeze output from the EC2 machine lists only beautifulsoup4, so that machine falls back to HTMLParser.
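To see which parser a given machine is likely to pick, one option is to check which parser packages can be imported. The sketch below is only an approximation: it follows the preference order described in the BeautifulSoup documentation (lxml, then html5lib, then the built-in parser) and does not call into BeautifulSoup itself. The function name likely_default_parser and its is_installed parameter are made up for this illustration, and the code is written for Python 3.

```python
import importlib.util


def likely_default_parser(is_installed=None):
    """Guess which parser BeautifulSoup would pick when none is named.

    Assumption: the documented preference order lxml > html5lib > html.parser.
    `is_installed` is a hypothetical hook so the logic can be tried without
    changing the environment; by default it checks for importable modules.
    """
    if is_installed is None:
        def is_installed(name):
            return importlib.util.find_spec(name) is not None

    for module in ("lxml", "html5lib"):
        if is_installed(module):
            return module
    # Nothing better is installed: the standard library parser is used.
    return "html.parser"


print(likely_default_parser())
```

Running this on both machines should make the mismatch visible before any HTML is parsed.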

Your input HTML is malformed, and parsers are allowed to 'make the best of it' when presented with such HTML. lxml and HTMLParser take different approaches to repairing the HTML.
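The missing end tags can be made visible without building a tree at all. The following is a minimal sketch, assuming Python 3 (the module is named HTMLParser in Python 2); the TagBalance class, the unclosed helper and the SAMPLE fragment are hypothetical and only imitate the shape of the markup shown in the question.

```python
from html.parser import HTMLParser


class TagBalance(HTMLParser):
    """Count start and end tags per element name, without building a tree."""

    def __init__(self):
        super().__init__()
        self.opened = {}
        self.closed = {}

    def handle_starttag(self, tag, attrs):
        self.opened[tag] = self.opened.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.closed[tag] = self.closed.get(tag, 0) + 1


# Hypothetical fragment shaped like the question's markup: no </td>, no </tr>.
SAMPLE = (
    '<table>'
    '<tr><td class="change chg">33.24%<td class="mktCap">12.18B'
    '<tr><td class="change chg">7.35%<td class="mktCap">11.67B'
    '</table>'
)


def unclosed(html):
    """Return {tag: number of start tags that have no matching end tag}."""
    counter = TagBalance()
    counter.feed(html)
    counter.close()
    return {
        tag: count - counter.closed.get(tag, 0)
        for tag, count in counter.opened.items()
        if count > counter.closed.get(tag, 0)
    }


print(unclosed(SAMPLE))  # {'tr': 2, 'td': 4}
```

Every td and tr in the fragment is opened but never closed, so each parser has to decide for itself where a cell ends. That decision is where the two machines diverge.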

You can force BeautifulSoup to use a specific parser by naming it in the second argument when creating the BeautifulSoup() instance; see "Specifying a parser to use" in the BeautifulSoup documentation:

soup = BeautifulSoup(page, 'html.parser')
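As a small illustration of pinning the parser, the sketch below wraps the lookup from the question in a function that always names its parser. It assumes Python 3 with beautifulsoup4 installed; first_change_text and SAMPLE are hypothetical names, and the fragment only imitates the structure of the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the question's markup: no </td>, no </tr>.
SAMPLE = (
    '<table>'
    '<tr><td class="change chg">33.24%<td class="mktCap">12.18B'
    '<tr><td class="change chg">7.35%<td class="mktCap">11.67B'
    '</table>'
)


def first_change_text(html, parser="html.parser"):
    """Return the text of the first <td> whose class list includes 'chg'.

    The parser is always passed explicitly, so the resulting tree does not
    depend on which optional parser packages happen to be installed.
    """
    soup = BeautifulSoup(html, parser)
    cell = soup.find("td", class_="chg")
    return cell.get_text() if cell is not None else None


print(repr(first_change_text(SAMPLE)))
```

Passing 'html.parser' on both machines should give both the same tree, although for this page it would be the nested one seen on EC2. To get the result from the local machine everywhere, install lxml on the EC2 machine and pass 'lxml' instead.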

