Problem description
I run the following script:
from bs4 import BeautifulSoup
import urllib2
import sys
print sys.version
url = 'https://www.google.com/finance'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
trends_tag = soup.find('div', {'id': 'topmovers'})
tags = trends_tag.find_all('td', 'change chg')
print len(tags)
tag = tags[0]
print 'Tag: ' + tag.text
On my computer, the output is:
2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
On the EC2 machine, the output is:
2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
12.18B
CLX
The Clorox Co
7.35%
11.67B
THOR
Thoratec Corporation
6.12%
1.47B
FOE
Ferro Corporation
6.03%
1.17B
NORD
Nord Anglia Education Inc
5.88%
1.70B
LosersChange
Mkt Cap
CRR
CARBO Ceramics Inc.
-16.10%
1.95B
CMCT
CIM Commercial Trust Corp
-10.54%
1.84B
HLF
Herbalife Ltd.
-10.31%
4.11B
INVN
InvenSense Inc
-10.10%
2.08B
TRS
TriMas Corp
-9.99%
1.34B
I've updated both machines to the same Python version. The installed packages are a bit different, though. My machine:
>pip freeze
PIL==1.1.7
beautifulsoup4==4.3.2
colorama==0.3.1
cssselect==0.9.1
frida==1.6.0
lxml==3.4.0
newspaper==0.0.7
numpy==1.8.1
pefile==1.2.10-139
pudb==2013.5.1
pygments==1.6
requests==2.4.1
scikit-learn==0.15-git
urwid==1.2.0
xlrd==0.9.2
xlwt==0.7.5
The EC2 machine:
>pip freeze
beautifulsoup4==4.3.2
It seems that find_all returns a tag that is larger than it should be. Also, when I print tags[0], I get:
My machine:
<td class="change chg">33.24%
</td>
On the EC2 machine:
<td class="change chg">33.24%
<td class="mktCap">12.18B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CLX&ei=lkwhVJDfJKjeiALmvYHACA" title="CLX">CLX</a>
<td class="name">
<a href="/finance?q=NYSE:CLX&ei=lkwhVJDfJKjeiALmvYHACA">The Clorox Co</a>
<td class="change chg">7.35%
<td class="mktCap">11.67B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:THOR&ei=lkwhVJDfJKjeiALmvYHACA" title="THOR">THOR
</a>
<td class="name">
<a href="/finance?q=NASDAQ:THOR&ei=lkwhVJDfJKjeiALmvYHACA">Thoratec Corporat
ion</a>
<td class="change chg">6.12%
<td class="mktCap">1.47B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:FOE&ei=lkwhVJDfJKjeiALmvYHACA" title="FOE">FOE</a>
<td class="name">
<a href="/finance?q=NYSE:FOE&ei=lkwhVJDfJKjeiALmvYHACA">Ferro Corporation</a
>
<td class="change chg">6.03%
<td class="mktCap">1.17B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:NORD&ei=lkwhVJDfJKjeiALmvYHACA" title="NORD">NORD</
a>
<td class="name">
<a href="/finance?q=NYSE:NORD&ei=lkwhVJDfJKjeiALmvYHACA">Nord Anglia Educati
on Inc</a>
<td class="change chg">5.88%
<td class="mktCap">1.70B
<tr><td style="height:.7em">
<tr class="colHeader">
<td class="title chr">Losers<td class="change">Change
<td class="mktCap">Mkt Cap
</td></td></td></tr>
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CRR&ei=lkwhVJDfJKjeiALmvYHACA" title="CRR">CRR</a>
<td class="name">
<a href="/finance?q=NYSE:CRR&ei=lkwhVJDfJKjeiALmvYHACA">CARBO Ceramics Inc.<
/a>
<td class="change chr">-16.10%
<td class="mktCap">1.95B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:CMCT&ei=lkwhVJDfJKjeiALmvYHACA" title="CMCT">CMCT
</a>
<td class="name">
<a href="/finance?q=NASDAQ:CMCT&ei=lkwhVJDfJKjeiALmvYHACA">CIM Commercial Tr
ust Corp</a>
<td class="change chr">-10.54%
<td class="mktCap">1.84B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:HLF&ei=lkwhVJDfJKjeiALmvYHACA" title="HLF">HLF</a>
<td class="name">
<a href="/finance?q=NYSE:HLF&ei=lkwhVJDfJKjeiALmvYHACA">Herbalife Ltd.</a>
<td class="change chr">-10.31%
<td class="mktCap">4.11B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:INVN&ei=lkwhVJDfJKjeiALmvYHACA" title="INVN">INVN</
a>
<td class="name">
<a href="/finance?q=NYSE:INVN&ei=lkwhVJDfJKjeiALmvYHACA">InvenSense Inc</a>
<td class="change chr">-10.10%
<td class="mktCap">2.08B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:TRS&ei=lkwhVJDfJKjeiALmvYHACA" title="TRS">TRS</a
>
<td class="name">
<a href="/finance?q=NASDAQ:TRS&ei=lkwhVJDfJKjeiALmvYHACA">TriMas Corp</a>
<td class="change chr">-9.99%
<td class="mktCap">1.34B
<tr><td style="height:.7em">
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td></td></td></tr></td></td></td></td></tr></td></tr></td></td></td>
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td>
Notice the </td></tr> at the end, as if it merges the branches for some reason.
What can cause such a difference?
Sorry for the long question.
Recommended answer
The difference is lxml. BeautifulSoup uses lxml as the default parser when it is installed, with a fallback to the standard library HTMLParser module when it is not.
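One way to confirm this on each machine is to check which parser libraries can be imported. The sketch below is only a rough diagnostic: the helper names `is_importable` and `likely_default_parser` are made up for illustration, and it assumes the preference order described in the BeautifulSoup documentation (lxml, then html5lib, then the built-in parser). Going by the pip freeze listings in the question, it would presumably report lxml on the asker's machine and html.parser on the EC2 machine.

```python
# Report which parser BeautifulSoup would probably pick when none is named.
# Assumption: BeautifulSoup prefers lxml, then html5lib, then the built-in
# html.parser, as its documentation describes. This script does not ask
# BeautifulSoup itself; it only checks what is importable.


def is_importable(module_name):
    """Return True if module_name can be imported in this environment."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


def likely_default_parser():
    """Guess the parser BeautifulSoup would choose when none is named."""
    for module_name in ("lxml", "html5lib"):
        if is_importable(module_name):
            return module_name
    return "html.parser"


print("Likely default parser: " + likely_default_parser())
```

Running this on both machines and comparing the output should show whether the parser is what differs between them.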
Your input HTML is malformed, and parsers are allowed to 'make the best of it' when presented with such HTML. lxml and HTMLParser take different approaches to repairing it.
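To see why the results can differ, it may help to look at what the standard library parser reports for markup like that in the question. The following is a minimal sketch, not BeautifulSoup's own code: the `EventRecorder` class and the `snippet` string are made up for illustration. It shows that the built-in parser only reports the tags it actually finds and never supplies a missing </td>, so whatever builds the tree has to decide where each cell ends. That is consistent with the nested cells in the EC2 output, whereas lxml applies its own recovery rules and closes the cell.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the question


class EventRecorder(HTMLParser):
    """Record the start and end tag events reported by the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Two table cells with no closing </td>, modelled on the page in the question.
snippet = '<tr><td class="change chg">33.24%<td class="mktCap">12.18B</tr>'

recorder = EventRecorder()
recorder.feed(snippet)
recorder.close()

# No ("end", "td") event appears: the parser does not invent closing tags.
for event in recorder.events:
    print(event)
```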
You can force BeautifulSoup to use a specific parser by naming it in a second argument when creating the BeautifulSoup() instance; see "Specifying a parser to use" in the BeautifulSoup documentation:
soup = BeautifulSoup(page, 'html.parser')