I am completely puzzled by the behavior of the following HTML-scraping code, which gives different results in two different environments, and I need help finding the root cause of the discrepancy.
import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform
# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))
# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()
# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()
# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []
# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column)
# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)
Output on Machine 1:
WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf
Number of contigs identified is 630
Output on Machine 2:
WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13)
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf
Number of contigs identified is 462
The number of contigs counted is different. Note that the same code parses the same HTML (the MD5 sums match) yet yields different results in two environments that are not strikingly different from each other, which has turned into a production nightmare. Manual inspection confirms that the result returned on Machine 2 is incorrect, but I have so far been unable to explain why.
Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup?
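You are seeing the differences between the parsers that BeautifulSoup chooses among automatically for the "html" markup type you specified. Which parser gets picked depends on which modules are installed in the current Python environment: Beautiful Soup ranks lxml's parser as the best, then html5lib's, then Python's built-in parser. The parsers can build different trees from the same imperfect HTML, so the same document may yield different row counts.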
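One rough way to see which parser each machine would end up with is to check which parser packages can be imported there. The sketch below is only an illustration under stated assumptions, not Beautiful Soup's actual selection code: `installed_html_parsers` is a hypothetical helper name, and it merely tests whether the `lxml` and `html5lib` packages are importable, listing them in the ranking order described above.

```python
from __future__ import print_function

import importlib


def installed_html_parsers():
    """Return the Beautiful Soup parser names usable in this environment.

    Hypothetical helper: the list is ordered the way Beautiful Soup ranks
    parsers when none is named (lxml, then html5lib, then the built-in one).
    """
    parsers = []
    for name in ("lxml", "html5lib"):
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # optional third-party parser is not installed here
        parsers.append(name)
    parsers.append("html.parser")  # ships with Python, always available
    return parsers


if __name__ == "__main__":
    print("Parsers available here:", installed_html_parsers())
```

Running this on both machines and comparing the first entry of each list should indicate whether they default to different parsers.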
To get consistent behavior across platforms, name the parser explicitly:
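# Pick exactly one of the following and use it on every machine:
soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser
soup = BeautifulSoup(html, "html5lib")     # requires the html5lib package
soup = BeautifulSoup(html, "lxml")         # requires the lxml package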