python - BeautifulSoup "find" behaves inconsistently (bs4)

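I am scraping the NFL website for player statistics. I'm having trouble parsing the page and getting to the HTML table that holds the information I actually want. I have successfully downloaded the page and saved it to my working directory. For reference, the page I saved can be found here.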

# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print(result)

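At one point I ran this code and result printed the correct table I was looking for. Every other time, it contains nothing! I assume this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing, and I can't get html5lib to work (is it a parsing library?).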

Any help is appreciated!

Best Answer

First, you should read the contents of the file before passing them to BeautifulSoup.

soup = BeautifulSoup(open("1998.html").read())
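The one-liner above leaves the file handle open and lets bs4 choose a parser. When no parser is named, bs4 picks the best one installed (lxml, then html5lib, then html.parser), so the resulting tree may differ between environments. A minimal sketch that closes the file and names the parser explicitly follows; the helper name load_soup and the UTF-8 encoding are assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def load_soup(path, parser="html.parser"):
    """Parse a saved HTML file with an explicitly named parser.

    load_soup is a hypothetical helper; the utf-8 encoding is an
    assumption about how the page was saved.
    """
    # The with-statement closes the file once its contents are read.
    with open(path, encoding="utf-8") as fh:
        return BeautifulSoup(fh.read(), parser)
```

With the saved page from the question, this would be called as load_soup("1998.html").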

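Second, manually verify that the relevant table is present in the HTML by printing the contents to the screen. The .prettify() method makes the data easier to read.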

print(soup.prettify())
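Because the question reports different outcomes with lxml and html5lib, it may also help to check whether the parser choice is what makes the table disappear. The sketch below is one possible check, not something from the original answer: the helper name table_found_by_parser is hypothetical, and it records None for any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def table_found_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Report, per parser, whether a table with id="result" is found.

    Hypothetical helper: True/False when the parser is available,
    None when bs4 cannot find that parser on this machine.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            results[name] = None
            continue
        results[name] = soup.find("table", {"id": "result"}) is not None
    return results
```

If the values differ between parsers for the same saved page, the markup is probably being repaired differently by each parser, and naming one parser explicitly should make the result stable.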

Finally, if the element really is there, you can find it as follows:

table = soup.find('table',{'id':'result'})
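Note that find returns None when nothing matches, which is the "contains nothing" case from the question. A small sketch of guarding against that is shown here; the sample HTML and its rows are made up for illustration, and find_result_table is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page; the real page has more rows.
SAMPLE_HTML = """
<html><body>
  <table id="result">
    <tr><th>Player</th><th>Yds</th></tr>
    <tr><td>Player A</td><td>100</td></tr>
    <tr><td>Player B</td><td>90</td></tr>
  </table>
</body></html>
"""


def find_result_table(html, parser="html.parser"):
    """Return the table with id="result", or None if it is absent."""
    soup = BeautifulSoup(html, parser)
    return soup.find("table", {"id": "result"})


table = find_result_table(SAMPLE_HTML)
if table is None:
    # Nothing matched: inspect the HTML before going further.
    print("table with id='result' not found")
else:
    print("rows found:", len(table.find_all("tr")))
```

Checking for None first avoids an AttributeError on the later find_all call when the table is missing.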

The simple test script I wrote could not reproduce your results.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def test():
    # The URL of the page you're scraping.
    url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'

    # Make a request to the URL.
    conn = urlopen(url)

    # Read the contents of the response
    html = conn.read()

    # Close the connection.
    conn.close()

    # Create a BeautifulSoup object and find the table.
    soup = BeautifulSoup(html)
    table = soup.find('table',{'id':'result'})

    # Find all rows in the table.
    trs = table.find_all('tr')

    # Print to screen the number of rows found in the table.
    print(len(trs))

# Run the test.
test()

It prints 51 every time.

Regarding python - BeautifulSoup "find" behaves inconsistently (bs4), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31057586/