python - BeautifulSoup find_all doesn't find everything

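The following is an example of a page I'm trying to gather information from: https://www.hockey-reference.com/boxscores/201610130TBL.html. It's hard to tell, but there are actually 8 tables on it, because the Scoring summary and the Penalty summary use the same class name as the other tables.

I'm trying to access the tables with the following code, which I have modified a bit while trying to solve the problem.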

from bs4 import BeautifulSoup

# Read the saved copy of the box score page
with open("Detroit_vs_Tampa.txt") as file:
    data = file.read()

soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")

print(len(get_table))

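The output I get from this code is 6, which is clearly wrong. Looking further, I found that the tables it misses are the two under the Advanced Stats Report heading.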

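I should also point out that, because I thought the parser might be the problem, I tried both html.parser and lxml, fetching the page directly from the website (instead of the text file used in the sample code). So I don't think it's broken HTML.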

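I had a friend take a quick look, thinking it might be a small oversight on my part. He noticed that the site uses an old IE hack and puts a comment tag in front of the tables.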

I'm not 100% sure that's why it isn't working, but I've searched for this issue and found nothing at all. I'm hoping someone here can point me in the right direction.

Best answer

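The last tables are loaded by JS, but as you noticed, they are also embedded in the static HTML inside a comment tag. find_all skips them because a comment is a text node rather than parsed markup. You can get them by searching the bs4 object for Comment nodes and parsing the comment's contents separately.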

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.hockey-reference.com/boxscores/201610130TBL.html'
data = requests.get(url).text
soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")
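# Find the first comment node whose text mentions the table container class
# (string= is the current name of the older text= argument)
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'table_container' in text)
# Parse the comment's contents as HTML and collect the tables hidden inside it
get_table += BeautifulSoup(comment.string, 'lxml').find_all(class_="overthrow table_container")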
print(len(get_table))
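
If the hidden tables turn out to be spread over more than one comment, the same idea can be applied to every matching comment rather than only the first. The following is a minimal sketch, not a tested scraper for the real site: SAMPLE_HTML is a made-up miniature of the page, find_table_containers is a hypothetical helper name, and html.parser is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: one visible table container and
# two more that only exist inside HTML comments.
SAMPLE_HTML = """
<html><body>
<div class="overthrow table_container"><table id="scoring"></table></div>
<!--
<div class="overthrow table_container"><table id="adv_home"></table></div>
-->
<!--
<div class="overthrow table_container"><table id="adv_away"></table></div>
-->
<!-- an unrelated comment -->
</body></html>
"""


def find_table_containers(html, parser="html.parser"):
    """Return table containers from the live markup and from comments."""
    soup = BeautifulSoup(html, parser)
    containers = list(soup.find_all(class_="overthrow table_container"))
    # Comment is a NavigableString subclass, so string= can filter on it.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        if "table_container" not in comment:
            continue  # skip comments that do not hide a table
        # Re-parse the comment body as its own HTML fragment.
        inner = BeautifulSoup(str(comment), parser)
        containers.extend(inner.find_all(class_="overthrow table_container"))
    return containers


tables = find_table_containers(SAMPLE_HTML)
print(len(tables))
```

On this sample the count is 3: one visible container plus two recovered from comments. Whether the real page keeps its hidden tables in one comment or several is something to check against the actual HTML.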

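Alternatively, you can use selenium, which runs the page's JS in a real browser so the tables appear in the rendered source, but it is much heavier than urllib or requests.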

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/boxscores/201610130TBL.html'
driver = webdriver.Firefox()
driver.get(url)
data = driver.page_source
driver.quit()

soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")
print(len(get_table))