I am trying to use BeautifulSoup to get the odds for each match on the following site:
The goal is to end up with some kind of text file containing the following:
Match1, Team1, Odds for team1 winning, Team2, Odds for team2 winning
Match2, Team1, Odds for team1 winning, Team2, Odds for team2 winning
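For reference, here is a minimal sketch of writing lines in that format with Python's csv module, assuming the rows have already been scraped. The function name write_odds and the sample rows are hypothetical placeholders, not data from the site.

```python
import csv
import io


def write_odds(rows, out):
    """Write (match, team1, odds1, team2, odds2) tuples as comma-separated lines."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)


# Placeholder rows: the names and odds are made up for illustration only.
sample_rows = [
    ("Match1", "Team1", "1.50", "Team2", "2.40"),
    ("Match2", "Team1", "1.85", "Team2", "1.85"),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_odds(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with open("odds.txt", "w", newline="") as the out argument.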
I am new to BeautifulSoup, so things already go wrong at a very elementary level. My approach is to "walk" through the HTML tree until I arrive at the div tag that contains all the matches. This works well until I hit a div tag with class="sgd-wrapper"; there is a link below to a picture for clarification.
The following is my code, and neither m1 nor m2 works. Python just returns None.
from bs4 import BeautifulSoup as bs
import requests as res
#Load the webpage content
r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text
#Convert to a beautiful soup object
soup = bs(r,'lxml')
m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"}).find("div", attrs={"class": "sgd-wrapper"})
m2 = soup.find("div", attrs={"class": "sgd-wrapper"})
If I remove the last find in m1, or redefine m2 as follows, both return the empty oddset-nashville div printed after the code:
m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"})
m2 = soup.find("div", attrs={"id": "oddset-nashville"})
<div data-digital-portal-loader-url="https://assets.sb.danskespil.dk/front-end/digitalPortal.js?noCache=20201011001813" id="oddset-nashville"></div>
Can someone explain to me why this div with class="sgd-wrapper" is so special?
The problem lies in r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text
The Python requests library only sends your HTTP/HTTPS request to the server and returns the raw HTML. It does not load further resources such as images and scripts, so any element that is created or modified by JavaScript (for example, a script that creates an element, sets its class name, and inserts it into the DOM tree) is missing from the response.
As another example, if you GET main.html via requests, main.js is never loaded or executed, so the class of the div t1 will not be set to sgd-wrapper:
# main.html
<div id="t1"></div>
<script src="main.js"></script>
# in main.js (for example)
document.getElementById("t1").className = "sgd-wrapper";
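The effect can be reproduced offline. The sketch below parses the raw main.html text the way a scraper would receive it, using only the standard library's html.parser so it runs without extra packages (BeautifulSoup would see the same markup). The names RAW_HTML, DivCollector and has_wrapper are made up for this illustration.

```python
from html.parser import HTMLParser

# The raw HTML a plain GET would return for the main.html example.
# main.js is referenced but never executed by the parser.
RAW_HTML = '<div id="t1"></div>\n<script src="main.js"></script>'


class DivCollector(HTMLParser):
    """Collect the attributes of every <div> start tag."""

    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append(dict(attrs))


collector = DivCollector()
collector.feed(RAW_HTML)
collector.close()

# True only if some div in the raw HTML already carries the class.
has_wrapper = any(
    "sgd-wrapper" in (div.get("class") or "").split()
    for div in collector.divs
)
```

The div t1 is found, but it has no class attribute at all, which is why a search for class="sgd-wrapper" in the raw response comes back empty.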
What you need to do is use headless Chrome (for example, launch it with google-chrome --headless) and use the Chrome API to hook into the page-load events, then dump the complete rendered content.
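As a rough sketch of that idea, the snippet below takes a simpler route than hooking page-load events: it asks headless Chrome to print the rendered DOM with its --dump-dom flag and captures the output through subprocess. It assumes a Chrome binary named google-chrome is on the PATH; the function names build_dump_command and dump_rendered_html are hypothetical.

```python
import subprocess


def build_dump_command(url, chrome_binary="google-chrome"):
    """Build the command line that makes headless Chrome print the rendered DOM."""
    return [chrome_binary, "--headless", "--dump-dom", url]


def dump_rendered_html(url, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome and return the DOM it prints to stdout.

    Assumes the binary exists on this machine; raises CalledProcessError
    if Chrome exits with a non-zero status.
    """
    result = subprocess.run(
        build_dump_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

The returned string can then be passed to bs(html, 'lxml') in place of the requests response. Content that the page fetches late may still be missing from the dump, in which case a browser driver that can wait for a specific element is probably the better tool.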